We extracted all of the personal names, corporate bodies, family names, genre/form terms, geographic place names, and subjects used in the controlAccess section of more than 4,000 EAD files from Special Collections. We then de-duplicated the headings and counted how many times each one was used.
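The extraction and counting step can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: it assumes the EAD 2002 element names (controlaccess, persname, corpname, famname, genreform, geogname, subject), and the function names extract_headings and count_headings are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed EAD 2002 element names for the six heading types.
HEADING_TAGS = {"persname", "corpname", "famname", "genreform", "geogname", "subject"}


def local_name(tag):
    """Drop any XML namespace prefix and lowercase the element name."""
    return tag.rsplit("}", 1)[-1].lower()


def extract_headings(xml_text):
    """Yield (heading_type, heading_text) for headings inside controlaccess."""
    root = ET.fromstring(xml_text)

    def walk(element, inside):
        name = local_name(element.tag)
        inside = inside or name == "controlaccess"
        if inside and name in HEADING_TAGS:
            # Collapse runs of whitespace so line-wrapped headings compare equal.
            text = " ".join("".join(element.itertext()).split())
            if text:
                yield name, text
            return
        for child in element:
            yield from walk(child, inside)

    yield from walk(root, False)


def count_headings(documents):
    """De-duplicate headings across documents and count each one's uses."""
    counts = Counter()
    for xml_text in documents:
        counts.update(extract_headings(xml_text))
    return counts
```

Each key in the resulting Counter is one unique heading paired with its type, and the value is the number of times it was used. Headings that appear outside controlaccess (for example, a persname inside a unittitle) are skipped.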
The personal names and corporate bodies were then run through a script in OpenRefine to reconcile them against the Library of Congress Name Authority File (LCNAF). When a match was found, we recorded the authorized form of the name along with a URI (Uniform Resource Identifier) for the Linked Data version of the authority record. The LC heading was then compared to the original entry, and notes were made where the authorized form of the heading might need to change. (Note: Some false matches were made, so this list should not be considered authoritative. If you find a false match, please leave a comment in the Google Sheet so we can correct it.)
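The comparison between the original entry and the LC heading could look something like the sketch below. It is only an illustration of the idea: the normalization rules, the result labels, and the function names normalize and compare_heading are assumptions, not the logic of the OpenRefine script.

```python
import unicodedata


def normalize(heading):
    """Reduce a heading to a form that ignores case, spacing, and trailing punctuation."""
    text = unicodedata.normalize("NFKC", heading).casefold()
    text = " ".join(text.split())
    return text.rstrip(" .,;")


def compare_heading(original, authorized):
    """Classify how the original heading relates to the matched LC heading."""
    if not authorized:
        return "no match"
    if original == authorized:
        return "identical"
    if normalize(original) == normalize(authorized):
        # Differs only in case, whitespace, or trailing punctuation.
        return "minor difference"
    return "potential change"
```

For example, an original entry of "Smith, John" compared with a matched heading of "Smith, John, 1900-1980" would be flagged as a potential change, and a person would still need to confirm that the match itself is correct.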
The same reconciliation process will also be completed for Library of Congress Subject Headings (LCSH). Any subject taken from the ArchivesWest controlled vocabulary was matched against that list, and potential problems were noted.
The data is available in two spreadsheets. The first Google Sheet covers all headings in all EAD files, with a separate tab for each type of heading (corpname, persname, subject, etc.). The second spreadsheet contains the same data, separated by department within Special Collections (a, acc, accn, ms).