Utah's Foremost Platform for Undergraduate Research Presentation
2024 Abstracts

Creating a Surname Lexicon for Historical US Records

Authors: Spencer Timmerman
Mentors: Joseph Price
Institution: Brigham Young University

We develop a method for creating a lexicon of all correctly spelled surnames in historical US records. We focus specifically on the full-count 1850-1940 census records, which include over 10 million unique spellings in the surname field. We use three steps to create this lexicon. First, we use links across multiple census records for the same individuals to identify different spellings of the same surname. Second, we use data from a large genealogical website to help identify the correct surname for each person and convert this into training data. Third, we develop a machine-learning approach that uses the frequency of surnames across different record collections to identify a lexicon of correctly spelled surnames. Our final lexicon of correctly spelled surnames includes only 500,000 of the 10 million unique spellings found in US census records. We also provide a crosswalk that maps the majority of incorrect surnames to a unique surname in the lexicon.
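To make the idea of grouping linked spellings and building a crosswalk concrete, here is a minimal Python sketch on toy data. It is not the authors' implementation: the function name `build_crosswalk`, the sample spellings, and the counts are all hypothetical, and a simple "most frequent spelling wins" rule stands in for the machine-learning step, which the abstract does not specify.

```python
from collections import Counter, defaultdict


def build_crosswalk(linked_pairs, counts):
    """Group spellings joined by record links, then map every spelling
    in a group to that group's most frequent spelling.

    Assumption: frequency alone picks the canonical form. This is a
    stand-in heuristic, not the trained model described in the abstract.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Spellings linked for the same individual end up in one group.
    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for spelling in list(parent):
        groups[find(spelling)].append(spelling)

    crosswalk = {}
    for members in groups.values():
        # Highest count wins; ties are broken alphabetically.
        canonical = min(members, key=lambda s: (-counts.get(s, 0), s))
        for spelling in members:
            crosswalk[spelling] = canonical
    return crosswalk


# Hypothetical toy data: spelling pairs observed for the same person
# in two different censuses, plus overall spelling frequencies.
linked_pairs = [("Smith", "Smyth"), ("Smith", "Smiht"), ("Johnson", "Jonson")]
counts = Counter({"Smith": 950, "Smyth": 40, "Smiht": 3,
                  "Johnson": 700, "Jonson": 25})

crosswalk = build_crosswalk(linked_pairs, counts)
lexicon = sorted(set(crosswalk.values()))
print(lexicon)  # ['Johnson', 'Smith']
```

Note that this toy rule folds "Smyth" into "Smith" even though "Smyth" is a legitimate surname in its own right. A frequency-only heuristic cannot tell a rare correct surname from a misspelling, so a real pipeline would need something more discriminating, such as a model trained on labeled data.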