Author(s): Jacob Leavitt, Josh Christensen, Jake McCoy
Mentor(s): Joseph Price, Mark Clement
Institution: BYU
Digitizing 20th-century French census records offers a valuable resource for understanding demographic shifts, particularly within Paris during significant historical events such as the Thirty Years Crisis (including WWI, WWII, and the Great Depression) and the Industrial Revolution. These records provide insight into the migration patterns of Jewish populations during WWII, revealing how the war affected Jewish communities and broader population movements. They also help researchers analyze how Paris’s neighborhoods changed over time through development, decline, and gentrification. The data is crucial for studying the experiences of cultural, political, and religious minorities: it captures a pivotal period in European history and provides a foundation for future demographic and sociological research. Digitizing these records with deep learning models requires a training dataset for machine reading of handwritten French census entries. Traditionally, building such a dataset would require manually labeling thousands of images, a costly and time-consuming task. Instead, synthetic data generation is used to create a dataset of French words for training the model. Synthesizing labeled data reduces the need for labor-intensive labeling while still achieving meaningful training outcomes. For fields where higher accuracy is critical, active learning is employed: BYU Pathways students, many of whom are proficient in French, correct the model's errors. This approach not only improves the model’s accuracy but also offers students a valuable work opportunity. Initial results are encouraging for simpler fields, with birth year fields reaching 67% word accuracy and 87% character accuracy after training solely on synthetic data with transfer learning from the Iowa 1915 census.
More complex fields, however, reached only 39% word accuracy and 49% character accuracy after training on synthetic data alone. With active learning by BYU Pathways students, the model achieves nearly 100% accuracy across fields while requiring a fraction of the manual labeling that traditional methods demand. This approach underscores the potential of combining synthetic data training with traditional transfer learning and active learning to train high-accuracy models efficiently, enhancing historical research capabilities and creating robust tools for analyzing handwritten records.
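The abstract reports both word accuracy and character accuracy but does not define them. The sketch below shows one common way such metrics are computed. It assumes that word accuracy is the share of fields transcribed exactly and that character accuracy is one minus the character error rate (edit distance divided by the number of reference characters). The function names and the sample birth years are hypothetical, chosen only for illustration, and are not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of fields whose prediction matches the reference exactly
    (assumed definition, not confirmed by the abstract)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / len(references)


def character_accuracy(predictions: list[str], references: list[str]) -> float:
    """One minus the character error rate, floored at zero
    (assumed definition, not confirmed by the abstract)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return max(0.0, 1.0 - total_edits / total_chars)


# Illustrative data only: three birth-year fields, one with a single wrong digit.
references = ["1887", "1902", "1913"]
predictions = ["1887", "1908", "1913"]

print(f"word accuracy:      {word_accuracy(predictions, references):.3f}")
print(f"character accuracy: {character_accuracy(predictions, references):.3f}")
```

Under these definitions a single wrong character makes the whole field count as a word-level error, which is why character accuracy is typically higher than word accuracy for the same model output.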