Description
Free UK Genealogy will be launching a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members who executed the wills.
In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus are accessible to OCR. We should be able to use OCR text to seed a database by parsing the text for names, dates, occupations, and relationships. We should also be able to use OCR bounding box coordinates to associate regions of a scanned page with an entry for presentation to researchers or for correction by volunteers.
Sample data for this project:
- 1873 Probate G-I Source Images directory containing scanned images of the probate books.
- Raw OCR directory containing OCR output from `tesseract` against the images in the source images directory.
- Gold-standard CSV file containing manually-extracted entries, their bounding boxes, and the participants in each entry. This is the kind of output we're looking for, though we're interested in occupations as well.
- Derivative Images directory containing the source images with OCR bounding boxes drawn onto them, and the Ruby script used to create the derivative images.
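
To make the bounding-box idea concrete, the sketch below groups OCR words into lines and computes one enclosing box per line. It assumes the OCR is available in tesseract's TSV layout (columns `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`); the files in the raw OCR directory may be in a different format, and the function name `line_boxes` is hypothetical. Merging lines into whole entries is left out.

```python
import csv
import io


def line_boxes(tsv_text):
    """Group word rows by text line and return each line's text and union box.

    Boxes are (left, top, right, bottom) in image pixel coordinates.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}
    for row in reader:
        text = (row.get("text") or "").strip()
        # In tesseract's TSV output, level 5 rows are individual words.
        if row["level"] != "5" or not text:
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        left, top = int(row["left"]), int(row["top"])
        right = left + int(row["width"])
        bottom = top + int(row["height"])
        entry = lines.setdefault(
            key, {"words": [], "box": (left, top, right, bottom)}
        )
        entry["words"].append(text)
        l, t, r, b = entry["box"]
        # Grow the line's box to enclose this word.
        entry["box"] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
    return [
        {"text": " ".join(e["words"]), "box": e["box"]} for e in lines.values()
    ]
```

The same union step could be applied again over the lines belonging to one entry to get a region to show researchers or volunteers next to the parsed record.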