Probate Parsing #7

@benwbrum

Free UK Genealogy will be launching a new project to expose genealogical information from wills and probate books. These books record the date and location of people's deaths, their occupations, and often the same information about the family members who executed the wills.

In previous projects, all this material was transcribed manually by volunteers, as the source documents were handwritten. The probate books are different, however, in that they are printed and thus are accessible to OCR. We should be able to use OCR text to seed a database by parsing the text for names, dates, occupations, and relationships. We should also be able to use OCR bounding box coordinates to associate regions of a scanned page with an entry for presentation to researchers or for correction by volunteers.
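As a rough illustration of the parsing step, here is a minimal sketch that pulls a date and place of death out of one entry's OCR text. It assumes entries contain a phrase of the form "who died <day> <month> <year> at <place> was proved"; that wording, the `parse_death` helper, and the sample entry are all invented for illustration and would need checking against the real OCR output.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Assumed entry wording (hypothetical): "... who died 3 February 1873 at
# Sampletown was proved at ...". Real entries may differ, and OCR errors
# will break exact matches.
DEATH_RE = re.compile(
    r"who died (?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4})"
    r" at (?P<place>.+?) was proved"
)


def parse_death(entry_text):
    """Return {'died': date, 'place': str} for one entry, or None if no match."""
    # OCR output is line-broken, so collapse all whitespace runs first.
    text = " ".join(entry_text.split())
    match = DEATH_RE.search(text)
    if match is None or match.group("month") not in MONTHS:
        return None
    try:
        died = date(
            int(match.group("year")),
            MONTHS[match.group("month")],
            int(match.group("day")),
        )
    except ValueError:
        # For example "31 February", which OCR misreads can produce.
        return None
    return {"died": died, "place": match.group("place").strip()}


# Invented sample entry, not taken from the probate books.
SAMPLE = """The Will of John Example late of Sampletown Farmer who died
3 February 1873 at Sampletown was proved at the Principal Registry."""

result = parse_death(SAMPLE)
```

A strict regular expression like this will miss entries with OCR errors, so in practice it would probably be a first pass, with unmatched entries routed to volunteers for correction.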

Sample data for this project:

  • 1873 Probate G-I Source Images directory containing scanned images of the probate books.
  • Raw OCR directory containing OCR output from Tesseract run against the images in the source images directory.
  • Gold-standard CSV file containing manually extracted entries, their bounding boxes, and the participants in each entry. This is the kind of output we're looking for, though we're interested in occupations as well.
  • Derivative Images directory containing source images with the OCR bounding boxes drawn onto them, and the Ruby script used to create the derivative images.
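
To associate a page region with an entry, one simple approach is to take the union of the OCR boxes for the words belonging to that entry. The sketch below assumes each box has already been read out of the OCR output as a `(left, top, right, bottom)` tuple in pixel coordinates; the `union_box` helper and the sample numbers are hypothetical, and how words are assigned to entries is left open.

```python
from typing import Iterable, Tuple

# (left, top, right, bottom) in image pixel coordinates (assumed layout).
Box = Tuple[int, int, int, int]


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every box in `boxes`."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("an entry needs at least one word box")
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Invented word boxes standing in for the words of one entry.
word_boxes = [(10, 20, 50, 40), (45, 22, 120, 41), (12, 44, 80, 60)]
entry_box = union_box(word_boxes)
```

Entry boxes computed this way could then be compared against the boxes in the gold-standard CSV to check how well the segmentation matches the manual extraction.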
