Dataset

Jaxter2017 edited this page Feb 2, 2023 · 28 revisions

Dataset Preparation

The raw data source comprises 350 maths papers sat as a mock exam by a Year 11 cohort. The paper is the 2022 Edexcel GCSE Maths Paper 1, originally sat in the summer. I chose to focus on the question I judged to be most difficult for a model to mark accurately and provide feedback on, based on the number of marks it was worth, the variability of possible answers, and its requirement that all working out be shown.

I have taken a series of steps to prepare this raw handwritten data for model usage. See the examples below for Q3 of the Higher Paper:

[Image: dataframe_snip]
  1. Manually aligned each student's working so that it runs vertically, and rewrote certain working-out methods in a simpler form to aid the handwriting conversion model (Notability Conversion)
  2. Converted this to LaTeX using Mathpix (Rendered + Raw LaTeX)
  3. Programmatically cleaned the LaTeX into simple mathematical symbols (Cleaned Text)
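
The cleaning in step 3 could look roughly like the sketch below. This is only an illustration: the function name `clean_latex`, the replacement table, and the choice of output symbols are assumptions made for the example, not the rules actually used to build the Cleaned Text column.

```python
import re

# Hypothetical replacement table: the mapping actually used for the
# dataset is not documented on this page.
SYMBOLS = {
    r"\times": "×",
    r"\div": "÷",
    r"\cdot": "×",
    r"\leq": "≤",
    r"\geq": "≥",
    r"\neq": "≠",
    r"\pi": "π",
}

# Match only innermost groups (no nested braces) so nesting can be
# unwound by applying the patterns repeatedly.
FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")
SQRT = re.compile(r"\\sqrt\{([^{}]*)\}")


def clean_latex(raw: str) -> str:
    """Reduce a LaTeX string to plain mathematical symbols (assumed rules)."""
    text = raw
    # Drop math delimiters and bracket-sizing commands.
    for token in ("$", r"\left", r"\right", r"\(", r"\)", r"\[", r"\]"):
        text = text.replace(token, "")
    # Rewrite fractions and roots from the inside out until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = FRAC.sub(r"(\1)/(\2)", text)
        text = SQRT.sub(r"√(\1)", text)
    # Swap LaTeX commands for their plain symbols.
    for command, symbol in SYMBOLS.items():
        text = text.replace(command, symbol)
    # Turn braced exponents into bracketed ones, then drop leftover braces.
    text = re.sub(r"\^\{([^{}]*)\}", r"^(\1)", text)
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_latex(r"$\frac{3}{4} \times 8 = 6$"))
```

Fractions are written as `(a)/(b)` rather than `a/b` so that numerators and denominators containing several terms stay unambiguous after the braces are removed.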