In this project, I built a model that "generates" answers to questions about a given context, such as a PDF document.
I used OpenAI's API with an embedding model and a completion model.
Compared to fine-tuning an LLM for question answering (e.g., https://huggingface.co/sooolee/roberta-base-finetuned-squad-v1), this approach is fast to build, requires no fine-tuning, and can draw on much larger context data.
- "text-embedding-ada-002" is the embedding model that converts each DataFrame row into an embedding.
- "text-davinci-003" is the completion model that will generate answers based on the the best matches (cosign similarities) between a question and context.
🚧 The demo is on the way! 🚧
1. Convert a PDF file into a DataFrame that the embedding model can process. Break the entire document into sub-texts with a good amount of overlap between neighboring sub-texts, so that the information around each break point is not lost (sketch below).
   - Note: The embedding model accepts up to 8,192 tokens per request, so each sub-text can be at most that long. My sub-texts are much smaller.
2. Compute an embedding for each row of the DataFrame using the embedding model, returning a dictionary that maps each row index to its embedding vector (sketch below).
3. Compare the query against the document: compute the cosine similarity between the query's embedding and each row's embedding, then sort the rows from most to least similar (sketch below).
4. Construct the prompt from the sections with the highest similarities: the completion model needs a specific instruction for each question along with the context. The instruction also tells the model what to do when no match is found; the context is the sub-texts with the highest similarities (sketch below).
5. Use the completion model to get the answer! It takes the context (the PDF document), a query (your question), and the embeddings dictionary generated in Step 2 (sketch below).
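
Step 1 sketch. This is a minimal illustration of the overlapping chunking; the word-based splitting and the `chunk_size`/`overlap` values are assumptions for illustration, not the project's actual settings.

```python
import pandas as pd

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> pd.DataFrame:
    """Split text into overlapping word-based sub-texts, one per DataFrame row."""
    words = text.split()
    step = chunk_size - overlap  # advance so neighboring chunks share `overlap` words
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
    return pd.DataFrame({"text": chunks})
```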
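
Step 2 sketch, written against the legacy `openai` Python package (pre-v1.0); the one-request-per-row loop is an assumption for clarity, not necessarily how the project batches its calls.

```python
import openai
import pandas as pd

EMBEDDING_MODEL = "text-embedding-ada-002"

def compute_embeddings(df: pd.DataFrame) -> dict:
    """Map each DataFrame row index to its embedding vector."""
    return {
        idx: openai.Embedding.create(model=EMBEDDING_MODEL, input=row["text"])["data"][0]["embedding"]
        for idx, row in df.iterrows()
    }
```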
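
Step 3 sketch. Since "text-embedding-ada-002" returns vectors normalized to unit length, a dot product is equivalent to cosine similarity here.

```python
import numpy as np
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"

def rank_rows(query: str, embeddings: dict) -> list:
    """Return (similarity, row_index) pairs sorted from most to least similar."""
    q = openai.Embedding.create(model=EMBEDDING_MODEL, input=query)["data"][0]["embedding"]
    # ada-002 embeddings are unit length, so the dot product equals the cosine similarity
    return sorted(
        ((float(np.dot(q, emb)), idx) for idx, emb in embeddings.items()),
        reverse=True,
    )
```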
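
Step 4 sketch; the instruction wording and `top_n` are illustrative, not the project's exact prompt.

```python
import pandas as pd

def construct_prompt(question: str, df: pd.DataFrame, ranked: list, top_n: int = 3) -> str:
    """Join the top-matching sub-texts into a context block and wrap it in an instruction."""
    context = "\n\n".join(df.loc[idx, "text"] for _, idx in ranked[:top_n])
    return (
        "Answer the question using the context below. If the answer is not "
        'contained in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )
```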
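
Step 5 sketch; `max_tokens` and `temperature` are illustrative defaults.

```python
import openai

def answer_question(prompt: str) -> str:
    """Generate an answer with the completion model."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=300,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()
```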