🧩 Add Support for Ingesting Unstructured Data (Hacktoberfest 2025)
🎯 Goal
Enable ingestion from unstructured sources such as raw text, documents (PDF, DOCX), and URLs.
This will allow users to feed varied data formats directly into Memori’s memory system.
📋 Description
Currently, Memori supports structured data ingestion only.
To make the memory engine more versatile, we need to add a module that can handle unstructured inputs,
including plain text, document files, and website content.
Contributors can help by implementing or improving one or more of the following tasks:
- Add a function to extract text from URLs (using requests or BeautifulSoup).
- Add a parser for text-based documents (PDF/DOCX/TXT).
- Normalize extracted text and feed it into the existing memory ingestion pipeline.
- Write unit tests to validate ingestion and parsing results.
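As a starting point, the first three tasks could be sketched roughly as below. This is a minimal sketch, not Memori's actual API: the function names (extract_text_from_url, extract_text_from_pdf, normalize_text) and the timeout default are assumptions made for illustration, and the hand-off to the existing ingestion pipeline is left out because its interface is not described here.

```python
import re
import unicodedata


def extract_text_from_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a web page and return its visible text (hypothetical helper)."""
    # Third-party imports are local so this module still imports when the
    # optional dependencies (requests, beautifulsoup4) are not installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove elements whose contents are not readable prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator="\n")


def extract_text_from_pdf(path: str) -> str:
    """Concatenate the text of every page in a PDF (hypothetical helper)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to "" for pages where no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC, unify newlines, and collapse extra whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    kept = []
    for line in lines:
        # Keep non-empty lines, and at most one blank line between them.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

Keeping normalization separate from extraction means every new source type can reuse the same cleanup step before its text reaches the memory pipeline.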
✅ Acceptance Criteria
- Support for at least two unstructured data sources (e.g., PDF and URL).
- Code is clean, modular, and follows project structure and linting rules.
- Includes unit tests with valid input/output examples.
- Documentation updated (README or /docs section).
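For the unit-test criterion, a test with explicit input and expected output might look like the sketch below. The parser under test, extract_text_from_txt, is a hypothetical stand-in defined inline so the example is self-contained. A real contribution would import the project's own function instead and follow the project's test layout.

```python
import tempfile
import unittest
from pathlib import Path


def extract_text_from_txt(path: str) -> str:
    """Hypothetical parser under test: read a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8")


class ExtractTextFromTxtTests(unittest.TestCase):
    def setUp(self):
        # Each test gets its own temporary directory, removed afterwards.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.dir = Path(self._tmp.name)

    def test_returns_file_contents(self):
        sample = self.dir / "note.txt"
        sample.write_text("Memori stores memories.\n", encoding="utf-8")
        self.assertEqual(
            extract_text_from_txt(str(sample)),
            "Memori stores memories.\n",
        )

    def test_missing_file_raises(self):
        with self.assertRaises(FileNotFoundError):
            extract_text_from_txt(str(self.dir / "missing.txt"))
```

Saved as a test module, this can be run with python -m unittest. It needs no network access or fixture files, which keeps the tests fast and deterministic.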
💡 Tech Notes
- Language: Python
- Recommended libraries: requests, beautifulsoup4, pypdf
- Design goal: Keep the ingestion process modular for future extensions (e.g., images, audio).
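One possible way to meet the modularity goal is a small registry that maps file suffixes to extractor functions, so a new format can be added without touching the dispatch code. The sketch below is an assumption about how this could be structured, not the project's existing design: register_extractor, extract_text, and the _EXTRACTORS table are all hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict

Extractor = Callable[[str], str]

# Maps a lowercase file suffix (for example ".txt") to its extractor.
_EXTRACTORS: Dict[str, Extractor] = {}


def register_extractor(*suffixes: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers a function for one or more file suffixes."""
    def decorator(func: Extractor) -> Extractor:
        for suffix in suffixes:
            _EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register_extractor(".txt", ".md")
def _extract_plain_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def extract_text(source: str) -> str:
    """Dispatch to the extractor registered for the source's suffix."""
    suffix = Path(source).suffix.lower()
    try:
        extractor = _EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"No extractor registered for {suffix!r}") from None
    return extractor(source)
```

Under this layout, supporting a future format such as audio transcripts would mean adding one decorated function, while callers keep using extract_text unchanged.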
🤝 Hacktoberfest Details
- Labels: hacktoberfest, bug, help wanted
- This issue is part of Hacktoberfest 2025 — valid pull requests will be merged or labeled hacktoberfest-accepted.
- Please review the CONTRIBUTING.md for contribution guidelines.
- Follow our Code of Conduct to maintain a positive and inclusive environment.
⭐ Don’t forget to star the repo on GitHub. It really helps our community grow!