An AI-powered research assistant that learns your preferences and delivers personalized ArXiv paper recommendations
ArXiv PaperLens is a RAG (Retrieval-Augmented Generation) system that helps researchers discover and interact with academic papers. It personalizes recommendations by learning user preferences from implicit feedback (such as views) and explicit feedback (such as likes and dislikes).
- Intelligent Paper Discovery: Daily ArXiv crawling with ML-powered recommendations
- Multimodal RAG Chat: Converse with papers using text and images via Google Gemini Pro
- Advanced User Profiling: Dynamic embedding updates using exponential moving average algorithms
- Personalized Recommendations: Adaptive scoring system based on user interaction patterns
- Semantic Search: FAISS-powered vector similarity search across 100k+ papers
- Multi-Channel Notifications: Email digests and Telegram bot integration
- Modern UI: Responsive React frontend with real-time interactions
graph TB
subgraph "Data Layer"
A[ArXiv API] --> B[Daily Crawler Agent]
B --> C[PDF Processing]
C --> D[Text & Image Extraction]
D --> E[Vector Database<br/>FAISS Index]
end
subgraph "AI Layer"
F[Multimodal Embedder<br/>CLIP + BGE] --> G[User Embedding Service]
G --> H[Recommendation Engine<br/>Exponential Moving Average]
I[Gemini Pro RAG Agent] --> J[Multimodal Q&A]
end
subgraph "Application Layer"
K[Backend] --> L[React Frontend]
M[LangGraph Agents] --> N[Notification Service<br/>Email + Telegram]
end
E --> I
E --> H
G --> K
H --> K
J --> L
N --> L
The system employs an Exponential Moving Average (EMA) approach for updating user embeddings:
E_new = (1 - α) × E_current + α × E_weighted_papers
Where:
- α: Learning rate (default: 0.1)
- E_weighted_papers: Weighted average of paper embeddings based on interaction types
- Interaction weights: Like (+1.0), Bookmark (+0.8), Share (+0.6), View (+0.1), Dislike (-0.5), Delete (-0.9)
A temporal decay is also applied, scaling the embedding by decay_factor once for every 30 days since its last update:
E_decayed = E × decay_factor^(days_since_update / 30)
This ensures recent preferences have higher influence while preventing embedding staleness.
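The update and decay rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the default `decay_factor` of 0.95 is an assumed value (the text does not state one), and normalizing by the sum of absolute weights is one possible reading of "weighted average" when some weights are negative.

```python
from typing import Dict, List, Sequence, Tuple

# Interaction weights as listed in the text above.
INTERACTION_WEIGHTS: Dict[str, float] = {
    "like": 1.0, "bookmark": 0.8, "share": 0.6,
    "view": 0.1, "dislike": -0.5, "delete": -0.9,
}


def weighted_paper_embedding(
    interactions: Sequence[Tuple[str, Sequence[float]]]
) -> List[float]:
    """Combine (interaction_type, paper_embedding) pairs into one vector.

    Assumption: divide by the sum of absolute weights so that negative
    feedback pushes the result away from a paper without inflating its norm.
    """
    if not interactions:
        raise ValueError("at least one interaction is required")
    dim = len(interactions[0][1])
    total = [0.0] * dim
    norm = 0.0
    for kind, embedding in interactions:
        weight = INTERACTION_WEIGHTS[kind]
        norm += abs(weight)
        for i, value in enumerate(embedding):
            total[i] += weight * value
    return [t / norm for t in total]


def ema_update(
    current: Sequence[float], weighted: Sequence[float], alpha: float = 0.1
) -> List[float]:
    """E_new = (1 - alpha) * E_current + alpha * E_weighted_papers."""
    return [(1 - alpha) * c + alpha * w for c, w in zip(current, weighted)]


def apply_decay(
    embedding: Sequence[float], days_since_update: float,
    decay_factor: float = 0.95,  # assumed default, not given in the text
) -> List[float]:
    """E_decayed = E * decay_factor ** (days_since_update / 30)."""
    scale = decay_factor ** (days_since_update / 30)
    return [scale * x for x in embedding]
```

With α = 0.1, a single like on a paper moves the user vector one tenth of the way toward that paper's embedding, so a profile shifts gradually rather than jumping on each click.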
Paper relevance is computed using cosine similarity:
relevance = (cosine_similarity(E_user, E_paper) + 1) / 2
Normalized to [0,1] range for intuitive scoring.
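As a rough sketch of the relevance formula, the snippet below computes cosine similarity in pure Python and maps it from [-1, 1] to [0, 1]. The function names and the choice to return a similarity of 0 for zero-length vectors are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as neutral
    return dot / (norm_a * norm_b)


def relevance(user: Sequence[float], paper: Sequence[float]) -> float:
    """relevance = (cosine_similarity(E_user, E_paper) + 1) / 2."""
    return (cosine_similarity(user, paper) + 1) / 2


def rank_papers(
    user: Sequence[float], papers: Dict[str, Sequence[float]]
) -> List[str]:
    """Return paper ids ordered from most to least relevant."""
    return sorted(papers, key=lambda pid: relevance(user, papers[pid]), reverse=True)
```

An identical direction scores 1.0, an orthogonal one 0.5, and an opposite one 0.0, which is what makes the shifted score easier to read than a raw cosine.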
- Python 3.9+
- Node.js 16+ (for React frontend)
- Google API Key (Gemini Pro)
- PostgreSQL (for metadata storage)
- Docker (optional deployment)
- Clone and Setup
git clone https://github.com/aymen-000/PaperLens
cd PaperLens
# Install Python dependencies
pip install -r requirements.txt
# or using uv (recommended)
uv sync
- Environment Configuration
# Create .env file
cat > .env << EOF
GOOGLE_API_KEY="your_key"
CRAWLER_AGENT_MODEL_ID="gemini-2.5-flash"
PROVIDER="langchain_google_genai.ChatGoogleGenerativeAI"
HUGGINGFACE_TOKEN_KEY="your_token"
JWT_SECRET_KEY="your_key"
EOF
- Database Initialization
python scripts/init_db.py
python scripts/seed_user.py
- Start Services
# Terminal 1: Backend API
source .venv/bin/activate
export PYTHONPATH=.
python3 backend/app/app.py
# Terminal 2: Frontend (separate terminal)
npm install
npm run dev
- Access Application
  - Frontend: http://localhost:3000
  - Health Check: http://localhost:8000/health
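Before starting the services, it can help to confirm that the `.env` file from the Environment Configuration step defines every key. The sketch below is a hypothetical helper, not part of the repository: it parses simple `KEY=value` lines itself rather than relying on any particular dotenv library, and the list of required keys is taken from the example `.env` shown in the setup steps.

```python
from typing import Dict, List, Sequence

# Keys taken from the example .env in the setup steps.
REQUIRED_KEYS = (
    "GOOGLE_API_KEY",
    "CRAWLER_AGENT_MODEL_ID",
    "PROVIDER",
    "HUGGINGFACE_TOKEN_KEY",
    "JWT_SECRET_KEY",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

A typical call would be `missing_keys(parse_env(open(".env").read()))`, which returns an empty list when the file is complete. Note that placeholder values such as `your_key` still count as present.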
- Flask: Lightweight Python web framework for the backend API
- LangChain: LLM orchestration and document processing
- LangGraph: Agent workflow management
- FAISS: Vector similarity search (Facebook AI)
- PostgreSQL: Metadata and user data storage
- SQLAlchemy: Database ORM
- Google Gemini Pro: Multimodal language model
- CLIP (OpenAI): Image-text understanding
- BGE Embeddings: Semantic text embeddings
- Sentence Transformers: Text encoding
- PyTorch: Deep learning framework
- React: Modern UI framework
- Tailwind CSS: Utility-first styling
- Axios: HTTP client
PaperLens/
├── agents/                          # AI Agent System
│   ├── system_agents/               # Core intelligent agents
│   │   ├── crawler.py               # ArXiv paper crawler
│   │   └── papers_rag.py            # Multimodal RAG system
│   ├── data/                        # Data processing & embeddings
│   │   ├── embedding.py             # User embedding service
│   │   ├── indexing.py              # Vector indexing
│   │   └── vector_db.py             # FAISS operations
│   ├── tools/                       # Agent utility tools
│   │   ├── crawler_tools.py         # PDF processing tools
│   │   └── rag_tools.py             # RAG helper functions
│   ├── prompts/                     # LLM prompt templates
│   └── config.py                    # Agent configurations
│
├── backend/                         # Flask Application
│   └── app/
│       ├── models/                  # Database models
│       │   ├── user.py              # User profiles
│       │   ├── paper.py             # Paper metadata
│       │   ├── user_embedding.py    # User preference vectors
│       │   └── user_feedback.py     # Interaction tracking
│       ├── routes/                  # API endpoints
│       │   ├── papers_api.py        # Paper CRUD operations
│       │   ├── papers_bot.py        # RAG chat endpoints
│       │   └── user.py              # User management
│       ├── services/                # Business logic
│       │   ├── db_service.py        # Database operations
│       │   └── handle_interaction.py # User feedback processing
│       └── database.py              # SQLAlchemy setup
│
├── frontend/                        # Next.js Application
│   ├── app/                         # Next.js App Router
│   │   ├── page.tsx                 # Dashboard homepage
│   │   ├── login/                   # Authentication pages
│   │   └── signup/
│   ├── components/                  # UI components
│   │   ├── paper-feed.tsx           # Paper recommendation feed
│   │   ├── rag-panel.tsx            # Chat interface
│   │   ├── paper-search.tsx         # Search functionality
│   │   ├── settings-page.tsx        # User preferences
│   │   └── ui/                      # Shadcn/UI components
│   └── lib/                         # Utilities & API clients
│
├── faiss_index/                     # Vector Database
│   ├── text_index.faiss             # Text embeddings (BGE)
│   ├── image_index.faiss            # Image embeddings (CLIP)
│   └── faiss_index/                 # Legacy unified index
│
├── storage/                         # File Storage
│   ├── raw/                         # Original PDF papers
│   ├── processed/                   # Extracted content
│   │   ├── images/                  # Figures & diagrams
│   │   └── paper_*/                 # Per-paper text & images
│   └── papers/                      # Downloaded PDFs
│
└── scripts/                         # Utility Scripts
    ├── init_db.py                   # Database initialization
    ├── seed_user.py                 # Sample user creation
    └── run_agents.py                # Agent orchestration
Daily Crawler Agent
- Purpose: Automated daily paper discovery
- Capabilities:
  - Fetches 50+ papers daily based on user categories
  - PDF text extraction using PyPDF2
  - Figure/diagram extraction using PIL
  - Metadata enrichment and storage
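To make the daily fetch concrete, here is a small sketch of how a crawler might build a request against the public arXiv API for a set of user categories. It only constructs the URL and performs no network call. The function name is hypothetical, and whether the project's crawler queries arXiv in exactly this way is an assumption; the query parameters themselves (`search_query`, `sortBy`, `sortOrder`, `start`, `max_results`) are those of the arXiv API.

```python
from typing import Sequence
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query(categories: Sequence[str], max_results: int = 50) -> str:
    """Build an arXiv API URL for the newest papers in the given categories."""
    if not categories:
        raise ValueError("at least one category is required")
    # Match any of the categories, e.g. "cat:cs.AI OR cat:cs.LG".
    search_query = " OR ".join(f"cat:{category}" for category in categories)
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",   # newest submissions first
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

For example, `build_arxiv_query(["cs.AI", "cs.LG"])` yields a URL whose response is an Atom feed that the crawler could then parse for PDF links and metadata.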
Multimodal RAG Agent
- Purpose: Intelligent Q&A over research papers
- Capabilities:
  - Text-based semantic search
  - Image understanding and analysis
  - Context-aware response generation
  - Source attribution and citation
User Embedding Service
- Purpose: Preference modeling and adaptation
- Capabilities:
  - Real-time embedding updates
  - Interaction pattern analysis
  - Cold-start problem handling
  - Temporal preference drift detection
Recommendation Engine
- Purpose: Personalized content delivery
- Capabilities:
  - Multi-factor scoring algorithms
  - Diversity-aware recommendations
  - Category-based filtering
  - Performance analytics
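The text does not say how diversity-aware recommendation is achieved. One common technique, shown here purely as an illustration and not as the project's method, is maximal marginal relevance (MMR): each pick trades off similarity to the user against similarity to papers already selected. The function names and the `trade_off` parameter are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(
    user: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int,
    trade_off: float = 0.7,  # 1.0 = pure relevance, 0.0 = pure diversity
) -> List[str]:
    """Greedily pick k paper ids, penalizing similarity to earlier picks."""
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def mmr_score(pid: str) -> float:
            relevance = _cosine(user, remaining[pid])
            # Redundancy is the highest similarity to any already chosen paper.
            redundancy = max(
                (_cosine(remaining[pid], candidates[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected
```

Lowering `trade_off` makes the feed skip near-duplicates of papers it has already chosen, at the cost of some relevance.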
# User category preferences with weights
CATEGORY_WEIGHTS = {
    "cs.AI": 0.2,    # Artificial Intelligence
    "cs.LG": 0.2,    # Machine Learning
    "cs.CV": 0.2,    # Computer Vision
    "stat.ML": 0.4,  # Machine Learning (Statistics)
}
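How these weights are consumed is not spelled out here. One plausible use, shown as an assumption, is to split a daily fetch budget across categories in proportion to their weights. The `allocate_quota` helper below is hypothetical and uses largest-remainder rounding so the per-category counts always add up to the budget.

```python
from typing import Dict

# Copied from the configuration above so the example is self-contained.
CATEGORY_WEIGHTS: Dict[str, float] = {
    "cs.AI": 0.2,
    "cs.LG": 0.2,
    "cs.CV": 0.2,
    "stat.ML": 0.4,
}


def allocate_quota(weights: Dict[str, float], total: int = 50) -> Dict[str, int]:
    """Split `total` papers across categories proportionally to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {cat: total * w / weight_sum for cat, w in weights.items()}
    quota = {cat: int(share) for cat, share in exact.items()}  # round down first
    leftover = total - sum(quota.values())
    # Hand the remaining slots to the categories with the largest remainders.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - quota[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        quota[cat] += 1
    return quota
```

With the weights above and a budget of 50, `stat.ML` receives twice as many slots as each of the other three categories.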
We welcome contributions!
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Make changes with tests
- Commit with conventional commits (feat:, fix:, docs:)
- Push and create Pull Request
- New embedding model integrations
- Advanced analytics dashboards
- Multi-language support
- Performance optimizations
- UI/UX improvements
- User feedback
- Notification channels
This project is licensed under the MIT License - see the LICENSE file for details.
- ArXiv for providing open access to research papers
- Google for Gemini Pro API access
- Hugging Face for transformer models and embeddings
- Facebook AI for FAISS vector search
- OpenAI for CLIP multimodal understanding
- Reinforcement Learning Recommender
  - Multi-Armed Bandit algorithms for exploration vs exploitation
  - Deep Q-Network (DQN) for long-term user engagement optimization
  - Contextual bandits considering user state, time, and reading patterns
  - A/B testing framework for recommendation strategy evaluation
- Rich Feedback Mechanisms
  - Star ratings and detailed paper reviews
  - Reading time tracking and attention heatmaps
  - Bookmark organization with custom tags and collections
  - Social features: following researchers, sharing reading lists
  - Citation network analysis for impact-based recommendations
- Advanced Analytics Dashboard
  - Personal research journey visualization
  - Topic evolution and trend analysis
  - Collaboration opportunity detection
  - Research gap identification using knowledge graphs
- Diversified Content Sources
  - LinkedIn Research Posts: Professional insights and industry research
  - Twitter/X Academic Threads: Real-time research discussions and preprints
  - Google Scholar: Citation networks and h-index tracking
  - ResearchGate: Social academic networking integration
  - Medium/Towards Data Science: Practical implementations and tutorials
  - GitHub Research Repos: Code implementations and reproducible research
- Social Media Intelligence
  - Tweet sentiment analysis for trending topics
  - LinkedIn post engagement metrics
  - Research influencer identification
  - Conference hashtag monitoring (#NeurIPS2024, #ICML2024)
- Advanced AI Capabilities
  - Literature gap analysis using LLMs
  - Automated research proposal generation
  - Cross-paper concept linking and knowledge graphs
  - Research methodology recommendations
  - Collaborative filtering with similar researchers
- Research Workflow Integration
  - Zotero/Mendeley synchronization
  - LaTeX reference management
  - Notion/Obsidian knowledge base integration
  - Calendar integration for reading schedules
  - Email digest with personalized research summaries
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Built with ❤️ for the research community
Making academic research discovery intelligent, personalized, and delightful