This project implements a specialized search engine for the Cal Poly Pomona Biology Department's faculty. It crawls faculty web pages, processes the information, and provides a sophisticated search interface to help users find relevant faculty members based on their research interests, expertise, and other attributes.
Key features:

- Web crawling of faculty pages with intelligent target identification
- Structured data extraction of faculty profiles
- Advanced text processing with lemmatization
- TF-IDF based search with cosine similarity ranking
- Spell checking for search queries
- Paginated search results with clickable URLs
- MongoDB-based data persistence
The system consists of five main components:
**Web Crawler** (`Crawler.py`)

- Crawls the Biology department website
- Identifies and extracts faculty pages
- Stores raw HTML content in MongoDB
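A minimal sketch of this crawling stage is shown below. The seed URL, the database name `biology_search`, and the `fac-info` selector used to recognize faculty pages are illustrative assumptions, not details taken from `Crawler.py`.

```python
# Illustrative crawler sketch: seed URL, database name, and the faculty-page
# heuristic are assumptions; Crawler.py defines the real logic.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup
from pymongo import MongoClient

START_URL = "https://www.cpp.edu/sci/biological-sciences/"  # assumed seed page
pages = MongoClient("mongodb://localhost:27017")["biology_search"]["CrawledPages"]

frontier, seen = deque([START_URL]), {START_URL}
while frontier and len(seen) < 500:                  # cap the crawl for the demo
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except OSError:
        continue
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical target test: treat pages with a "fac-info" block as faculty pages
    is_faculty = soup.find(class_="fac-info") is not None
    pages.update_one({"url": url},
                     {"$set": {"html": html, "is_target": is_faculty}},
                     upsert=True)
    for a in soup.find_all("a", href=True):          # enqueue in-site links
        link = urljoin(url, a["href"]).split("#")[0]
        if link.startswith(START_URL) and link not in seen:
            seen.add(link)
            frontier.append(link)
```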
**Faculty Parser** (`facultyParser.py`)

- Extracts structured information from faculty pages
- Processes main content and navigation sections
- Stores parsed data in MongoDB
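A rough sketch of this parsing step, under the same assumptions as above (the database name, the `is_target` flag, and the field names are illustrative):

```python
# Illustrative parser sketch: selectors and field names are assumptions;
# facultyParser.py implements the actual extraction rules.
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["biology_search"]
faculty = db["FacultyInfo"]

for page in db["CrawledPages"].find({"is_target": True}):
    soup = BeautifulSoup(page["html"], "html.parser")
    main = soup.find("main") or soup                 # fall back to the whole page
    heading = soup.find("h1")
    record = {
        "url": page["url"],
        "name": heading.get_text(strip=True) if heading else None,
        "text": main.get_text(" ", strip=True),      # flattened main-content text
    }
    faculty.update_one({"url": page["url"]}, {"$set": record}, upsert=True)
```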
**Text Processor** (`Lemmatizer.py`)

- Implements text normalization using spaCy
- Preserves important information like phone numbers
- Enhances search accuracy through lemmatization
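The gist of that normalization, as a hedged sketch (the phone-number pattern and the `lemmatized_text` field are assumptions; `Lemmatizer.py` holds the real rules):

```python
# Illustrative lemmatization sketch: phone numbers are pulled out first and
# re-attached verbatim so they survive normalization and stay searchable.
import re

import spacy
from pymongo import MongoClient

nlp = spacy.load("en_core_web_lg")
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")   # assumed US format

def lemmatize(text: str) -> str:
    phones = PHONE.findall(text)                  # protect phone numbers
    doc = nlp(PHONE.sub(" ", text))               # lemmatize everything else
    lemmas = [t.lemma_.lower() for t in doc if not t.is_punct and not t.is_space]
    return " ".join(lemmas + phones)              # re-attach the numbers verbatim

faculty = MongoClient("mongodb://localhost:27017")["biology_search"]["FacultyInfo"]
for rec in faculty.find({}):
    faculty.update_one({"_id": rec["_id"]},
                       {"$set": {"lemmatized_text": lemmatize(rec.get("text", ""))}})
```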
**Index Generator** (`IndexAndEmbeddingsGeneration.py`)

- Creates TF-IDF vectors for faculty documents
- Builds an inverted index for efficient searching
- Generates and stores document embeddings
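A compact sketch of this stage using scikit-learn's `TfidfVectorizer` (the postings layout and field names are assumptions; the script's actual schema may differ):

```python
# Illustrative indexing sketch: builds TF-IDF vectors and a term -> postings map.
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

db = MongoClient("mongodb://localhost:27017")["biology_search"]
docs = list(db["FacultyInfo"].find({}, {"url": 1, "lemmatized_text": 1}))

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d.get("lemmatized_text", "") for d in docs])
terms = vectorizer.get_feature_names_out()

# Inverted index: term -> list of [document position, tf-idf weight]
inverted = {}
coo = matrix.tocoo()
for doc_idx, term_idx, weight in zip(coo.row, coo.col, coo.data):
    inverted.setdefault(terms[term_idx], []).append([int(doc_idx), float(weight)])

db["InvertedIndex"].delete_many({})
db["InvertedIndex"].insert_many([{"term": t, "postings": p} for t, p in inverted.items()])

db["Embeddings"].delete_many({})
db["Embeddings"].insert_many(
    [{"doc": i, "url": d["url"], "vector": matrix[i].toarray().ravel().tolist()}
     for i, d in enumerate(docs)])
```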
**Search Engine** (`SearchEngine.py`)

- Provides an interactive search interface
- Implements spell checking and query processing
- Ranks results using cosine similarity
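The query side can be pictured as below. For brevity this sketch rebuilds the TF-IDF model from `FacultyInfo` instead of reading the stored `InvertedIndex`/`Embeddings`, which is a simplification of what `SearchEngine.py` does; field names are assumptions.

```python
# Illustrative query sketch: spell-check the query, vectorize it, rank by cosine.
import numpy as np
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spellchecker import SpellChecker

db = MongoClient("mongodb://localhost:27017")["biology_search"]
docs = list(db["FacultyInfo"].find({}, {"url": 1, "lemmatized_text": 1}))

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d.get("lemmatized_text", "") for d in docs])
spell = SpellChecker()

def search(query, k=5):
    # Replace each query word with its spelling suggestion when one exists
    corrected = " ".join(spell.correction(w) or w for w in query.lower().split())
    scores = cosine_similarity(vectorizer.transform([corrected]), matrix).ravel()
    top = np.argsort(scores)[::-1][:k]
    return [(docs[i]["url"], float(scores[i])) for i in top if scores[i] > 0]

print(search("molecular bilogy"))   # the typo should be corrected to "biology"
```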
Requirements:

- Python 3.x
- MongoDB
- Required Python packages: `beautifulsoup4`, `pymongo`, `regex`, `pyspellchecker`, `spacy`, `scikit-learn`, `numpy`
- spaCy's English language model, installed with `python -m spacy download en_core_web_lg`
To set up the project:

- Clone the repository:

  ```bash
  git clone [repository-url]
  cd CS5180-finalproject
  ```

- Install the required packages:

  ```bash
  pip install beautifulsoup4 pymongo regex pyspellchecker spacy scikit-learn numpy
  ```

- Download spaCy's English language model:

  ```bash
  python -m spacy download en_core_web_lg
  ```

- Ensure MongoDB is running locally on the default port (27017); a quick connectivity check is sketched below.
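A minimal connectivity check, assuming a local MongoDB on the default port:

```python
# Optional sanity check: confirm MongoDB is reachable before running the pipeline.
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
try:
    client.admin.command("ping")
    print("MongoDB is up")
except ConnectionFailure:
    print("MongoDB is not reachable on localhost:27017")
```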
Run the components in the following order:
- Start the web crawler:

  ```bash
  python Crawler.py
  ```

- Parse faculty information:

  ```bash
  python facultyParser.py
  ```

- Process text with lemmatization:

  ```bash
  python Lemmatizer.py
  ```

- Generate search indices:

  ```bash
  python IndexAndEmbeddingsGeneration.py
  ```

- Start the search engine:

  ```bash
  python SearchEngine.py
  ```
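If you prefer a single command, the sketch below runs the stages in that order via `subprocess`; it is a convenience wrapper, not part of the project scripts.

```python
# Convenience sketch: run the five stages in order and stop on the first failure.
import subprocess
import sys

for script in ("Crawler.py", "facultyParser.py", "Lemmatizer.py",
               "IndexAndEmbeddingsGeneration.py", "SearchEngine.py"):
    print(f"--- running {script} ---")
    if subprocess.run([sys.executable, script]).returncode != 0:
        sys.exit(f"{script} failed; fix the error before continuing")
```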
The search interface supports:

- Natural language queries
- Spell check suggestions
- Paginated results (5 results per page)
- Navigation options:
  - Next/Previous page
  - Run new query
  - Quit
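The pagination loop can be pictured like this (a sketch only; `results` is assumed to be the ranked list of `(url, score)` pairs produced by the search step):

```python
# Illustrative pagination loop: shows 5 results per page with the options above.
PAGE_SIZE = 5

def browse(results):
    page = 0
    while True:
        chunk = results[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]
        for rank, (url, score) in enumerate(chunk, start=page * PAGE_SIZE + 1):
            print(f"{rank}. {url}  (score: {score:.3f})")
        choice = input("[n]ext, [p]revious, [r]un new query, [q]uit: ").strip().lower()
        if choice == "n" and (page + 1) * PAGE_SIZE < len(results):
            page += 1
        elif choice == "p" and page > 0:
            page -= 1
        elif choice in ("r", "q"):
            return choice                     # caller decides whether to re-query
```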
The system uses MongoDB with the following collections:
- `CrawledPages`: Raw HTML content
- `FacultyInfo`: Structured faculty data
- `InvertedIndex`: Search index data
- `Embeddings`: TF-IDF vectors
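For orientation, the documents in each collection might look roughly like this; the field names beyond the collection names themselves are assumptions carried over from the sketches above:

```python
# Hypothetical document shapes, one per collection.
crawled_page  = {"url": "...", "html": "<!doctype html>...", "is_target": True}
faculty_info  = {"url": "...", "name": "...", "text": "...", "lemmatized_text": "..."}
index_entry   = {"term": "ecology", "postings": [[0, 0.42], [7, 0.13]]}
embedding_doc = {"doc": 0, "url": "...", "vector": [0.0, 0.42, 0.0]}
```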
This project is licensed under the MIT License - see the LICENSE file for details.
Authors:

- [Your Name]
- [Other Contributors]

Acknowledgments:

- Cal Poly Pomona Biology Department
- CS5180 Information Retrieval Course