This project implements a specialized search engine for the Cal Poly Pomona Biology Department's faculty. It crawls faculty web pages, processes the information, and provides a sophisticated search interface to help users find relevant faculty members based on their research interests, expertise, and other attributes.
Key features:

- Web crawling of faculty pages with intelligent target identification
- Structured data extraction of faculty profiles
- Advanced text processing with lemmatization
- TF-IDF based search with cosine similarity ranking
- Spell checking for search queries
- Paginated search results with clickable URLs
- MongoDB-based data persistence
The system consists of five main components:
**Web Crawler** (`Crawler.py`)

- Crawls the Biology department website
- Identifies and extracts faculty pages
- Stores raw HTML content in MongoDB
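A minimal sketch of this crawling stage is shown below. The seed URL, the database name `biology_search`, and the `fac-info` selector used to recognize faculty pages are illustrative assumptions, not details taken from `Crawler.py`.

```python
# Illustrative crawler sketch: seed URL, database name, and the faculty-page
# heuristic are assumptions; Crawler.py defines the real logic.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup
from pymongo import MongoClient

START_URL = "https://www.cpp.edu/sci/biological-sciences/"  # assumed seed page
pages = MongoClient("mongodb://localhost:27017")["biology_search"]["CrawledPages"]

frontier, seen = deque([START_URL]), {START_URL}
while frontier and len(seen) < 500:                  # cap the crawl for the demo
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except OSError:
        continue
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical target test: treat pages with a "fac-info" block as faculty pages
    is_faculty = soup.find(class_="fac-info") is not None
    pages.update_one({"url": url},
                     {"$set": {"html": html, "is_target": is_faculty}},
                     upsert=True)
    for a in soup.find_all("a", href=True):          # enqueue in-site links
        link = urljoin(url, a["href"]).split("#")[0]
        if link.startswith(START_URL) and link not in seen:
            seen.add(link)
            frontier.append(link)
```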
**Faculty Parser** (`facultyParser.py`)

- Extracts structured information from faculty pages
- Processes main content and navigation sections
- Stores parsed data in MongoDB
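A rough sketch of this parsing step, under the same assumptions as above (the database name, the `is_target` flag, and the field names are illustrative):

```python
# Illustrative parser sketch: selectors and field names are assumptions;
# facultyParser.py implements the actual extraction rules.
from bs4 import BeautifulSoup
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["biology_search"]
faculty = db["FacultyInfo"]

for page in db["CrawledPages"].find({"is_target": True}):
    soup = BeautifulSoup(page["html"], "html.parser")
    main = soup.find("main") or soup                 # fall back to the whole page
    heading = soup.find("h1")
    record = {
        "url": page["url"],
        "name": heading.get_text(strip=True) if heading else None,
        "text": main.get_text(" ", strip=True),      # flattened main-content text
    }
    faculty.update_one({"url": page["url"]}, {"$set": record}, upsert=True)
```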
**Text Processor** (`Lemmatizer.py`)

- Implements text normalization using spaCy
- Preserves important information like phone numbers
- Enhances search accuracy through lemmatization
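The gist of that normalization, as a hedged sketch (the phone-number pattern and the `lemmatized_text` field are assumptions; `Lemmatizer.py` holds the real rules):

```python
# Illustrative lemmatization sketch: phone numbers are pulled out first and
# re-attached verbatim so they survive normalization and stay searchable.
import re

import spacy
from pymongo import MongoClient

nlp = spacy.load("en_core_web_lg")
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")   # assumed US format

def lemmatize(text: str) -> str:
    phones = PHONE.findall(text)                  # protect phone numbers
    doc = nlp(PHONE.sub(" ", text))               # lemmatize everything else
    lemmas = [t.lemma_.lower() for t in doc if not t.is_punct and not t.is_space]
    return " ".join(lemmas + phones)              # re-attach the numbers verbatim

faculty = MongoClient("mongodb://localhost:27017")["biology_search"]["FacultyInfo"]
for rec in faculty.find({}):
    faculty.update_one({"_id": rec["_id"]},
                       {"$set": {"lemmatized_text": lemmatize(rec.get("text", ""))}})
```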
**Index Generator** (`IndexAndEmbeddingsGeneration.py`)

- Creates TF-IDF vectors for faculty documents
- Builds an inverted index for efficient searching
- Generates and stores document embeddings
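A compact sketch of this stage using scikit-learn's `TfidfVectorizer` (the postings layout and field names are assumptions; the script's actual schema may differ):

```python
# Illustrative indexing sketch: builds TF-IDF vectors and a term -> postings map.
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer

db = MongoClient("mongodb://localhost:27017")["biology_search"]
docs = list(db["FacultyInfo"].find({}, {"url": 1, "lemmatized_text": 1}))

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d.get("lemmatized_text", "") for d in docs])
terms = vectorizer.get_feature_names_out()

# Inverted index: term -> list of [document position, tf-idf weight]
inverted = {}
coo = matrix.tocoo()
for doc_idx, term_idx, weight in zip(coo.row, coo.col, coo.data):
    inverted.setdefault(terms[term_idx], []).append([int(doc_idx), float(weight)])

db["InvertedIndex"].delete_many({})
db["InvertedIndex"].insert_many([{"term": t, "postings": p} for t, p in inverted.items()])

db["Embeddings"].delete_many({})
db["Embeddings"].insert_many(
    [{"doc": i, "url": d["url"], "vector": matrix[i].toarray().ravel().tolist()}
     for i, d in enumerate(docs)])
```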
**Search Engine** (`SearchEngine.py`)

- Provides an interactive search interface
- Implements spell checking and query processing
- Ranks results using cosine similarity
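The query side can be pictured as below. For brevity this sketch rebuilds the TF-IDF model from `FacultyInfo` instead of reading the stored `InvertedIndex`/`Embeddings`, which is a simplification of what `SearchEngine.py` does; field names are assumptions.

```python
# Illustrative query sketch: spell-check the query, vectorize it, rank by cosine.
import numpy as np
from pymongo import MongoClient
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from spellchecker import SpellChecker

db = MongoClient("mongodb://localhost:27017")["biology_search"]
docs = list(db["FacultyInfo"].find({}, {"url": 1, "lemmatized_text": 1}))

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d.get("lemmatized_text", "") for d in docs])
spell = SpellChecker()

def search(query, k=5):
    # Replace each query word with its spelling suggestion when one exists
    corrected = " ".join(spell.correction(w) or w for w in query.lower().split())
    scores = cosine_similarity(vectorizer.transform([corrected]), matrix).ravel()
    top = np.argsort(scores)[::-1][:k]
    return [(docs[i]["url"], float(scores[i])) for i in top if scores[i] > 0]

print(search("molecular bilogy"))   # the typo should be corrected to "biology"
```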
Requirements:

- Python 3.x
- MongoDB
- Required Python packages: `beautifulsoup4`, `pymongo`, `regex`, `pyspellchecker`, `spacy`, `scikit-learn`, `numpy`
- spaCy's English language model, installed with `python -m spacy download en_core_web_lg`
To set up the project:

- Clone the repository:

  ```bash
  git clone [repository-url]
  cd CS5180-finalproject
  ```

- Install the required packages:

  ```bash
  pip install beautifulsoup4 pymongo regex pyspellchecker spacy scikit-learn numpy
  ```

- Download spaCy's English language model:

  ```bash
  python -m spacy download en_core_web_lg
  ```

- Ensure MongoDB is running locally on the default port (27017); a quick connectivity check is sketched below.
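A minimal connectivity check, assuming a local MongoDB on the default port:

```python
# Optional sanity check: confirm MongoDB is reachable before running the pipeline.
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
try:
    client.admin.command("ping")
    print("MongoDB is up")
except ConnectionFailure:
    print("MongoDB is not reachable on localhost:27017")
```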
Run the components in the following order:
- Start the web crawler:

  ```bash
  python Crawler.py
  ```

- Parse faculty information:

  ```bash
  python facultyParser.py
  ```

- Process text with lemmatization:

  ```bash
  python Lemmatizer.py
  ```

- Generate search indices:

  ```bash
  python IndexAndEmbeddingsGeneration.py
  ```

- Start the search engine:

  ```bash
  python SearchEngine.py
  ```
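If you prefer a single command, the sketch below runs the stages in that order via `subprocess`; it is a convenience wrapper, not part of the project scripts.

```python
# Convenience sketch: run the five stages in order and stop on the first failure.
import subprocess
import sys

for script in ("Crawler.py", "facultyParser.py", "Lemmatizer.py",
               "IndexAndEmbeddingsGeneration.py", "SearchEngine.py"):
    print(f"--- running {script} ---")
    if subprocess.run([sys.executable, script]).returncode != 0:
        sys.exit(f"{script} failed; fix the error before continuing")
```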
The search interface supports:

- Natural language queries
- Spell check suggestions
- Paginated results (5 results per page)
- Navigation options:
  - Next/Previous page
  - Run new query
  - Quit
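The pagination loop can be pictured like this (a sketch only; `results` is assumed to be the ranked list of `(url, score)` pairs produced by the search step):

```python
# Illustrative pagination loop: shows 5 results per page with the options above.
PAGE_SIZE = 5

def browse(results):
    page = 0
    while True:
        chunk = results[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]
        for rank, (url, score) in enumerate(chunk, start=page * PAGE_SIZE + 1):
            print(f"{rank}. {url}  (score: {score:.3f})")
        choice = input("[n]ext, [p]revious, [r]un new query, [q]uit: ").strip().lower()
        if choice == "n" and (page + 1) * PAGE_SIZE < len(results):
            page += 1
        elif choice == "p" and page > 0:
            page -= 1
        elif choice in ("r", "q"):
            return choice                     # caller decides whether to re-query
```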
The system uses MongoDB with the following collections:
- `CrawledPages`: Raw HTML content
- `FacultyInfo`: Structured faculty data
- `InvertedIndex`: Search index data
- `Embeddings`: TF-IDF vectors
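For orientation, the documents in each collection might look roughly like this; the field names beyond the collection names themselves are assumptions carried over from the sketches above:

```python
# Hypothetical document shapes, one per collection.
crawled_page  = {"url": "...", "html": "<!doctype html>...", "is_target": True}
faculty_info  = {"url": "...", "name": "...", "text": "...", "lemmatized_text": "..."}
index_entry   = {"term": "ecology", "postings": [[0, 0.42], [7, 0.13]]}
embedding_doc = {"doc": 0, "url": "...", "vector": [0.0, 0.42, 0.0]}
```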
This project is licensed under the MIT License - see the LICENSE file for details.
Authors:

- [Your Name]
- [Other Contributors]

Acknowledgments:

- Cal Poly Pomona Biology Department
- CS5180 Information Retrieval Course