FacultyFinder: DA-IICT Faculty Data Pipeline

Project Overview

FacultyFinder is a robust Data Engineering pipeline designed to harvest, clean, validate, and serve faculty profiles from the DA-IICT website.

The project uses a pure Selenium scraping engine for both dynamic content rendering and extraction. Unlike hybrid approaches that hand the rendered page to a separate HTML parser, it queries the DOM directly with Selenium's native XPath and CSS selectors and consolidates data from multiple directories into a unified schema. The system also includes a modular transformation layer, a data health analysis suite, and a FastAPI serving layer for downstream applications.

Key Features

Selenium-Native Extraction: Uses a headless browser for both navigation and data extraction (XPath/CSS), eliminating the need for external HTML parsers.
Smart Traversal: Handles varying HTML structures across Faculty, Adjunct, and Distinguished Professor pages using adaptive locators.
Modular Architecture: Splits responsibilities into ingestion, transformation, database management, and analysis.
Data Health & Analysis: Includes a standalone script (analysis.py) to generate statistical reports on data quality.
Multi-Format Storage: Persists data to SQLite (faculty.db), JSON (final_faculty_data.json), and CSV (final_faculty_data.csv) for maximum compatibility.
REST API: A high-performance FastAPI server (serving.py) that exposes the curated dataset.
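
The adaptive-locator idea behind Smart Traversal can be illustrated with a small fallback helper. This is a minimal sketch, not the code in ingestion.py: the function name `first_text`, the `NAME_LOCATORS` list, and the selectors in it are hypothetical. It relies only on Selenium's `find_elements(by, value)` method and spells the locator strategies as the plain strings behind `By.CSS_SELECTOR` and `By.XPATH`, so the snippet itself needs no browser.

```python
# Locator strategies as plain strings; these equal selenium's
# By.CSS_SELECTOR and By.XPATH constants.
CSS = "css selector"
XPATH = "xpath"

# Hypothetical candidates for the "name" field, ordered from most to
# least specific. The real pages may need different selectors.
NAME_LOCATORS = [
    (CSS, "h1.faculty-name"),
    (CSS, ".field--name-title"),
    (XPATH, ".//h1"),
]


def first_text(scope, locators, default=None):
    """Return the stripped text of the first non-empty match.

    `scope` is anything exposing find_elements(by, value), such as a
    Selenium WebDriver or WebElement. find_elements returns an empty
    list when nothing matches, so no exception handling is needed.
    """
    for by, value in locators:
        for element in scope.find_elements(by, value):
            text = (element.text or "").strip()
            if text:
                return text
    return default
```

With a live driver this would be called as `first_text(driver, NAME_LOCATORS)`, or with a profile card element as the scope to keep the search local to that card.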

Tech Stack

Language: Python 3.10+
Web Scraping: Selenium (WebDriver & Extraction)
Data Processing: Pandas, NumPy
Database: SQLite3
API Framework: FastAPI, Uvicorn

Setup & Installation

Clone the Repository

git clone https://github.com/YOUR_USERNAME/faculty-assignment.git
cd faculty-assignment

Create a Virtual Environment

python -m venv venv
# Windows:
venv\Scripts\activate
# Mac/Linux:
source venv/bin/activate

Install Dependencies
```
pip install -r requirements.txt
```

Workflow & Usage

Step 1: Ingestion (Scrape & Transform)

Run the main ingestion script. This launches the Selenium driver, scrapes the target pages, applies transformations, and saves the data.

python ingestion.py

Process: Scrapes (Selenium) -> Cleans (via transformation.py) -> Saves to DB (via faculty_db.py) -> Exports CSV/JSON.
Output: faculty.db, final_faculty_data.json, final_faculty_data.csv.
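
The clean, save, and export stages above can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of transformation.py or faculty_db.py: the function names `clean_record` and `save_all` and the table name `faculty` are hypothetical, and the project itself lists Pandas for data processing. The column names are taken from the schema table in this README.

```python
import csv
import json
import sqlite3

# Column names as listed in the Column Schema & Completeness table.
COLUMNS = ["id", "name", "designation", "email", "bio", "research",
           "publications", "teaching", "specialization", "profile_url"]


def clean_record(raw):
    """Collapse whitespace in string fields and map empty values to None."""
    cleaned = {}
    for column in COLUMNS:
        value = raw.get(column)
        if isinstance(value, str):
            value = " ".join(value.split()) or None
        cleaned[column] = value
    return cleaned


def save_all(records, db_path, json_path, csv_path):
    """Write cleaned records to SQLite, JSON, and CSV."""
    rows = [tuple(record[column] for column in COLUMNS) for record in records]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS faculty ("
                "id INTEGER PRIMARY KEY, name TEXT, designation TEXT, "
                "email TEXT, bio TEXT, research TEXT, publications TEXT, "
                "teaching TEXT, specialization TEXT, profile_url TEXT)"
            )
            conn.executemany(
                f"INSERT OR REPLACE INTO faculty VALUES ({placeholders})", rows
            )
    finally:
        conn.close()
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    with open(csv_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records)  # None is written as an empty cell
```

Using `INSERT OR REPLACE` keyed on `id` keeps repeated runs from duplicating rows, which is one reasonable choice for a scraper that is re-run periodically.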

Step 2: Quality Assurance (Data Analysis)

Run the analysis script to verify data integrity and view distribution metrics.

python analysis.py

Action: Loads the generated data and prints a health report (null counts, unique values, data types) to the console.
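
As a rough picture of what such a health report computes, here is a dependency-free sketch of the three checks named above (null counts, unique values, data types). It is not the code in analysis.py; the function name `health_report` and the rule that empty strings count as missing are assumptions made for the example.

```python
def health_report(records):
    """Summarise missing counts, unique values, and value types per column."""
    # Preserve the order in which columns first appear.
    columns = list(dict.fromkeys(key for record in records for key in record))
    total = len(records)
    report = {}
    for column in columns:
        values = [record.get(column) for record in records]
        # Treat both None and the empty string as missing.
        present = [v for v in values if v is not None and v != ""]
        report[column] = {
            "missing": total - len(present),
            "unique": len(set(present)),
            "types": sorted({type(v).__name__ for v in present}),
        }
    return report
```

Feeding it the list loaded from final_faculty_data.json would give one summary entry per column, which can then be printed or compared against the tables later in this README.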

Step 3: Serving (Start API)

Launch the FastAPI server to expose the validated data.

uvicorn serving:app --reload

Action: Loads final_faculty_data.json into memory and starts a local web server.
Output: Server starts at http://127.0.0.1:8000.

Step 4: Verification

Open your web browser and navigate to the interactive API documentation to test the system.

Link: http://127.0.0.1:8000/docs
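
The same check can be scripted instead of clicked through. The sketch below uses only the standard library and assumes the server from Step 3 is running at the address shown above. The helper names `endpoint_url` and `fetch_json` are hypothetical, and any query parameter passed to `/faculty/search` (such as `name`) is an assumption, since this README does not list the accepted parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"


def endpoint_url(path, **params):
    """Build a full URL for an API path, dropping parameters set to None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def fetch_json(path, **params):
    """GET an endpoint and decode its JSON body (requires a running server)."""
    with urlopen(endpoint_url(path, **params), timeout=10) as response:
        return json.load(response)
```

For example, `fetch_json("/")` would call the health check and `fetch_json("/faculty/all")` would pull the full dataset once the server is up.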

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/` | Health Check. Returns API status and total record count. |
| `GET` | `/faculty/all` | Bulk Fetch. Returns the complete dataset (used for vector embedding generation). |
| `GET` | `/faculty/search` | Search. Returns the faculty profiles that match the query parameters. |
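
The README does not list the query parameters that `/faculty/search` accepts, so the following is only a sketch of how such filtering could work over the in-memory records. The function name `search_faculty` and the `name` and `specialization` parameters are assumptions picked from the column schema, not the actual API in serving.py.

```python
def search_faculty(records, name=None, specialization=None):
    """Return records whose fields contain every supplied term.

    Matching is a case-insensitive substring test; terms left as None
    are ignored, so calling with no terms returns every record.
    """
    filters = {"name": name, "specialization": specialization}
    active = {field: term.lower() for field, term in filters.items() if term}

    def matches(record):
        # Missing (None) fields are compared as empty strings.
        return all(
            term in (record.get(field) or "").lower()
            for field, term in active.items()
        )

    return [record for record in records if matches(record)]
```

Keeping the filter as a plain function makes it easy to unit test separately from the web layer; a route handler would only need to pass its query parameters through.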

Project Structure

FacultyFinder(Scraper)/
│
├── Scraper/                      # Core Logic Directory
│   ├── __pycache__/              # Local Python cache (ignored by git)
│   ├── ingestion.py              # Selenium-native extraction engine
│   ├── transformation.py         # Data cleaning and normalization
│   ├── faculty_db.py             # Database schema and CRUD logic
│   ├── analysis.py               # Data health and statistical reporting
│   └── serving.py                # FastAPI implementation for data access
│
├── .gitignore                    # Security and exclusion rules
├── README.md                     # Project documentation
└── requirements.txt              # List of dependencies (Selenium, FastAPI, etc.)

Dataset Statistics & Analysis

| Attribute | Value |
| --- | --- |
| Total Observations | 112 |
| Total Columns | 10 |
| Data Types | `int64` (1), `object` (9) |
| Memory Usage | 8.9+ KB |

Column Schema & Completeness

| # | Column Name | Non-Null Count | Data Type | Status |
| --- | --- | --- | --- | --- |
| 0 | id | 112 | `int64` | Full |
| 1 | name | 112 | `object` | Full |
| 2 | designation | 110 | `object` | Minor Missing |
| 3 | email | 112 | `object` | Full |
| 4 | bio | 72 | `object` | High Missing |
| 5 | research | 112 | `object` | Full |
| 6 | publications | 63 | `object` | High Missing |
| 7 | teaching | 72 | `object` | High Missing |
| 8 | specialization | 109 | `object` | Minor Missing |
| 9 | profile_url | 112 | `object` | Full |

Missing Data Breakdown

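| Column | Missing (NaN) | Total Empty | % Complete |
| --- | --- | --- | --- |
| id | 0 | 0 | 100.0% |
| name | 0 | 0 | 100.0% |
| designation | 2 | 2 | 98.2% |
| email | 0 | 0 | 100.0% |
| bio | 40 | 40 | 64.3% |
| research | 0 | 0 | 100.0% |
| publications | 49 | 49 | 56.2% |
| teaching | 40 | 40 | 64.3% |
| specialization | 3 | 3 | 97.3% |
| profile_url | 0 | 0 | 100.0% |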
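
The "% Complete" figures follow directly from the non-null counts: each is the non-null count divided by the 112 total rows. The short check below reproduces that arithmetic for the columns with gaps; the helper name `percent_complete` is introduced here purely for illustration.

```python
TOTAL_ROWS = 112

# Non-null counts copied from the Column Schema & Completeness table.
NON_NULL = {
    "designation": 110,
    "bio": 72,
    "publications": 63,
    "teaching": 72,
    "specialization": 109,
}


def percent_complete(non_null, total=TOTAL_ROWS):
    """Share of rows that hold a value, expressed as a percentage."""
    return 100.0 * non_null / total


for column, count in NON_NULL.items():
    missing = TOTAL_ROWS - count
    print(f"{column:15s} missing={missing:3d} complete={percent_complete(count):.1f}%")
```

For instance, bio has 72 of 112 values present, which leaves 40 missing and works out to about 64.3% complete.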
