
DataTalksClub-Projects

Streamlit App

Demo video: DataTalksClub.Streamlit.mp4

Introduction

DataTalksClub-Projects automates the analysis of projects from DataTalksClub courses. It scrapes project submissions, generates descriptive titles using LLMs, and classifies deployment types (Batch/Streaming) and cloud providers (GCP/AWS/Azure).

Supported courses:

Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                           DataTalksClub Website                              │
│                    courses.datatalks.club/*/projects                         │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                         1. SCRAPE & DISCOVER                                 │
│  ┌──────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │ Course Discovery │───▶│  Web Scraping   │───▶│ CSV Generation  │          │
│  │ (Auto-detect new │    │ (BeautifulSoup) │    │ (project URLs)  │          │
│  │ finished courses)│    └─────────────────┘    └─────────────────┘          │
│  └──────────────────┘                                                        │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                         2. MULTI-FILE FETCHING                               │
│  ┌─────────────────┐    ┌─────────────────┐    ┌────────────────────┐        │
│  │   GitHub API    │───▶│  Repo Analyzer  │───▶│ Key Files:         │        │
│  │  (Tree + Files) │    │ (Prioritization)│    │ • README.md        │        │
│  └─────────────────┘    └─────────────────┘    │ • docker-compose   │        │
│                                                │ • *.tf (Terraform) │        │
│         Parallel fetching with ThreadPool      │ • requirements.txt │        │
│         (5 workers default, configurable)      │ • Dockerfile       │        │
│                                                │ • dags/*.py        │        │
│                                                └────────────────────┘        │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                      3. LLM CLASSIFICATION & TITLE GEN                       │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐           │
│  │  OpenRouter API │───▶│ Classification  │───▶│ Title Generation│           │
│  │ (Free LLM tier) │    │ • Deployment    │    │ (Domain-focused,│           │
│  └─────────────────┘    │   Type          │    │  tech-accurate) │           │
│                         │ • Cloud Provider│    └─────────────────┘           │
│                         └─────────────────┘                                  │
│                                                                              │
│  Classification runs FIRST → Title uses deployment context                   │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                            4. OUTPUT                                         │
│  ┌──────────────────────────────────────────────────────────────┐            │
│  │  Data/{course}/{year}/data.csv                               │            │
│  │  ├── project_url                                             │            │
│  │  ├── project_title    (LLM-generated, domain-specific)       │            │
│  │  ├── Deployment Type  (Batch, Streaming, Web Service)        │            │
│  │  ├── Reason           (Evidence from code files)             │            │
│  │  └── Cloud            (GCP, AWS, Azure, Other, Unknown)      │            │
│  └──────────────────────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
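
Stage 2 leans on the GitHub git/trees API, which lists every path in a repo with a single call; only the high-signal files are then downloaded. A minimal Python sketch of the idea, assuming a GITHUB_TOKEN environment variable (the helper name, file subset, and "main" branch default are illustrative, not the project's actual code):

import os
import requests

# High-signal files to pull from each repo (illustrative subset of the 10).
KEY_FILES = {"README.md", "docker-compose.yml", "Dockerfile",
             "requirements.txt", "Makefile"}

def fetch_key_files(owner, repo, branch="main"):  # branch default is an assumption
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    # One git/trees call returns every path in the repository.
    tree = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1",
        headers=headers, timeout=30,
    ).json()["tree"]
    wanted = [
        entry["path"] for entry in tree
        if entry["type"] == "blob"
        and (os.path.basename(entry["path"]) in KEY_FILES
             or entry["path"].endswith(".tf")
             or (entry["path"].startswith("dags/") and entry["path"].endswith(".py")))
    ]
    files = {}
    for path in wanted:
        # raw.githubusercontent.com serves file contents directly.
        url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.ok:
            files[path] = resp.text
    return files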

How It Works

Pipeline Steps

  1. Course Discovery - Automatically detects finished courses from the DataTalksClub website
  2. Web Scraping - Extracts project submission URLs from course pages
  3. Multi-File Fetching - For each GitHub repo, fetches 10 key files (not just the README):
    • docker-compose.yml → shows Kafka, Spark, orchestrators
    • *.tf files → definitive cloud provider indicator
    • dags/*.py → Airflow = Batch
    • requirements.txt → dependencies
    • Dockerfile, Makefile, etc.
  4. LLM Classification - Analyzes actual code (see the sketch after this list) to determine:
    • Deployment Type: Batch (Airflow, Kestra, Mage) or Streaming (Kafka, Flink)
    • Cloud Provider: GCP, AWS, or Azure, based on Terraform/SDK usage
  5. Title Generation - Creates descriptive titles based on:
    • Actual project functionality (not the repo name)
    • Deployment type context (no "Real-Time" for Batch projects)
    • Domain focus (e.g., "NYC Taxi Analytics Pipeline")
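
The classification step goes through OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch of what such a call could look like, assuming an OPENROUTER_API_KEY environment variable (the model name, prompt wording, and JSON response shape are illustrative assumptions, not the project's exact prompt):

import json
import os
import requests

def classify_project(files):
    # Concatenate the fetched key files into one context blob (truncated per file).
    context = "\n\n".join(f"## {path}\n{text[:4000]}" for path, text in files.items())
    prompt = (
        "Based on these repository files, reply with JSON containing "
        '"deployment_type" (Batch or Streaming), "cloud" (GCP, AWS, Azure, '
        'Other, Unknown), and a short "reason" citing the deciding file.\n\n'
        + context
    )
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "meta-llama/llama-3.1-8b-instruct:free",  # illustrative free-tier model
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    # Assumes the model returns bare JSON; real code would need sturdier parsing.
    return json.loads(resp.json()["choices"][0]["message"]["content"])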

Key Features

  • Parallel Processing: 5-10x faster with configurable workers (pattern sketched below)
  • Smart Skipping: Only processes new courses by default
  • Multi-File Context: Better accuracy than README-only analysis
  • Course-Specific Types: Each course has its own set of valid deployment types
  • Nested Project Support: Handles /tree/main/project URLs correctly
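
The parallel processing is a straightforward thread-pool fan-out over repositories, a good fit since the work is I/O-bound (HTTP calls). A sketch of the pattern with Python's standard library, reusing the two illustrative helpers above:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_repos(repos, workers=5):  # 5 matches the documented default
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_key_files, owner, repo): (owner, repo)
                   for owner, repo in repos}
        for future in as_completed(futures):
            owner, repo = futures[future]
            try:
                results[(owner, repo)] = classify_project(future.result())
            except Exception as exc:
                # One failing repo should not sink the whole batch.
                results[(owner, repo)] = {"error": str(exc)}
    return results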

Performance

Metric                    | Before (Sequential) | After (Parallel, 5 workers)
------------------------- | ------------------- | ---------------------------
Total time (381 projects) | ~60 minutes         | ~12-15 minutes
Throughput                | ~0.1 proj/sec       | ~0.5 proj/sec

Getting Started

Prerequisites

  • Docker and Docker Compose
  • GitHub Personal Access Token (needed for the pipeline)
  • OpenRouter API Key - a free tier is available (needed for the pipeline)

Setup

git clone https://github.com/dimzachar/DataTalksClub-Projects.git
cd DataTalksClub-Projects

# For pipeline: copy and edit .env
cp .env.example .env

# Build Docker image
make docker-build
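
The .env file carries the two credentials listed under Prerequisites. A minimal sketch of its shape (the key names here are assumptions; .env.example has the authoritative ones):

# Assumed key names - copy .env.example and fill in your real values
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxx
OPENROUTER_API_KEY=sk-or-xxxxxxxxxxxxxxxx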

View the Dashboard

The easiest way is to visit the live app: datatalksclub-projects.streamlit.app

Or run locally with Docker:

Make Command                | Direct Docker Command
--------------------------- | ---------------------
docker-compose up streamlit | Same

Then open http://localhost:8501


Run the Data Pipeline

Docker Commands (Recommended)

Make Command | Direct Docker Command | Description
------------ | --------------------- | -----------
make docker-build | docker-compose build | Build Docker image (run once)
make docker-discover | docker-compose run --rm pipeline python -m src.pipeline_runner --discover | See available courses
make docker-pipeline | docker-compose run --rm pipeline python -m src.pipeline_runner | Process new courses only
make docker-pipeline-all | docker-compose run --rm pipeline python -m src.pipeline_runner --all | Reprocess all courses
make docker-pipeline-single COURSE=dezoomcamp YEAR=2025 | docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 | Process a specific course
make docker-pipeline-project COURSE=dezoomcamp YEAR=2025 PROJECT_URL=https://github.com/example-org/example-de-project | docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --project-url https://github.com/example-org/example-de-project | Process only one project URL
make docker-pipeline WORKERS=3 | docker-compose run --rm pipeline python -m src.pipeline_runner --workers 3 | Process new courses with a custom worker count
make docker-pipeline-test COURSE=dezoomcamp YEAR=2025 LIMIT=10 | docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --limit 10 | Test with a limited number of projects
- | docker-compose run --rm pipeline python -m src.pipeline_runner --retry-unknowns | Retry Unknown rows across all existing courses

Pipeline Options

Option | Description
------ | -----------
--discover | List available courses and their status
--all | Reprocess all courses (overwrites existing data)
--retry-unknowns | Re-run the classify step on all existing courses to fix Unknowns
--course NAME | Process a specific course
--year YEAR | Process a specific year
--project-url URL | Process only one repo URL (requires --course and --year)
--limit N | Limit to N projects (for testing)
--workers N | Parallel workers (default: 5); pass via WORKERS=N in make
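
Putting these together, a typical first run might look like:

make docker-build                                               # build the image once
make docker-discover                                            # see which courses are available
make docker-pipeline-test COURSE=dezoomcamp YEAR=2025 LIMIT=10  # smoke-test on 10 projects
make docker-pipeline                                            # then process all new courses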

Testing

Make Command | Direct Docker Command | Description
------------ | --------------------- | -----------
make docker-test | docker-compose run --rm pipeline python -m pytest tests/ -v | Run all tests in Docker
make docker-test-cov | docker-compose run --rm pipeline python -m pytest tests/ -v --cov=... | Run tests with coverage in Docker
make test | python -m pytest tests/ -v | Run all tests locally
make test-cov | python -m pytest tests/ -v --cov=... | Run tests with coverage locally
make test-unit | - | Run unit tests only
make test-e2e | - | Run E2E/integration tests only

Quality Checks

Make Command | Direct Docker Command | Description
------------ | --------------------- | -----------
make quality_checks | - | Run isort, black, pylint locally
make docker-quality-checks | docker-compose run --rm pipeline python -m isort . && ... | Run isort, black, pylint in Docker

Local Development (without Docker)

Requires Python 3.11. Python 3.12+ has dependency issues.

Setup with uv

uv venv --python 3.11
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
uv pip install -r requirements.txt

Setup with pip

python3.11 -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
pip install -r requirements.txt

Local Commands

Make Command | Description
------------ | -----------
make streamlit | Run the Streamlit dashboard
make pipeline | Process new courses
make pipeline-all | Reprocess all courses
make pipeline-discover | Show available courses
make pipeline-single COURSE=dezoomcamp YEAR=2025 | Process a single course
make pipeline-project COURSE=dezoomcamp YEAR=2025 PROJECT_URL=https://github.com/example-org/example-de-project | Process a single project URL

Output Data

Generated data is saved to Data/{course}/{year}/data.csv:

Column | Description | Example values
------ | ----------- | --------------
project_url | GitHub repository URL | https://github.com/user/repo
project_title | LLM-generated title | NYC Taxi Fare Analytics Pipeline
Deployment Type | Pipeline type | Batch, Streaming, Web Service
Reason | Classification evidence | Found Airflow DAG in dags/pipeline.py
Cloud | Cloud provider | GCP, AWS, Azure, Other, Unknown
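
Because the output is plain CSV, it is easy to analyze outside the dashboard. For example, a quick look at the deployment-type mix with pandas (assuming pandas is installed):

import pandas as pd

# Load one course/year's generated data.
df = pd.read_csv("Data/dezoomcamp/2025/data.csv")
print(df["Deployment Type"].value_counts())

# Rows still marked Unknown are candidates for --retry-unknowns.
print(df.loc[df["Cloud"] == "Unknown", "project_url"])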

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes
  4. Run tests: make docker-test
  5. Run make quality_checks
  6. Submit a pull request

CI/CD

  • Tests: Run automatically on every PR and push to main
  • Pipeline: Runs quarterly (Jan, Apr, Jul, Oct) to update course data
  • Coverage: Minimum 80% required for pipeline files

License

MIT License - see LICENSE file.

Contact

Connect on LinkedIn

Support this project

Donate with PayPal
