Demo video: `DataTalksClub.Streamlit.mp4`
- Introduction
- Architecture
- How It Works
- Getting Started
- View the Dashboard
- Run the Data Pipeline
- Testing
- Quality Checks
- Local Development
- Output Data
- Contributing
- License
- Contact
## Introduction

DataTalksClub-Projects automates the analysis of projects from DataTalksClub courses. It scrapes project submissions, generates descriptive titles using LLMs, and classifies deployment types (Batch/Streaming) and cloud providers (GCP/AWS/Azure).
Supported courses:
- DE Zoomcamp (`dezoomcamp`) → Batch, Streaming
- ML Zoomcamp (`mlzoomcamp`) → Batch, Web Service
- MLOps Zoomcamp (`mlopszoomcamp`) → Batch, Web Service
- LLM Zoomcamp (`llmzoomcamp`) → Batch, Web Service
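The course-to-type constraint above can be sketched as a plain lookup table. This is a hypothetical illustration; the names below are not the repository's actual code.

```python
# Illustrative mapping of course slugs to the deployment types the
# classifier is allowed to emit for that course.
VALID_DEPLOYMENT_TYPES = {
    "dezoomcamp": {"Batch", "Streaming"},
    "mlzoomcamp": {"Batch", "Web Service"},
    "mlopszoomcamp": {"Batch", "Web Service"},
    "llmzoomcamp": {"Batch", "Web Service"},
}

def is_valid_classification(course: str, deployment_type: str) -> bool:
    """Reject LLM output that is impossible for the given course."""
    return deployment_type in VALID_DEPLOYMENT_TYPES.get(course, set())
```

A guard like this catches, for example, an LLM labeling an ML Zoomcamp project as "Streaming", which is not a valid type for that course.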
## Architecture

```
DataTalksClub Website  (courses.datatalks.club/*/projects)
        │
        ▼
1. SCRAPE & DISCOVER
   Course Discovery ──► Web Scraping ──► CSV Generation
   (auto-detects new finished courses; BeautifulSoup; project URLs)
        │
        ▼
2. MULTI-FILE FETCHING
   GitHub API (tree + files) ──► Repo Analyzer (prioritization) ──► key files:
   README.md, docker-compose.yml, *.tf (Terraform), requirements.txt,
   Dockerfile, dags/*.py
   Parallel fetching with ThreadPool (5 workers by default, configurable)
        │
        ▼
3. LLM CLASSIFICATION & TITLE GENERATION
   OpenRouter API (free LLM tier) ──► Classification ──► Title Generation
   (deployment type + cloud provider; titles are domain-focused, tech-accurate)
   Classification runs first, so title generation uses the deployment context
        │
        ▼
4. OUTPUT
   Data/{course}/{year}/data.csv
   ├── project_url
   ├── project_title    (LLM-generated, domain-specific)
   ├── Deployment Type  (Batch, Streaming, Web Service)
   ├── Reason           (evidence from code files)
   └── Cloud            (GCP, AWS, Azure, Other, Unknown)
```
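Stage 1's scraping step can be illustrated with a stdlib-only sketch. The real pipeline uses BeautifulSoup; the page markup and the parser class below are assumptions for illustration.

```python
from html.parser import HTMLParser

class ProjectLinkParser(HTMLParser):
    """Collect GitHub repo links from a course projects page.

    A hypothetical stand-in for the BeautifulSoup scraper; the page
    structure of courses.datatalks.club is assumed, not verified.
    """
    def __init__(self):
        super().__init__()
        self.project_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("https://github.com/"):
                self.project_urls.append(href)

# Hypothetical snippet of a course projects page.
html = """
<ul>
  <li><a href="https://github.com/alice/de-project">submission</a></li>
  <li><a href="/courses/dezoomcamp">course home</a></li>
</ul>
"""
parser = ProjectLinkParser()
parser.feed(html)
print(parser.project_urls)  # ['https://github.com/alice/de-project']
```

Only absolute GitHub links are kept; internal course links are skipped, which is the filtering the CSV-generation step needs.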
## How It Works

- Course Discovery - Automatically detects finished courses from the DataTalksClub website
- Web Scraping - Extracts project submission URLs from course pages
- Multi-File Fetching - For each GitHub repo, fetches 10 key files (not just the README):
  - `docker-compose.yml` → shows Kafka, Spark, orchestrators
  - `*.tf` files → definitive cloud provider indicator
  - `dags/*.py` → Airflow = Batch
  - `requirements.txt` → dependencies
  - `Dockerfile`, `Makefile`, etc.
- LLM Classification - Analyzes actual code to determine:
- Deployment Type: Batch (Airflow, Kestra, Mage) or Streaming (Kafka, Flink)
- Cloud Provider: GCP, AWS, Azure based on Terraform/SDK usage
- Title Generation - Creates descriptive titles based on:
- Actual project functionality (not repo name)
- Deployment type context (no "Real-Time" for Batch projects)
- Domain focus (e.g., "NYC Taxi Analytics Pipeline")
- Parallel Processing: 5-10x faster with configurable workers
- Smart Skipping: Only processes new courses by default
- Multi-File Context: Better accuracy than README-only analysis
- Course-Specific Types: Each course has valid deployment types
- Nested Project Support: Handles `/tree/main/project` URLs correctly
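The parallel multi-file fetching described above can be sketched with a `ThreadPoolExecutor` and the documented default of 5 workers. The fetch function here is a hypothetical stand-in for the GitHub API call, and the file list is abbreviated from the one above.

```python
from concurrent.futures import ThreadPoolExecutor

# Files the analyzer prioritizes (abbreviated from the list above).
KEY_FILES = ["README.md", "docker-compose.yml", "requirements.txt",
             "Dockerfile", "Makefile"]

def fetch_file(repo_url: str, path: str) -> str:
    """Hypothetical stand-in for a GitHub API request; the real
    fetcher would hit the API with an auth token."""
    return f"{repo_url}/{path}: <contents>"

def fetch_key_files(repo_url: str, workers: int = 5) -> dict:
    """Fetch key files in parallel (5 workers by default, matching
    the pipeline's documented default)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        contents = pool.map(lambda p: fetch_file(repo_url, p), KEY_FILES)
        return dict(zip(KEY_FILES, contents))

files = fetch_key_files("https://github.com/user/repo")
```

Because each file fetch is I/O-bound, a small thread pool gives the roughly 5-10x speedup reported in the table below without any extra infrastructure.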
| Metric | Before (Sequential) | After (Parallel, 5 workers) |
|---|---|---|
| 381 projects | ~60 minutes | ~12-15 minutes |
| Throughput | ~0.1 proj/sec | ~0.5 proj/sec |
## Getting Started

Prerequisites:

- Docker and Docker Compose
- GitHub Personal Access Token (create one) - for the pipeline
- OpenRouter API Key (get free tier) - for the pipeline
```bash
git clone https://github.com/dimzachar/DataTalksClub-Projects.git
cd DataTalksClub-Projects

# For pipeline: copy and edit .env
cp .env.example .env

# Build Docker image
make docker-build
```

## View the Dashboard

The easiest way: just visit the live app at datatalksclub-projects.streamlit.app.
Or run locally with Docker:

```bash
docker-compose up streamlit
```

Then open http://localhost:8501.
## Run the Data Pipeline

| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make docker-build` | `docker-compose build` | Build Docker image (run once) |
| `make docker-discover` | `docker-compose run --rm pipeline python -m src.pipeline_runner --discover` | See available courses |
| `make docker-pipeline` | `docker-compose run --rm pipeline python -m src.pipeline_runner` | Process new courses only |
| `make docker-pipeline-all` | `docker-compose run --rm pipeline python -m src.pipeline_runner --all` | Reprocess all courses |
| `make docker-pipeline-single COURSE=dezoomcamp YEAR=2025` | `docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025` | Process a specific course |
| `make docker-pipeline-project COURSE=dezoomcamp YEAR=2025 PROJECT_URL=https://github.com/example-org/example-de-project` | `docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --project-url https://github.com/example-org/example-de-project` | Process only one project URL |
| `make docker-pipeline WORKERS=3` | `docker-compose run --rm pipeline python -m src.pipeline_runner --workers 3` | Process new courses with a custom worker count |
| `make docker-pipeline-test COURSE=dezoomcamp YEAR=2025 LIMIT=10` | `docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --limit 10` | Test with a limited number of projects |
| - | `docker-compose run --rm pipeline python -m src.pipeline_runner --retry-unknowns` | Retry Unknown rows across all existing courses |
Pipeline runner options:

| Option | Description |
|---|---|
| `--discover` | List available courses and their status |
| `--all` | Reprocess all courses (overwrites existing output) |
| `--retry-unknowns` | Re-run the classify step on all existing courses to fix Unknown rows |
| `--course NAME` | Process a specific course |
| `--year YEAR` | Process a specific year |
| `--project-url URL` | Process only one repo URL (requires `--course` and `--year`) |
| `--limit N` | Limit to N projects (for testing) |
| `--workers N` | Parallel workers (default: 5); pass via `WORKERS=N` in make |
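The option table maps naturally onto an `argparse` interface. This is a hypothetical sketch of how the flags could be wired, not the actual `src.pipeline_runner` source.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI wiring matching the documented options."""
    p = argparse.ArgumentParser(prog="src.pipeline_runner")
    p.add_argument("--discover", action="store_true",
                   help="List available courses and their status")
    p.add_argument("--all", action="store_true",
                   help="Reprocess all courses")
    p.add_argument("--retry-unknowns", action="store_true",
                   help="Re-run classification to fix Unknown rows")
    p.add_argument("--course", help="Process a specific course")
    p.add_argument("--year", help="Process a specific year")
    p.add_argument("--project-url", help="Process only one repo URL")
    p.add_argument("--limit", type=int, help="Limit to N projects")
    p.add_argument("--workers", type=int, default=5,
                   help="Parallel workers (default: 5)")
    return p

args = build_parser().parse_args(["--course", "dezoomcamp", "--year", "2025"])
```

With no flags at all, a parser like this falls through to the default behavior (process new courses only, 5 workers), which matches the `make docker-pipeline` row above.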
## Testing

| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make docker-test` | `docker-compose run --rm pipeline python -m pytest tests/ -v` | Run all tests in Docker |
| `make docker-test-cov` | `docker-compose run --rm pipeline python -m pytest tests/ -v --cov=...` | Run tests with coverage in Docker |
| `make test` | `python -m pytest tests/ -v` | Run all tests locally |
| `make test-cov` | `python -m pytest tests/ -v --cov=...` | Run tests with coverage locally |
| `make test-unit` | - | Run unit tests only |
| `make test-e2e` | - | Run E2E/integration tests only |
## Quality Checks

| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make quality_checks` | - | Run isort, black, pylint locally |
| `make docker-quality-checks` | `docker-compose run --rm pipeline python -m isort . && ...` | Run isort, black, pylint in Docker |
## Local Development

Requires Python 3.11. Python 3.12+ has dependency issues.
With uv:

```bash
uv venv --python 3.11
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
uv pip install -r requirements.txt
```

Or with the built-in venv module:

```bash
python3.11 -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
pip install -r requirements.txt
```

| Make Command | Description |
|---|---|
| `make streamlit` | Run Streamlit dashboard |
| `make pipeline` | Process new courses |
| `make pipeline-all` | Reprocess all courses |
| `make pipeline-discover` | Show available courses |
| `make pipeline-single COURSE=dezoomcamp YEAR=2025` | Process a single course |
| `make pipeline-project COURSE=dezoomcamp YEAR=2025 PROJECT_URL=https://github.com/example-org/example-de-project` | Process a single project URL |
## Output Data

Generated data is saved to `Data/{course}/{year}/data.csv`:

| Column | Description | Example |
|---|---|---|
| `project_url` | GitHub repository URL | https://github.com/user/repo |
| `project_title` | LLM-generated title | NYC Taxi Fare Analytics Pipeline |
| `Deployment Type` | Pipeline type | Batch, Streaming, Web Service |
| `Reason` | Classification evidence | Found Airflow DAG in dags/pipeline.py |
| `Cloud` | Cloud provider | GCP, AWS, Azure, Other, Unknown |
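The CSV above can be consumed with the standard library alone. This sketch uses inline sample rows in the documented schema (the values are illustrative) rather than a real `Data/` file:

```python
import csv
import io
from collections import Counter

# Two sample rows in the documented data.csv schema (values illustrative).
sample = """project_url,project_title,Deployment Type,Reason,Cloud
https://github.com/a/r1,NYC Taxi Analytics Pipeline,Batch,Found Airflow DAG,GCP
https://github.com/b/r2,Event Stream Monitor,Streaming,Kafka in docker-compose,AWS
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Tally projects per deployment type, as the dashboard might.
by_type = Counter(row["Deployment Type"] for row in rows)
print(by_type)  # Counter({'Batch': 1, 'Streaming': 1})
```

Swapping `io.StringIO(sample)` for `open("Data/dezoomcamp/2025/data.csv")` would apply the same tally to a real output file.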
## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `make docker-test`
- Run `make quality_checks`
- Submit a pull request
- Tests: Run automatically on every PR and push to main
- Pipeline: Runs quarterly (Jan, Apr, Jul, Oct) to update course data
- Coverage: Minimum 80% required for pipeline files
## License

MIT License - see the LICENSE file.
## Contact

Connect on LinkedIn.
