From 1e23151988d14af60db1a59245960d93d3c2e3b4 Mon Sep 17 00:00:00 2001
From: Lavanya-Parashar
Date: Tue, 12 May 2026 19:23:34 +0530
Subject: [PATCH] level-5 submission

---
 .../lavanya-parashar/level5/answers.md        | 375 ++++++++++++++++++
 submissions/lavanya-parashar/level5/schema.md |  54 +++
 2 files changed, 429 insertions(+)
 create mode 100644 submissions/lavanya-parashar/level5/answers.md
 create mode 100644 submissions/lavanya-parashar/level5/schema.md

diff --git a/submissions/lavanya-parashar/level5/answers.md b/submissions/lavanya-parashar/level5/answers.md
new file mode 100644
index 000000000..565428faf
--- /dev/null
+++ b/submissions/lavanya-parashar/level5/answers.md
@@ -0,0 +1,375 @@
+# Level 5 — Graph Thinking
+## Factory Production Knowledge Graph
+
+## Q1. schema.md uploaded in the folder
+---
+## Q2. Why Not Just SQL?
+
+**Question:** Which workers are certified to cover Station 016 (Gjutning) when Per Hansen is on vacation, and which projects would be affected?
+
+### SQL Version
+```sql
+SELECT
+    w.name AS backup_worker,
+    w.role AS role,
+    w.type AS employment_type,
+    p.project_name AS affected_project,
+    p.project_number AS project_number
+FROM workers w
+-- Workers listed as able to cover Station 016
+JOIN worker_can_cover wc
+    ON w.worker_id = wc.worker_id
+    AND wc.station_code = '016'
+-- Every production row scheduled at Station 016
+JOIN production p
+    ON p.station_code = '016'
+WHERE w.name <> 'Per Hansen'
+    -- Keep only workers holding a certification the station requires
+    AND w.worker_id IN (
+        SELECT DISTINCT worker_id
+        FROM worker_certifications
+        WHERE certification_code IN (
+            SELECT certification_code
+            FROM station_certifications
+            WHERE station_code = '016'
+        )
+    )
+-- GROUP BY collapses the duplicates produced by the weekly production rows
+GROUP BY w.name, w.role, w.type, p.project_name, p.project_number
+ORDER BY w.name, p.project_name;
+```
+---
+### Cypher Version
+
+```cypher
+MATCH (per:Worker {name: "Per Hansen"})-[:WORKS_AT]->(s:Station {station_code: "016"})
+MATCH (backup:Worker)-[:CAN_COVER]->(s)
+WHERE backup.name <> "Per Hansen"
+MATCH (p:Project)-[:SCHEDULED_AT]->(s)
+RETURN
+    backup.name AS backup_worker,
+    backup.role AS role,
+    backup.type AS employment_type,
+    collect(DISTINCT p.project_name) AS affected_projects,
+    count(DISTINCT p) AS project_count
+ORDER BY backup.name
+```
+---
+### What the Graph Makes Obvious That SQL Hides
+
+The actual data shows that Station 016 (Gjutning) has only two workers who can cover it: Per Hansen (W07, primary) and Victor Elm (W11, Foreman). If Per Hansen is on vacation, Victor Elm is the only backup, even though he is already covering all 10 stations in the factory.
+
+In SQL, finding this requires five tables and two nested subqueries. The relationship between "who covers a station" and "what projects run there" is hidden behind join keys. In Cypher, I just follow the path: find the station → find who covers it → find the projects scheduled there. It reads exactly like the question.
+
+If the question changes to "what if Victor Elm is sick?", which affects all 10 stations, I only need to change the starting worker and drop the station filter in Cypher. In SQL, the station-specific joins and subqueries have to be reworked.
+
+---
+
+## Q3. Spot the Bottleneck
+
+### Weekly Capacity Deficits
+
+After looking at `factory_capacity.csv`, I found that 5 out of 8 weeks have more work planned than hours available:
+
+| Week | Capacity | Planned | Balance  | Status  |
+|------|----------|---------|----------|---------|
+| w1   | 480 hrs  | 612 hrs | −132 hrs | DEFICIT |
+| w2   | 520 hrs  | 645 hrs | −125 hrs | DEFICIT |
+| w3   | 480 hrs  | 398 hrs | +82 hrs  | OK      |
+| w4   | 500 hrs  | 550 hrs | −50 hrs  | DEFICIT |
+| w5   | 510 hrs  | 480 hrs | +30 hrs  | OK      |
+| w6   | 440 hrs  | 520 hrs | −80 hrs  | DEFICIT |
+| w7   | 520 hrs  | 600 hrs | −80 hrs  | DEFICIT |
+| w8   | 500 hrs  | 470 hrs | +30 hrs  | OK      |
+
+w1 and w2 are the worst. Even after adding 40 overtime hours in w2, demand still exceeds capacity by 125 hours. w6 is also risky because own staff drops to 9 people that week, bringing available hours down to 440.
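
The weekly figures above are plain subtraction (capacity minus planned), so they are easy to re-check. A minimal Python sketch, assuming the table values are typed in by hand; `WEEKS` and `weekly_balance` are illustrative names and not part of the submission or of any CSV loader:

```python
# Weekly figures copied from the capacity table: (capacity_hrs, planned_hrs).
WEEKS = {
    "w1": (480, 612), "w2": (520, 645), "w3": (480, 398), "w4": (500, 550),
    "w5": (510, 480), "w6": (440, 520), "w7": (520, 600), "w8": (500, 470),
}


def weekly_balance(weeks):
    """Return {week: capacity - planned}; negative values are deficits."""
    return {week: capacity - planned for week, (capacity, planned) in weeks.items()}


balance = weekly_balance(WEEKS)
# Weeks where planned hours exceed available capacity.
deficit_weeks = sorted(week for week, hours in balance.items() if hours < 0)
print(deficit_weeks)
```

Running this lists the same five deficit weeks as the table, which is a quick guard against transcription slips when the CSV changes.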
+
+---
+
+### Which Stations and Projects Are Causing the Overload
+
+By comparing `actual_hours` vs `planned_hours` in `factory_production.csv`, I found three stations consistently running over planned hours:
+
+1. Station 018 — SB B/F-hall (4 overruns, worst overall)
+2. Station 016 — Gjutning (3 overruns, highest single variance)
+3. Station 015 — Montering IQP (3 overruns)
+
+All three stations were overrunning at the same time in w1 and w2, which lines up with the −132 and −125 hour deficits in those weeks. Station 016 also carries a staffing risk: only Per Hansen and Victor Elm can cover it, so whenever one of them is away the station depends on a single person.
+
+---
+
+### Cypher Query — Projects Over 10% Variance, Grouped by Station
+
+```cypher
+MATCH (p:Project)-[r:SCHEDULED_AT]->(s:Station)
+WHERE r.actual_hours > r.planned_hours * 1.1
+  AND r.planned_hours > 0
+WITH
+    s.station_code AS station_code,
+    s.station_name AS station_name,
+    count(p) AS overload_count,
+    round(avg((r.actual_hours - r.planned_hours)
+        / r.planned_hours * 100)) AS avg_variance_pct,
+    collect({
+        project: p.project_name,
+        week: r.week,
+        planned: r.planned_hours,
+        actual: r.actual_hours,
+        variance_pct: round((r.actual_hours - r.planned_hours)
+            / r.planned_hours * 100)
+    }) AS overloaded_projects
+RETURN station_code, station_name, overload_count,
+       avg_variance_pct, overloaded_projects
+ORDER BY avg_variance_pct DESC
+```
+
+---
+
+### How I Would Model the Alert
+
+I would use a dedicated `(:Bottleneck)` node rather than adding a property to the relationship. My reasons:
+
+- A flag stored as a property on `SCHEDULED_AT` is harder to find: without a relationship property index, locating flagged rows means scanning every `SCHEDULED_AT` relationship. A `(:Bottleneck)` node can be queried directly with `MATCH (b:Bottleneck {status: "ACTIVE"})`.
+- One bottleneck node can link to multiple projects at once, so I can see all contributing projects in a single query.
+- It has its own lifecycle: created when variance crosses 10%, updated as things change, marked RESOLVED when fixed.
+
+```cypher
+// MATCH the existing nodes first: MERGE on a full pattern such as
+// (b)-[:OVERLOADS]->(s:Station {...}) would create a duplicate Station
+// whenever the relationship does not exist yet.
+MATCH (s:Station {station_code: "016"})
+MATCH (p1:Project {project_id: "P03"})
+MATCH (p2:Project {project_id: "P05"})
+MERGE (b:Bottleneck {station_code: "016", week: "w2"})
+SET b.severity = "HIGH",
+    b.avg_variance_pct = 17.2,
+    b.status = "ACTIVE",
+    b.detected_at = datetime()
+
+MERGE (b)-[:OVERLOADS]->(s)
+MERGE (p1)-[:CONTRIBUTES_TO]->(b)
+MERGE (p2)-[:CONTRIBUTES_TO]->(b)
+```
+---
+
+## Q4. Vector + Graph Hybrid
+
+### What I Would Embed
+
+I would create a combined text string from the most meaningful fields in each production row and embed that:
+
+```python
+# `row` is one production record, e.g. a dict produced by csv.DictReader
+embed_text = (
+    f"{row['product_type']} | "
+    f"qty:{row['quantity']} {row['unit']} | "
+    f"station:{row['station_name']} | "
+    f"etapp:{row['etapp']} | "
+    f"bop:{row['bop']} | "
+    f"planned:{row['planned_hours']}h"
+)
+```
+
+I would not embed raw numbers alone — a float like `38.5` means nothing to an embedding model without context. Combined with fields like `"IQB | 1200 meters | Gjutning | ET2"`, it carries real meaning. I would also embed worker profiles (`role + certifications + primary_station`) for future skill-based matching.
+
+---
+
+### Hybrid Query — Similar Past Projects That Also Ran on Budget
+
+```python
+# `embed`, `vector_index`, and `session` are placeholders for an embedding
+# function, a vector store client, and a Neo4j driver session.
+
+# Step 1 — Vector search (finds semantically similar projects)
+query_text = (
+    "450 meters IQB beams hospital extension Linköping "
+    "tight timeline similar scope to previous hospital projects"
+)
+similar_project_ids = vector_index.query(embed(query_text), top_k=10)
+# Returns e.g. ["P05", "P01", "P08"] ranked by cosine similarity
+
+# Step 2 — Graph filter (keeps only projects with variance under 5%)
+cypher = """
+WITH $similar_ids AS candidates
+UNWIND candidates AS pid
+MATCH (p:Project {project_id: pid})-[r:SCHEDULED_AT]->(s:Station)
+WHERE r.planned_hours > 0  // avoid division by zero below
+WITH
+    p,
+    collect(DISTINCT s.station_name) AS stations_used,
+    avg(abs(r.actual_hours - r.planned_hours)
+        / r.planned_hours) AS avg_variance,
+    sum(r.planned_hours) AS total_planned_hours
+WHERE avg_variance < 0.05
+MATCH (p)-[:PRODUCES]->(prod:Product)
+RETURN
+    p.project_name AS project,
+    p.project_number AS number,
+    stations_used,
+    collect(DISTINCT prod.product_type) AS products,
+    round(avg_variance * 100, 2) AS variance_pct,
+    total_planned_hours
+ORDER BY avg_variance ASC
+LIMIT 3
+"""
+results = session.run(cypher, similar_ids=similar_project_ids)
+```
+---
+
+### Why This Is Better Than Filtering by Product Type
+
+If I filter by `product_type = 'IQB'`, I get all IQB projects regardless of scale or risk. For example, Kontorshus Mölndal (P02) and Sjukhus Linköping ET2 (P05) are both IQB, but P05 has 1200 meters at 613 planned hours while P02 has just 167 planned hours. Same product type, completely different execution profile.
+
+My hybrid approach first finds projects that are contextually similar to the new request, then keeps only the ones that ran on budget. The result is not just "similar projects" but "similar projects that were actually executed successfully", which is the only kind of reference useful for planning.
+
+---
+
+### Connection to Boardy
+
+Boardy uses the same two-step pattern, but for matching people instead of projects:
+
+- **Vector layer:** embed person profiles (skills, goals, interests) → find people whose needs are close to your offer
+- **Graph layer:** filter by community, team, and existing connections → make sure the match is also structurally real
+
+Semantic similarity alone gives false positives. The graph layer is what makes the match meaningful.
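
The two-step pattern from Q4 (rank by similarity, then filter by a structural fact) can be sketched without any external services. This is only an illustration under stated assumptions: `cosine` and `hybrid_match` are hypothetical helpers, the in-memory dictionaries stand in for the vector index and the graph query, and the toy vectors and variances are made up rather than taken from the CSVs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_match(query_vec, embeddings, variances, top_k=3, max_variance=0.05):
    """Step 1: take the top_k most similar ids. Step 2: keep those under max_variance."""
    ranked = sorted(
        embeddings,
        key=lambda pid: cosine(query_vec, embeddings[pid]),
        reverse=True,
    )
    candidates = ranked[:top_k]
    return [pid for pid in candidates if variances[pid] < max_variance]


# Toy, made-up values purely for illustration.
toy_embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
toy_variances = {"A": 0.12, "B": 0.03, "C": 0.01}

print(hybrid_match([1.0, 0.0], toy_embeddings, toy_variances, top_k=2))
```

In this toy run, "A" is the closest match but is dropped for running 12% over plan, which is exactly the false positive the graph filter is meant to remove.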
+
+---
+
+## Q5. My L6 Blueprint
+
+### Node Labels → CSV Mapping
+
+| Node | Properties | CSV Source | Count |
+|------|------------|------------|-------|
+| `Project` | `project_id`, `project_number`, `project_name` | factory_production.csv | 8 |
+| `Product` | `product_type`, `unit`, `quantity`, `unit_factor` | factory_production.csv | 7 |
+| `Station` | `station_code`, `station_name` | factory_production.csv | 10 |
+| `Worker` | `worker_id`, `name`, `role`, `hours_per_week`, `type` | factory_workers.csv | 14 |
+| `Week` | `week` | both CSVs | 8 |
+| `Etapp` | `etapp` | factory_production.csv | 2 (ET1, ET2) |
+| `Capacity` | `own_staff_count`, `hired_staff_count`, `own_hours`, `hired_hours`, `overtime_hours`, `total_capacity`, `total_planned`, `deficit` | factory_capacity.csv | 8 |
+
+Total nodes: 57 (minimum required: 50)
+
+---
+
+### Relationship Types → What Creates Them
+
+| Relationship | Properties | Created From | Count |
+|--------------|------------|--------------|-------|
+| `(Project)-[:PRODUCES]->(Product)` | (none) | unique project + product pairs | 32 |
+| `(Project)-[:SCHEDULED_AT]->(Station)` | `week`, `planned_hours`, `actual_hours`, `completed_units`, `bop` | every row in production CSV | 68 |
+| `(Project)-[:RUNS_IN]->(Week)` | (none) | unique project + week pairs | 20 |
+| `(Project)-[:IN_ETAPP]->(Etapp)` | (none) | unique project + etapp pairs | 8 |
+| `(Product)-[:PROCESSED_AT]->(Station)` | (none) | unique product + station pairs | 16 |
+| `(Worker)-[:WORKS_AT]->(Station)` | (none) | `primary_station` field | 13 |
+| `(Worker)-[:CAN_COVER]->(Station)` | (none) | each value in `can_cover_stations` split by comma | 31 |
+| `(Worker)-[:AVAILABLE_IN]->(Week)` | `hours_per_week` | every worker × every week | 112 |
+| `(Week)-[:HAS_CAPACITY]->(Capacity)` | (none) | one row per week in capacity CSV | 8 |
+
+Total relationships: 308 (minimum required: 100)
+
+Relationship types: 9 (minimum required: 8)
+
+---
+
+### seed_graph.py — Key Rules
+
+1. Create uniqueness constraints before loading any data
+2. Always use `MERGE`, never `CREATE`, so the script is safe to run twice
+3. Load all nodes first, then relationships, so a relationship is never created before both end nodes exist
+4. Split `can_cover_stations` by comma before creating `CAN_COVER` relationships
+5. Store `week`, `planned_hours`, `actual_hours`, `completed_units`, `bop` as properties on `SCHEDULED_AT`, because the self-test variance query reads from these. Include the row's identifying fields (at least `week`) in the `MERGE` pattern, otherwise rows for the same project and station collapse into one relationship
+
+---
+
+### Dashboard — 4 Pages with Cypher Queries
+
+**Page 1 — Project Overview**
+
+Shows all 8 projects with total planned hours, total actual hours, variance %, and products involved. Variance color-coded: green under 5%, amber 5–10%, red above 10%.
+
+```cypher
+MATCH (p:Project)-[r:SCHEDULED_AT]->(:Station)
+// Aggregate hours before joining products; otherwise every SCHEDULED_AT row
+// is repeated once per product and the totals are inflated.
+WITH p,
+     sum(r.planned_hours) AS total_planned,
+     sum(r.actual_hours) AS total_actual
+OPTIONAL MATCH (p)-[:PRODUCES]->(prod:Product)
+RETURN
+    p.project_id AS id,
+    p.project_name AS project,
+    p.project_number AS number,
+    total_planned,
+    total_actual,
+    round((total_actual - total_planned)
+        / total_planned * 100) AS variance_pct,
+    collect(DISTINCT prod.product_type) AS products
+ORDER BY variance_pct DESC
+```
+
+**Page 2 — Station Load**
+
+Grouped Plotly bar chart: stations on the x-axis, planned vs actual hours on the y-axis, with a week dropdown to filter. Bars where actual > planned are shown in red.
+
+```cypher
+MATCH (p:Project)-[r:SCHEDULED_AT]->(s:Station)
+RETURN
+    s.station_code AS station_code,
+    s.station_name AS station,
+    r.week AS week,
+    sum(r.planned_hours) AS planned_hours,
+    sum(r.actual_hours) AS actual_hours
+ORDER BY station_code, week
+```
+
+**Page 3 — Capacity Tracker**
+
+Dual-line Plotly chart: total capacity vs total planned per week. The 5 deficit weeks (w1, w2, w4, w6, w7) are highlighted with a red background band and the deficit number on each point.
+
+```cypher
+MATCH (w:Week)-[:HAS_CAPACITY]->(c:Capacity)
+RETURN
+    w.week AS week,
+    c.own_hours AS own_hours,
+    c.hired_hours AS hired_hours,
+    c.overtime_hours AS overtime_hours,
+    c.total_capacity AS total_capacity,
+    c.total_planned AS total_planned,
+    c.deficit AS deficit
+ORDER BY week
+```
+
+**Page 4 — Worker Coverage Matrix**
+
+A heatmap table: rows are workers, columns are stations, and each cell shows WORKS_AT, CAN_COVER, or empty. Stations covered by only one worker are flagged red as single points of failure. In this dataset, Station 016 is the highest-risk station: only Per Hansen and Victor Elm cover it, so it drops to a single covering worker whenever either of them is away.
+
+```cypher
+MATCH (w:Worker)-[r:WORKS_AT|CAN_COVER]->(s:Station)
+RETURN
+    w.name AS worker,
+    w.role AS role,
+    s.station_code AS station_code,
+    s.station_name AS station,
+    type(r) AS coverage_type
+ORDER BY w.name, s.station_code
+```
+
+SPOF detection query:
+
+```cypher
+MATCH (w:Worker)-[:WORKS_AT|CAN_COVER]->(s:Station)
+WITH s, collect(DISTINCT w.name) AS workers, count(DISTINCT w) AS coverage
+RETURN
+    s.station_code AS station_code,
+    s.station_name AS station,
+    workers AS covering_workers,
+    coverage AS worker_count,
+    CASE WHEN coverage = 1
+        THEN "SINGLE POINT OF FAILURE"
+        ELSE "OK" END AS risk_status
+ORDER BY coverage ASC
+```
+
+---
+
+### Self-Test — Adapted Variance Query
+
+The default self-test code uses `p.name` and `s.name`, which do not exist in my schema. I will use this instead so Check 6 returns results correctly:
+
+```cypher
+MATCH (p:Project)-[r:SCHEDULED_AT]->(s:Station)
+WHERE r.actual_hours > r.planned_hours * 1.1
+RETURN
+    p.project_name AS project,
+    s.station_name AS station,
+    r.planned_hours AS planned,
+    r.actual_hours AS actual
+LIMIT 10
+```
+
+This returns results from Stations 018, 016, and 015 where actual hours exceed planned by more than 10% across multiple projects.
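
The same 10% check can be mirrored in Python, for example to sanity-check the CSV before seeding the graph. A minimal sketch, assuming each row is a dict with `planned_hours` and `actual_hours` keys; `over_threshold` is a hypothetical helper and the sample rows are invented for illustration, not read from `factory_production.csv`:

```python
def over_threshold(rows, threshold=0.10):
    """Return rows whose actual hours exceed planned hours by more than `threshold`."""
    flagged = []
    for row in rows:
        planned = row["planned_hours"]
        actual = row["actual_hours"]
        # Skip rows with no planned hours so the ratio stays meaningful.
        if planned > 0 and actual > planned * (1 + threshold):
            flagged.append(row)
    return flagged


# Invented sample rows, for illustration only.
sample_rows = [
    {"project": "X", "station": "016", "planned_hours": 40.0, "actual_hours": 46.0},
    {"project": "Y", "station": "018", "planned_hours": 40.0, "actual_hours": 43.0},
    {"project": "Z", "station": "015", "planned_hours": 0.0, "actual_hours": 5.0},
]

flagged = over_threshold(sample_rows)
print([row["project"] for row in flagged])
```

Only the first sample row is more than 10% over plan; the second is within tolerance and the third is skipped because it has no planned hours.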
+
+--------------LEVEL 5 COMPLETED--------------
\ No newline at end of file
diff --git a/submissions/lavanya-parashar/level5/schema.md b/submissions/lavanya-parashar/level5/schema.md
new file mode 100644
index 000000000..d4c390a06
--- /dev/null
+++ b/submissions/lavanya-parashar/level5/schema.md
@@ -0,0 +1,54 @@
+---
+config:
+  layout: dagre
+  look: classic
+---
+flowchart TB
+    Project["Project
+        ───────────
+        project_id
+        project_number
+        project_name"] -- PRODUCES --> Product["Product
+        ───────────
+        product_type
+        unit
+        quantity
+        unit_factor"]
+    Project -- SCHEDULED_AT {week, planned_hours, actual_hours, completed_units, bop} --> Station["Station
+        ───────────
+        station_code
+        station_name"]
+    Project -- RUNS_IN --> Week["Week
+        ───────────
+        week"]
+    Project -- IN_ETAPP --> Etapp["Etapp
+        ───────────
+        etapp"]
+    Product -. PROCESSED_AT .-> Station
+    Worker["Worker
+        ───────────
+        worker_id
+        name
+        role
+        hours_per_week
+        type"] -- WORKS_AT --> Station
+    Worker -. CAN_COVER .-> Station
+    Worker -- AVAILABLE_IN {hours_per_week} --> Week
+    Week -- HAS_CAPACITY --> Capacity["Capacity
+        ───────────
+        own_staff_count
+        hired_staff_count
+        own_hours
+        hired_hours
+        overtime_hours
+        total_capacity
+        total_planned
+        deficit"]
+
+    style Project fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
+    style Product fill:#dcfce7,stroke:#16a34a,color:#14532d
+    style Station fill:#fef9c3,stroke:#ca8a04,color:#713f12
+    style Week fill:#ede9fe,stroke:#7c3aed,color:#3b0764
+    style Etapp fill:#ffedd5,stroke:#ea580c,color:#7c2d12
+    style Worker fill:#fce7f3,stroke:#db2777,color:#831843
+    style Capacity fill:#ccfbf1,stroke:#0d9488,color:#134e4a
\ No newline at end of file