Add generalized (relations …) CSV loading construct #259

Merged
hbarthels merged 10 commits into main from hb-generalize-csv-loading
Jun 23, 2026
@hbarthels

@hbarthels hbarthels commented Jun 12, 2026

Contributor

Summary

Adds a new (relations …) construct on CSVData, alongside the existing (columns …) (GNF) form, to support more general CSV loading: a shared set of key columns (or the special METADATA$KEY) plus one or more output relations, each with its own (possibly empty) value columns, with optional CDC grouping into (inserts …)/(deletes …).

This lets a load:

  • put several columns into a single relation, and
  • choose its own key column(s) instead of the implicit row id.

The legacy (columns …) form is untouched and remains fully supported (the two are mutually exclusive on a given CSVData).

Some examples:

;; No CDC. Produces a binary `edge` relation with keys `(src, dst)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

;; No CDC. Produces an arity-4 `edge` relation with weights and labels. Keys: `(src, dst)`. Values: `(weight, label)`.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge (column "weight" FLOAT) (column "label" STRING))))

;; CDC. Produces two output relations:
;; - `edge_insertions`, keys `(src, dst)`, values `(weight, label)`. Contains only insertions.
;; - `edge_deletions`, keys `(src, dst)`, values `()`. Contains only deletions.
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

;; No CDC. Produces two output relations:
;; - `weights`, keys `(src, dst)`, values `(weight)`
;; - `labels`, keys `(src, dst)`, values `(label)`
(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

;; CDC GNF data load.
(relations
  (keys
    (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))

Changes

  • proto (logic.proto): NamedColumn, OutputRelation, Relations messages; optional Relations relations on CSVData.
  • grammar (grammar.y): relations / relation_keys / output_relation / named_column rules. relation_body returns a concrete Relations (not a tuple) so the Go parser stays type-stable.
  • Regenerated Python / Julia / Go parsers, pretty-printers, and protobuf bindings.
  • Julia SDK: global_ids and ==/hash/isequal extended for the new messages.
  • Fixtures: relations_edge_binary, relations_edge_arity4, relations_split, relations_cdc (+ regenerated bin/pretty/pretty_debug snapshots).

🤖 Generated with Claude Code

hbarthels and others added 4 commits June 13, 2026 00:45
Adds a new `(relations …)` construct on `CSVData` alongside the legacy
`(columns …)` form: a shared set of key columns (or the special
`METADATA$KEY`) plus one or more output relations, each with its own
(possibly empty) value columns, with optional CDC `(inserts …)`/`(deletes …)`
grouping.

- proto: `NamedColumn` / `OutputRelation` / `Relations` messages + an optional
  `Relations relations` field on `CSVData` (mutually exclusive with `columns`)
- grammar: `relations` / `relation_keys` / `output_relation` / `named_column`
  rules; `relation_body` returns a concrete `Relations` to keep the Go parser
  type-stable
- regenerated Python / Julia / Go parsers, pretty-printers, and protobuf
  bindings; `global_ids` + equality extended for the new messages (Julia SDK)
- `.lqp` fixtures (binary edge, arity-4 edge, two-relation split, CDC) +
  regenerated bin / pretty / pretty_debug snapshots

`make test` green across Python, Julia, and Go.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Align the CDC relations fixture with the rest of the edge-themed
suite: use s3://bucket/edges.csv (matching relations_edge_binary /
relations_edge_arity4) and rename the delta relations to :weight_ins /
:weight_del, since they carry a weight column. Regenerated the .bin
and pretty/pretty_debug snapshot goldens (name-derived relation hashes
updated accordingly).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	sdks/go/src/parser.go
#	sdks/julia/LogicalQueryProtocol.jl/src/parser.jl
#	sdks/python/src/lqp/gen/parser.py
Regenerate parsers, protobuf bindings, pretty printers, and test
fixtures (.bin + pretty/pretty_debug snapshots) from the post-merge
grammar and protos. Parsers changed; protobuf bindings and printers
were already in sync. Test artifacts pick up master's new
:csv_compression default ("auto" -> ""). All Python/Go/Julia tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hbarthels hbarthels marked this pull request as ready for review June 15, 2026 15:24
@hbarthels hbarthels requested a review from comnik June 15, 2026 15:24
Comment thread sdks/julia/LogicalQueryProtocol.jl/src/equality.jl
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
Comment on lines +291 to +314
// A single named CSV column with its type. Used to describe both shared key columns and
// per-relation value columns in the generalized `Relations` loading construct.
message NamedColumn {
string name = 1; // CSV column name (e.g. "src"); special name "METADATA$KEY" => derived hash
Type type = 2; // Column type
}

// One output relation: the shared keys plus this relation's own (possibly empty) value columns.
message OutputRelation {
RelationId target_id = 1; // Output relation path
repeated NamedColumn values = 2; // Value columns for this relation (may be empty)
}

// Generalized CSV loading: a shared set of key columns and one or more output relations.
// CDC vs non-CDC is implied by which group is populated:
// - `relations` populated => non-CDC outputs
// - `inserts`/`deletes` => CDC insert/delete groups
message Relations {
repeated NamedColumn keys = 1; // Shared key columns (name "METADATA$KEY" => derived hash)
repeated OutputRelation relations = 2; // Non-CDC outputs
repeated OutputRelation inserts = 3; // CDC insert group
repeated OutputRelation deletes = 4; // CDC delete group
}

Collaborator

I was wondering if relations was maybe too generic to burn on this use case. Seems like you arrived at a similar conclusion, given that you went with OutputRelation. But Output also has a different meaning in LQP already. How about TargetRelations and TargetRelation?

Let's also keep this generic and not tied to CSV specifically, I assume we can reuse this for other types of external data. I don't think anything you have above is specific to CSV, except for the comments.

Contributor

+1 for Target

Contributor Author

Sure.

Yeah, the (relations ...) construct is not specific to CSV. I was planning to use the same for Iceberg. I will update the comments.

@hbarthels hbarthels Jun 16, 2026

Contributor Author

Not sure how I missed that, but I just noticed that the outputs keyword that I used in the examples above is missing from the grammar. 🤦‍♂️ So at the moment it's

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (relation :edge))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (relation :edge)))

and

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (inserts
    (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
  (deletes
    (relation :edge_deletions)))

instead of

(relations
  (keys
    (column "src" INT)
    (column "dst" INT))
  (outputs
    (inserts
      (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes
      (relation :edge_deletions))))

Do you have a preference between leaving it as it is, adding (outputs ...), or maybe using (targets ...)?
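
To make the difference between the two spellings concrete, here is a minimal Python sketch (not the generated LQP parser). It reads an s-expression into nested lists and rewrites the `(outputs …)` spelling from the examples into the flat form the grammar accepts at this point. The helper names `parse` and `strip_outputs` are hypothetical, and the reader deliberately ignores the distinction between strings and symbols.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare symbols such as :edge or INT.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse(text):
    """Read one s-expression into nested Python lists (hypothetical helper)."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        # Simplification: drop quotes, so strings and symbols look alike.
        return tok.strip('"')

    expr = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return expr


def strip_outputs(expr):
    """Splice the children of any (outputs ...) wrapper into (relations ...)."""
    head, *children = expr
    if head != "relations":
        raise ValueError("expected a (relations ...) form")
    flat = [head]
    for child in children:
        if child[0] == "outputs":
            flat.extend(child[1:])
        else:
            flat.append(child)
    return flat


WITH_OUTPUTS = '(relations (keys (column "src" INT) (column "dst" INT)) (outputs (relation :edge)))'
FLAT = '(relations (keys (column "src" INT) (column "dst" INT)) (relation :edge))'

flattened = strip_outputs(parse(WITH_OUTPUTS))
```

Under these assumptions, `flattened` is structurally identical to `parse(FLAT)`, which suggests the wrapper carries no extra information and is purely a readability choice.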

@davidwzhao

Contributor

Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

hbarthels and others added 3 commits June 16, 2026 17:15
Rename the generalized CSV-loading proto messages Relations ->
TargetRelations and OutputRelation -> TargetRelation, plus the matching
grammar nonterminals (relations -> target_relations, output_relation ->
target_relation) and all comments. The LQP s-expression syntax is
unchanged: the (relations …)/(relation …)/(inserts …)/(deletes …)
keywords, proto field names, and wire format are all preserved (no .bin
or pretty-snapshot changes). Regenerated parsers, printers, and protobuf
bindings for all three SDKs; updated hand-written Julia equality.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generalized loading construct (NamedColumn / TargetRelation /
TargetRelations) is not CSV-specific — it will be used for other input
types too. Remove "CSV" from those messages' comments; CSV wording stays
on the CSV-specific CSVData message. Regenerated the Go binding (the
only generated SDK that embeds proto comments).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generalized-loading SDK types added in this branch had no coverage in
equality_tests.jl. Add testitems for NamedColumn, TargetRelation, and
TargetRelations following the existing equality/inequality/hash/
reflexivity/symmetry/transitivity pattern, including the non-CDC
(relations) and CDC (inserts/deletes) groupings of TargetRelations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
Comment on lines +306 to +307
// - `relations` populated => non-CDC outputs
// - `inserts`/`deletes` => CDC insert/delete groups

Contributor

Does this mean that the relations and the inserts/deletes portion of TargetRelations are mutually exclusive? e.g. either one or the other is populated? If so, maybe we can express this in a OneOf of CDC/non-CDC?

Contributor Author

Yeah, they should be mutually exclusive. I will look into that.

Contributor Author

I changed it to a OneOf.

@hbarthels

Contributor Author

> Could we get some more concrete examples, e.g., for each load config, what the CSV file looks like and what are the shapes of the resulting relations?

Sure. I asked Claude to generate a few examples and added some comments:

1. Binary edge (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge)))

CSV

src,dst
1,2
1,3
4,5

Resulting relations

  • :edge — key (src::Int, dst::Int), no values (arity 2)

Missing values are not allowed in this case, and we assume that the keys are unique.
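
As a rough illustration of example 1 (a sketch in plain Python, not the actual loader), the CSV above can be turned into a keys-only set of tuples. The function name `load_binary_edge` is hypothetical. The real loader only assumes unique keys; this sketch checks them explicitly so that the assumption is visible.

```python
import csv
import io


def load_binary_edge(csv_text):
    """Return the (src, dst) tuples of a keys-only `edge` relation."""
    edge = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        src, dst = row["src"], row["dst"]
        # Key columns may not be missing in this configuration.
        if not src or not dst:
            raise ValueError(f"missing key value in row {row}")
        key = (int(src), int(dst))
        # Assumption made explicit: keys are unique.
        if key in edge:
            raise ValueError(f"duplicate key {key}")
        edge.add(key)
    return edge


CSV_TEXT = "src,dst\n1,2\n1,3\n4,5\n"
edge = load_binary_edge(CSV_TEXT)
```

With the three rows shown, `edge` holds the tuples `(1, 2)`, `(1, 3)` and `(4, 5)`.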

2. Arity-4 edge with values (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs (relation :edge (column "weight" FLOAT) (column "label" STRING))))

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

  • :edge — key (src::Int, dst::Int) → value (weight::Float64, label::String) (arity 4)

In this case, the value columns of each row must be either all present or all missing. Partially populated values are not allowed.
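
The all-or-nothing rule can be sketched as a small check in Python. This is illustrative only: `classify_values` is a hypothetical helper, and what the loader does with a row whose values are all missing is not specified here, so the sketch only classifies the row.

```python
def classify_values(row, value_columns):
    """Classify a row's value columns as 'present' or 'missing'.

    Raises ValueError when only some of the value columns are populated.
    An empty string (or absent column) is treated as a missing value.
    """
    populated = [bool(row.get(col)) for col in value_columns]
    if all(populated):
        return "present"
    if not any(populated):
        return "missing"
    raise ValueError(f"partially populated value columns in row {row}")


VALUE_COLUMNS = ["weight", "label"]
full = classify_values(
    {"src": "1", "dst": "2", "weight": "0.5", "label": "road"}, VALUE_COLUMNS
)
empty = classify_values(
    {"src": "1", "dst": "3", "weight": "", "label": ""}, VALUE_COLUMNS
)
```

A row such as `1,4,2.5,` (weight set, label empty) would raise, because it is neither fully present nor fully missing.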

3. CDC with a custom key — insertions + keys-only deletions

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (inserts (relation :edge_insertions (column "weight" FLOAT) (column "label" STRING)))
    (deletes (relation :edge_deletions))))

A CDC CSV carries the metadata columns METADATA$ACTION / METADATA$ISUPDATE / METADATA$ROW_ID. METADATA$ACTION routes each row to the insert group or the delete group. Because the deletes relation declares no value columns, DELETE rows only need their key columns populated (the value columns are ignored / may be empty):

CSV

src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
1,2,0.5,road,INSERT,false,00000000000000000000000000000001
1,3,1.5,rail,INSERT,false,00000000000000000000000000000002
4,5,,,DELETE,false,00000000000000000000000000000003

Resulting relations

  • :edge_insertions — key (src, dst) → value (weight, label), only INSERT rows:
    edge_insertions(1, 2, 0.5, "road"), edge_insertions(1, 3, 1.5, "rail")
  • :edge_deletions — key (src, dst), no values, only DELETE rows (the key identifies the row to delete):
    edge_deletions(4, 5)
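
A minimal Python sketch of the routing described in example 3, assuming `METADATA$ACTION` takes only the values `INSERT` and `DELETE` as in the CSV above. The function name `route_cdc` is hypothetical and this is not the real loader.

```python
import csv
import io

CSV_TEXT = "\n".join([
    "src,dst,weight,label,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID",
    "1,2,0.5,road,INSERT,false,00000000000000000000000000000001",
    "1,3,1.5,rail,INSERT,false,00000000000000000000000000000002",
    "4,5,,,DELETE,false,00000000000000000000000000000003",
])


def route_cdc(csv_text):
    """Split CDC rows into (edge_insertions, edge_deletions)."""
    edge_insertions, edge_deletions = set(), set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["src"]), int(row["dst"]))
        action = row["METADATA$ACTION"]
        if action == "INSERT":
            # Insert relation: shared key plus its own value columns.
            edge_insertions.add(key + (float(row["weight"]), row["label"]))
        elif action == "DELETE":
            # Delete relation declares no values, so only the key is read.
            edge_deletions.add(key)
        else:
            raise ValueError(f"unexpected METADATA$ACTION {action!r}")
    return edge_insertions, edge_deletions


edge_insertions, edge_deletions = route_cdc(CSV_TEXT)
```

Note that the DELETE row's empty `weight` and `label` fields are never parsed, which is why they can be left blank.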

4. Split into two relations sharing a key (no CDC)

(relations
  (keys (column "src" INT) (column "dst" INT))
  (outputs
    (relation :weights (column "weight" FLOAT))
    (relation :labels (column "label" STRING))))

One CSV row populates both relations; each picks out its own value column under the shared key.

CSV

src,dst,weight,label
1,2,0.5,road
1,3,1.5,rail

Resulting relations

  • :weights — key (src, dst) → value (weight):
    weights(1, 2, 0.5), weights(1, 3, 1.5)
  • :labels — key (src, dst) → value (label):
    labels(1, 2, "road"), labels(1, 3, "rail")

In this case, it's fine for weight or label to be missing.
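
The split in example 4 can be sketched as follows, under the assumption (not confirmed in this thread) that a missing value simply means the corresponding relation gets no tuple for that row. `split_rows` and the third input row are hypothetical, added to show a missing label.

```python
def split_rows(rows):
    """Populate `weights` and `labels` from rows sharing the (src, dst) key."""
    weights, labels = set(), set()
    for row in rows:
        key = (int(row["src"]), int(row["dst"]))
        # Each relation looks only at its own value column.
        if row.get("weight"):
            weights.add(key + (float(row["weight"]),))
        if row.get("label"):
            labels.add(key + (row["label"],))
    return weights, labels


ROWS = [
    {"src": "1", "dst": "2", "weight": "0.5", "label": "road"},
    {"src": "1", "dst": "3", "weight": "1.5", "label": "rail"},
    # Hypothetical extra row with a missing label.
    {"src": "4", "dst": "5", "weight": "2.0", "label": ""},
]
weights, labels = split_rows(ROWS)
```

Here the extra row contributes to `weights` but not to `labels`, while the first two rows match the results listed above.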

5. CDC, row-hash key (METADATA$KEY) — the key-set pattern

(relations
  (keys (column "METADATA$KEY" UINT128))
  (outputs
    (inserts
      (relation :aaa (column "aaa" INT))
      (relation :bbb (column "bbb" FLOAT))
      (relation :meta_key_insert))
    (deletes
      (relation :meta_key_delete))))

METADATA$KEY is not a literal CSV column — it's a special key name meaning "derive a UINT128 row hash from the CSV's METADATA$ROW_ID column." That hash h is the shared key for every output relation. INSERT rows feed all three insert relations; DELETE rows feed the delete relation. :meta_key_insert / :meta_key_delete declare no value columns, so they are value-less key sets (just the row hash).

CSV

aaa,bbb,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID
10,0.5,INSERT,false,00000000000000000000000000000001
20,1.5,INSERT,false,00000000000000000000000000000002
,,DELETE,false,00000000000000000000000000000003

Let h1, h2, h3 be the UINT128 hashes of row-ids …0001, …0002, …0003.

Resulting relations (all keyed by the row hash h::UInt128)

  • :aaa — key (h) → value (aaa::Int), INSERT rows only:
    aaa(h1, 10), aaa(h2, 20)
  • :bbb — key (h) → value (bbb::Float64), INSERT rows only:
    bbb(h1, 0.5), bbb(h2, 1.5)
  • :meta_key_insert — key (h), no values; the set of inserted row hashes:
    meta_key_insert(h1), meta_key_insert(h2)
  • :meta_key_delete — key (h), no values; the set of deleted row hashes:
    meta_key_delete(h3)
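
The key-set pattern of example 5 can be sketched in Python as below. The actual row-hash derivation is not described in this thread, so `standin_row_hash` is a loudly labeled placeholder that just reads the row id as a hexadecimal integer; only the shape of the resulting relations is meant to be illustrative.

```python
import csv
import io

CSV_TEXT = "\n".join([
    "aaa,bbb,METADATA$ACTION,METADATA$ISUPDATE,METADATA$ROW_ID",
    "10,0.5,INSERT,false,00000000000000000000000000000001",
    "20,1.5,INSERT,false,00000000000000000000000000000002",
    ",,DELETE,false,00000000000000000000000000000003",
])


def standin_row_hash(row_id):
    """Placeholder, NOT the real derivation: parse the row id as hex."""
    return int(row_id, 16)


def load_key_sets(csv_text):
    """Build the four relations of example 5, all keyed by the row hash."""
    out = {"aaa": set(), "bbb": set(),
           "meta_key_insert": set(), "meta_key_delete": set()}
    for row in csv.DictReader(io.StringIO(csv_text)):
        h = standin_row_hash(row["METADATA$ROW_ID"])
        action = row["METADATA$ACTION"]
        if action == "INSERT":
            out["aaa"].add((h, int(row["aaa"])))
            out["bbb"].add((h, float(row["bbb"])))
            out["meta_key_insert"].add((h,))  # value-less key set
        elif action == "DELETE":
            out["meta_key_delete"].add((h,))  # value-less key set
        else:
            raise ValueError(f"unexpected METADATA$ACTION {action!r}")
    return out


relations = load_key_sets(CSV_TEXT)
```

With the placeholder hash, h1, h2 and h3 come out as 1, 2 and 3, so the sets mirror the listing above one to one.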

Replace the flat relations/inserts/deletes fields on TargetRelations
with a `oneof body { PlainTargets plain; CdcTargets cdc; }`, making the
mutually-exclusive plain (non-CDC) and CDC modes explicit instead of
inferred from which repeated field happens to be populated. PlainTargets
wraps `targets`; CdcTargets wraps `inserts`/`deletes` (oneof can't hold
repeated fields directly, hence the wrapper messages).

The LQP s-expression syntax is unchanged — only the grammar's
construct/deconstruct helpers and the deconstruct guard (now switching on
the oneof case via has_proto_field) change, so pretty/pretty_debug
snapshots are untouched; only .bin wire encodings shift. Regenerated all
three SDKs; updated hand-written Julia equality (PlainTargets/CdcTargets
+ oneof-based TargetRelations), properties global_ids, and equality
tests. All Python/Go/Julia suites pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
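
For readers less familiar with `oneof`, the shape described in the commit above can be approximated in Python with a union of two wrapper types. This is a hedged sketch using dataclasses, not the generated protobuf bindings: the message names follow the commit message, while the string-typed `type` and `target_id` fields and the `is_cdc` helper are simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class NamedColumn:
    name: str
    type: str  # simplification: the proto uses a Type message


@dataclass(frozen=True)
class TargetRelation:
    target_id: str  # simplification: the proto uses a RelationId
    values: Tuple[NamedColumn, ...] = ()


@dataclass(frozen=True)
class PlainTargets:
    targets: Tuple[TargetRelation, ...]


@dataclass(frozen=True)
class CdcTargets:
    inserts: Tuple[TargetRelation, ...]
    deletes: Tuple[TargetRelation, ...]


@dataclass(frozen=True)
class TargetRelations:
    keys: Tuple[NamedColumn, ...]
    # Exactly one mode is set, mirroring `oneof body { plain; cdc; }`.
    body: Union[PlainTargets, CdcTargets]


def is_cdc(target_relations):
    """Hypothetical helper: the mode is read from the body's type."""
    return isinstance(target_relations.body, CdcTargets)


KEYS = (NamedColumn("src", "INT"), NamedColumn("dst", "INT"))
plain = TargetRelations(KEYS, PlainTargets((TargetRelation("edge"),)))
cdc = TargetRelations(
    KEYS,
    CdcTargets(
        inserts=(TargetRelation("edge_insertions",
                                (NamedColumn("weight", "FLOAT"),)),),
        deletes=(TargetRelation("edge_deletions"),),
    ),
)
```

The point of the design is visible in `is_cdc`: the mode is decided by which variant is present, rather than inferred from which repeated field happens to be non-empty.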
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
Comment thread proto/relationalai/lqp/v1/logic.proto Outdated
hbarthels and others added 2 commits June 23, 2026 17:16
Use the conventional all-caps CDC acronym for the message name. The
oneof field stays `cdc`; only the message type name changes, so there is
no wire-format or s-expression syntax change (no .bin/snapshot diffs).
Regenerated all three SDKs and updated the hand-written Julia equality +
equality tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On the LQP side a key column name is just a string — there is no special
handling of METADATA$KEY (the row-hash derivation happens downstream in
the loader). Remove the "special name METADATA$KEY => derived hash"
wording from the NamedColumn.name and TargetRelations.keys proto comments
(regenerated the Go binding), and reword the analogous comment in the
relations_cdc fixture. The METADATA$KEY column-name strings in fixtures
are real data and stay.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hbarthels hbarthels merged commit b16841e into main Jun 23, 2026
5 checks passed
@hbarthels hbarthels deleted the hb-generalize-csv-loading branch June 23, 2026 15:27
@hbarthels hbarthels mentioned this pull request Jun 23, 2026
hbarthels added a commit that referenced this pull request Jun 23, 2026
Bumps the Julia and Python SDK versions to 0.5.4. The
`(relations …)` CSV loading construct (#259) landed on main after
the v0.5.3 tag without a version bump, so main currently reports
0.5.3 while containing changes beyond that release. This cuts a
distinct version so the content is traceable to a tagged release.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
4 participants