lancedb · prrao87 · May 12, 2026
diff --git a/Makefile b/Makefile
@@ -1,8 +1,9 @@
 # Paths
 SCRIPT := scripts/mdx_snippets_gen.py
+HF_SYNC_SCRIPT := scripts/sync_hf_datasets.py
 
 # uv run automatically handles virtualenv, so no activation needed
-.PHONY: py ts rs snippets
+.PHONY: py ts rs snippets hf-sync
 
 # Generate Python MDX snippets
 py:
@@ -19,3 +20,9 @@ rs:
 # Convenience: generate all snippets
 snippets: py ts rs
 
+# Sync Lance dataset cards from lance-format/lance-huggingface into docs/datasets/.
+# Regenerates per-dataset MDX pages, the landing-page card grid, and the
+# Datasets tab in docs.json based on scripts/hf_datasets.yaml.
+hf-sync:
+	@uv run $(HF_SYNC_SCRIPT)
+
diff --git a/README.md b/README.md
@@ -67,3 +67,45 @@ code that's been tested (per recent LanceDB releases) in the hands of users.
 > As far as possible, do not add code snippets manually inside triple-backticks! Write the tests for
 > the required language in `tests/*` directory, then generate the snippets programmatically via the Makefile
 > commands.
+
+## Sync Hugging Face dataset pages
+
+The `Datasets` tab is populated from [`lance-format/lance-huggingface`](https://github.com/lance-format/lance-huggingface),
+the master repository where each Lance dataset published under the [`lance-format`](https://huggingface.co/lance-format)
+Hugging Face organization has its own directory with an `HF_DATASET_CARD.md`. That same file is what gets pushed to
+the Hub as the dataset's `README.md` via the `hf` CLI, so the GitHub repo is the single source of truth for the
+content of every dataset card.
+
+To avoid maintaining the same content in two places, the per-dataset MDX pages under `docs/datasets/` are
+generated from those upstream cards via `scripts/sync_hf_datasets.py`. The script:
+
+1. Reads `scripts/hf_datasets.yaml`, which lists every dataset to publish along with its upstream directory name,
+   URL slug, HF Hub repo, and human-readable title.
+2. Fetches each `HF_DATASET_CARD.md` from `lance-format/lance-huggingface` on GitHub.
+3. Rewrites the frontmatter for Mintlify (sets `title`, `sidebarTitle`, `description`), strips the upstream H1,
+   injects a "View on Hugging Face" card at the top, and sanitizes known MDX hazards (bibtex citations outside
+   code fences, literal `<>` in prose).
+4. Writes `docs/datasets/<slug>.mdx`, regenerates the card grid in `docs/datasets/index.mdx` between the
+   `HF_SYNC:START` / `HF_SYNC:END` markers, and updates the `Datasets` tab in `docs/docs.json` to keep the
+   sidebar in sync.
+
+Run it from the repo root:
+
+```bash
+make hf-sync
+```
+
+### Adding a new dataset
+
+1. Author the new dataset's `HF_DATASET_CARD.md` upstream in `lance-format/lance-huggingface` (and push it to the
+   Hub as usual).
+2. Add a single line for the dataset under the appropriate category in `scripts/hf_datasets.yaml`. The four
+   fields (`dir`, `slug`, `hf`, `title`) are explicit because the GitHub directory name, the HF Hub repo slug,
+   and the desired URL slug don't follow a derivable convention.
+3. Run `make hf-sync`. The script will fetch the new card, generate `docs/datasets/<slug>.mdx`, refresh the
+   landing-page card grid, and add the new page to the `Datasets` tab in `docs/docs.json`.
+4. Preview locally with `mint dev` and commit the changes (the MDX page, the regenerated `index.mdx`, the
+   updated `docs.json`, and the new yaml entry).
+
+If you remove a dataset from the yaml, the next `make hf-sync` will delete its MDX file and drop the sidebar
+entry. The script hard-fails on any fetch error — partial regeneration would be worse than a clear error.
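
Note on the README hunk above: the marker-based regeneration of `docs/datasets/index.mdx` and the cleanup of removed datasets could look roughly like the sketch below. This is not the code in `scripts/sync_hf_datasets.py`, which this diff does not show. The function names `replace_between_markers` and `stale_pages` are hypothetical, and the sketch assumes each marker sits on its own line (for example inside an MDX comment).

```python
START, END = "HF_SYNC:START", "HF_SYNC:END"


def replace_between_markers(text: str, new_body: str) -> str:
    """Swap the lines between the two marker lines, keeping the markers themselves.

    Assumption: each marker appears on a line of its own; the exact comment
    syntax around it does not matter because we only test for the substring.
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if START in line), None)
    end = next((i for i, line in enumerate(lines) if END in line), None)
    if start is None or end is None or end <= start:
        # Fail loudly rather than write a half-regenerated page.
        raise ValueError("HF_SYNC markers missing or out of order")
    kept_head = lines[: start + 1]   # everything up to and including START
    kept_tail = lines[end:]          # END marker and everything after it
    return "\n".join(kept_head + new_body.splitlines() + kept_tail) + "\n"


def stale_pages(existing_slugs, manifest_slugs):
    """Slugs that have an MDX page on disk but no longer appear in the manifest."""
    return sorted(set(existing_slugs) - set(manifest_slugs))
```

Because the markers are preserved, running the replacement twice with the same body yields the same file, which is the property that makes `make hf-sync` safe to re-run.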
diff --git a/docs/datasets/ade20k.mdx b/docs/datasets/ade20k.mdx
@@ -0,0 +1,169 @@
+---
+title: "ADE20K"
+sidebarTitle: "ADE20K"
+description: "Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K) — 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline."
+---
+
+<Card
+  title="View on Hugging Face"
+  icon="/static/assets/logo/huggingface-logo.svg"
+  href="https://huggingface.co/datasets/lance-format/ade20k-lance"
+>
+Source dataset card and downloadable files for `lance-format/ade20k-lance`.
+</Card>
+
+Lance-formatted version of the full [ADE20K scene parsing benchmark](https://groups.csail.mit.edu/vision/datasets/ADE20K/) (sourced from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K)) — **27,574 scene images** with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline.
+
+## Splits
+
+| Split | Rows |
+|-------|------|
+| `train.lance`      | 25,574 |
+| `validation.lance` | 2,000 |
+
+## Schema
+
+| Column | Type | Notes |
+|---|---|---|
+| `id` | `int64` | Row index within split |
+| `image` | `large_binary` | Inline JPEG bytes |
+| `segmentation` | `large_binary` | Inline PNG bytes — semantic segmentation map (RGB encoding per ADE20K spec) |
+| `instance` | `large_binary?` | Inline PNG bytes — instance map; null if not provided |
+| `filename` | `string` | ADE20K relative filename |
+| `scene` | `list<string>` | Scene labels (e.g. `["bathroom"]`) |
+| `object_names` | `list<string>` | Names of all annotated objects (one entry per polygon) |
+| `objects_present` | `list<string>` | Deduped object names — feeds the `LABEL_LIST` index |
+| `num_objects` | `int32` | Number of annotated objects |
+| `image_emb` | `fixed_size_list<float32, 512>` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) |
+
+## Pre-built indices
+
+- `IVF_PQ` on `image_emb` — `metric=cosine`
+- `BTREE` on `num_objects`
+- `LABEL_LIST` on `objects_present` — supports `array_has_any` / `array_has_all`
+
+## Quick start
+
+```python
+import lance
+
+ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
+print(ds.count_rows(), ds.schema.names, ds.list_indices())
+```
+
+## Load with LanceDB
+
+These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
+tbl = db.open_table("validation")
+print(f"LanceDB table opened with {len(tbl)} scene images")
+```
+
+## Read an image with its segmentation
+
+```python
+import io
+import lance
+from PIL import Image
+
+ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
+row = ds.take([0], columns=["image", "segmentation", "scene", "objects_present"]).to_pylist()[0]
+
+Image.open(io.BytesIO(row["image"])).save("img.jpg")
+Image.open(io.BytesIO(row["segmentation"])).save("seg.png")
+print("scene:", row["scene"])
+print("objects:", row["objects_present"][:10])
+```
+
+## Filter by scene / objects
+
+```python
+import lance
+ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
+
+# Scenes annotated with both a bed and a window.
+rows = ds.scanner(
+    filter="array_has_all(objects_present, ['bed', 'window'])",
+    columns=["filename", "scene"],
+    limit=10,
+).to_table().to_pylist()
+```
+
+### Filter with LanceDB
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
+tbl = db.open_table("validation")
+
+rows = (
+    tbl.search()
+    .where("array_has_all(objects_present, ['bed', 'window'])")
+    .select(["filename", "scene"])
+    .limit(10)
+    .to_list()
+)
+```
+
+## Visual similarity search
+
+```python
+import lance
+import pyarrow as pa
+
+ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
+emb_field = ds.schema.field("image_emb")
+ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"]
+query = pa.array([ref], type=emb_field.type)
+
+neighbors = ds.scanner(
+    nearest={"column": "image_emb", "q": query[0], "k": 5},
+    columns=["filename", "scene"],
+).to_table().to_pylist()
+```
+
+### LanceDB visual similarity search
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
+tbl = db.open_table("validation")
+
+ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0]
+query_embedding = ref["image_emb"]
+
+results = (
+    tbl.search(query_embedding)
+    .metric("cosine")
+    .select(["filename", "scene"])
+    .limit(5)
+    .to_list()
+)
+```
+
+## Why Lance?
+
+- One dataset for images + segmentation + instance + scene + objects + embeddings + indices — no folder of paired files.
+- On-disk vector and label-list indices live next to the data, so search works on local copies and on the Hub.
+- Schema evolution: add columns (panoptic ids, fresh embeddings, model predictions) without rewriting the data.
+
+## Source & license
+
+Converted from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K). ADE20K is released under the [BSD 3-Clause license](https://github.com/CSAILVision/ADE20K/blob/main/LICENSE) by the MIT CSAIL Computer Vision group.
+
+## Citation
+
+```
+@inproceedings{zhou2017scene,
+  title={Scene Parsing through ADE20K Dataset},
+  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
+  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+  year={2017}
+}
+```
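
Note on the `segmentation` column in the ADE20K page above: the card says the PNG uses the "RGB encoding per ADE20K spec" but does not spell it out. The sketch below assumes the encoding used by the full ADE20K release, where the class index is recovered from the red and green channels as `(R // 10) * 256 + G` and the blue channel carries instance information. Whether the Lance copy preserves exactly this encoding is an assumption to verify against the upstream data. The helper names `decode_class_id` and `class_ids` are hypothetical.

```python
def decode_class_id(pixel):
    """Map one (R, G, B) pixel to an object-class index.

    Assumption: full-ADE20K encoding, class = (R // 10) * 256 + G.
    The blue channel is ignored here.
    """
    r, g, _b = pixel
    return (r // 10) * 256 + g


def class_ids(pixels):
    """Distinct class indices found in an iterable of (R, G, B) pixels, sorted."""
    return sorted({decode_class_id(p) for p in pixels})
```

In practice the pixels would come from the decoded `segmentation` bytes opened in RGB mode. The functions take plain tuples so the arithmetic can be checked without any image library.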
diff --git a/docs/datasets/chartqa.mdx b/docs/datasets/chartqa.mdx
@@ -0,0 +1,114 @@
+---
+title: "ChartQA"
+sidebarTitle: "ChartQA"
+description: "Lance-formatted version of ChartQA — VQA over scientific and business charts that combine logical and visual reasoning — sourced from lmms-lab/ChartQA."
+---
+
+<Card
+  title="View on Hugging Face"
+  icon="/static/assets/logo/huggingface-logo.svg"
+  href="https://huggingface.co/datasets/lance-format/chartqa-lance"
+>
+Source dataset card and downloadable files for `lance-format/chartqa-lance`.
+</Card>
+
+Lance-formatted version of [ChartQA](https://github.com/vis-nlp/ChartQA) — VQA over scientific and business charts that combine logical and visual reasoning — sourced from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA).
+
+## Splits
+
+| Split | Rows |
+|-------|------|
+| `test.lance` | 2,500 |
+
+> The `lmms-lab/ChartQA` redistribution exposes test only. Train and validation live in the original release (https://github.com/vis-nlp/ChartQA); add them via `chartqa/dataprep.py --splits` once a parquet mirror is identified.
+
+## Schema
+
+| Column | Type | Notes |
+|---|---|---|
+| `id` | `int64` | Row index |
+| `image` | `large_binary` | Inline chart image bytes |
+| `image_id` / `question_id` | `string?` | (Source does not assign explicit ids — null for now) |
+| `question` | `string` | Natural-language question |
+| `answers` | `list<string>` | Reference answers (typically a single entry) |
+| `answer` | `string` | First answer — used as canonical |
+| `type` | `string?` | Question type (`human` vs `augmented`) |
+| `image_emb` | `fixed_size_list<float32, 512>` | CLIP image embedding (cosine-normalized) |
+| `question_emb` | `fixed_size_list<float32, 512>` | CLIP text embedding of the question |
+
+## Pre-built indices
+
+- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine`
+- `INVERTED` (FTS) on `question` and `answer`
+- `BITMAP` on `type`
+
+## Quick start
+
+```python
+import lance
+ds = lance.dataset("hf://datasets/lance-format/chartqa-lance/data/test.lance")
+print(ds.count_rows(), ds.schema.names, ds.list_indices())
+```
+
+## Load with LanceDB
+
+These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
+tbl = db.open_table("test")
+print(f"LanceDB table opened with {len(tbl)} chart-question pairs")
+```
+
+### LanceDB vector search
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
+tbl = db.open_table("test")
+
+ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0]
+query_embedding = ref["question_emb"]
+
+results = (
+    tbl.search(query_embedding, vector_column_name="question_emb")
+    .metric("cosine")
+    .select(["question", "answer"])
+    .limit(5)
+    .to_list()
+)
+```
+
+### LanceDB full-text search
+
+```python
+import lancedb
+
+db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
+tbl = db.open_table("test")
+
+results = (
+    tbl.search("percentage", query_type="fts")
+    .select(["question", "answer"])
+    .limit(10)
+    .to_list()
+)
+```
+
+## Source & license
+
+Converted from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA). The original ChartQA dataset is released under the GNU GPL-3.0 license by Masry et al.
+
+## Citation
+
+```
+@inproceedings{masry2022chartqa,
+  title={ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning},
+  author={Masry, Ahmed and Long, Do Xuan and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul},
+  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
+  year={2022}
+}
+```