Merged
9 changes: 8 additions & 1 deletion Makefile
@@ -1,8 +1,9 @@
# Paths
SCRIPT := scripts/mdx_snippets_gen.py
HF_SYNC_SCRIPT := scripts/sync_hf_datasets.py

# uv run automatically handles virtualenv, so no activation needed
.PHONY: py ts rs snippets
.PHONY: py ts rs snippets hf-sync

# Generate Python MDX snippets
py:
@@ -19,3 +20,9 @@ rs:
# Convenience: generate all snippets
snippets: py ts rs

# Sync Lance dataset cards from lance-format/lance-huggingface into docs/datasets/.
# Regenerates per-dataset MDX pages, the landing-page card grid, and the
# Datasets tab in docs.json based on scripts/hf_datasets.yaml.
hf-sync:
	@uv run $(HF_SYNC_SCRIPT)

42 changes: 42 additions & 0 deletions README.md
@@ -67,3 +67,45 @@ code that's been tested (per recent LanceDB releases) in the hands of users.
> As far as possible, do not add code snippets manually inside triple-backticks! Write the tests for
> the required language in `tests/*` directory, then generate the snippets programmatically via the Makefile
> commands.

## Sync Hugging Face dataset pages

The `Datasets` tab is populated from [`lance-format/lance-huggingface`](https://github.com/lance-format/lance-huggingface),
the master repository where each Lance dataset published under the [`lance-format`](https://huggingface.co/lance-format)
Hugging Face organization has its own directory with an `HF_DATASET_CARD.md`. That same file is what gets pushed to
the Hub as the dataset's `README.md` via the `hf` CLI, so the GitHub repo is the single source of truth for the
content of every dataset card.

To avoid maintaining the same content in two places, the per-dataset MDX pages under `docs/datasets/` are
generated from those upstream cards via `scripts/sync_hf_datasets.py`. The script:

1. Reads `scripts/hf_datasets.yaml`, which lists every dataset to publish and maps the upstream directory name,
the URL slug, the HF Hub repo, and the human-readable title.
2. Fetches each `HF_DATASET_CARD.md` from `lance-format/lance-huggingface` on GitHub.
3. Rewrites the frontmatter for Mintlify (sets `title`, `sidebarTitle`, `description`), strips the upstream H1,
injects a "View on Hugging Face" card at the top, and sanitizes known MDX hazards (bibtex citations outside
code fences, literal `<>` in prose).
4. Writes `docs/datasets/<slug>.mdx`, regenerates the card grid in `docs/datasets/index.mdx` between the
`HF_SYNC:START` / `HF_SYNC:END` markers, and updates the `Datasets` tab in `docs/docs.json` to keep the
sidebar in sync.
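
The sanitizing part of step 3 can be pictured with a minimal sketch. This is not the code in `scripts/sync_hf_datasets.py`: the function name `sanitize_card_body` and the exact escaping rule (replace `<` with `&lt;` in prose only) are assumptions made for illustration. It drops the first H1 and leaves fenced code blocks and inline code spans untouched.

```python
import re

# Hypothetical helper, not taken from scripts/sync_hf_datasets.py.
_INLINE_CODE = re.compile(r"(`[^`]*`)")


def sanitize_card_body(markdown: str) -> str:
    """Drop the first H1 and escape '<' in prose, skipping all code."""
    out, in_fence, h1_dropped = [], False, False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # entering or leaving a fenced block
            out.append(line)
        elif in_fence:
            out.append(line)  # fenced code passes through verbatim
        elif not h1_dropped and line.startswith("# "):
            h1_dropped = True  # the title comes from frontmatter instead
        else:
            parts = _INLINE_CODE.split(line)
            # Even indices are prose, odd indices are inline code spans.
            parts[::2] = [p.replace("<", "&lt;") for p in parts[::2]]
            out.append("".join(parts))
    return "\n".join(out)
```

A real implementation would also need to handle cases this sketch ignores, such as autolinks written as `<https://...>` and raw HTML that should survive.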

Run it from the repo root:

```bash
make hf-sync
```
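
Marker-based regeneration, as described in step 4, is what lets hand-written content in `docs/datasets/index.mdx` survive a sync. The sketch below shows one way to do it, under assumptions: the helper name `replace_between_markers` is hypothetical, and it only looks for the marker text as a substring, so the comment syntax wrapped around `HF_SYNC:START` / `HF_SYNC:END` does not matter.

```python
START = "HF_SYNC:START"
END = "HF_SYNC:END"


def replace_between_markers(text: str, new_block: str) -> str:
    """Replace everything between the marker lines, keeping the markers."""
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if START in line), None)
    end = next((i for i, line in enumerate(lines) if END in line), None)
    if start is None or end is None or end <= start:
        raise ValueError("HF_SYNC markers missing or out of order")
    # Keep everything up to and including START, and from END onward.
    return "\n".join(lines[: start + 1] + new_block.splitlines() + lines[end:])
```

Because the markers themselves are preserved, running the function twice with the same block yields the same page.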

### Adding a new dataset

1. Author the new dataset's `HF_DATASET_CARD.md` upstream in `lance-format/lance-huggingface` (and push it to the
Hub as usual).
2. Add a single line for the dataset under the appropriate category in `scripts/hf_datasets.yaml`. The four
fields (`dir`, `slug`, `hf`, `title`) are explicit because the GitHub directory name, the HF Hub repo slug,
and the desired URL slug don't follow a derivable convention.
3. Run `make hf-sync`. The script will fetch the new card, generate `docs/datasets/<slug>.mdx`, refresh the
landing-page card grid, and add the new page to the `Datasets` tab in `docs/docs.json`.
4. Preview locally with `mint dev` and commit the changes (the MDX page, the regenerated `index.mdx`, the
updated `docs.json`, and the new yaml entry).
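
Since all four fields are required, a loader would plausibly reject incomplete entries before fetching anything. The following is a sketch of such a check, not the script's actual code: `validate_entry` is a hypothetical name, and the `dir` value in the example is assumed (the upstream directory name is not shown here).

```python
REQUIRED_FIELDS = ("dir", "slug", "hf", "title")


def validate_entry(entry: dict) -> dict:
    """Return the four required fields, or raise if any is missing or empty."""
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if missing:
        raise ValueError(f"dataset entry missing fields: {', '.join(missing)}")
    return {field: entry[field] for field in REQUIRED_FIELDS}


# Example entry as it might look after parsing the yaml.
# The "dir" value is an assumption made for illustration.
example = {
    "dir": "ade20k",
    "slug": "ade20k",
    "hf": "lance-format/ade20k-lance",
    "title": "ADE20K",
}
```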

If you remove a dataset from the yaml, the next `make hf-sync` will delete its MDX file and drop the sidebar
entry. The script hard-fails on any fetch error — partial regeneration would be worse than a clear error.
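
The removal behaviour amounts to a set difference between the pages on disk and the slugs still listed in the yaml. A minimal sketch, assuming a hypothetical helper `stale_pages` and that `index.mdx` is the only hand-maintained page in `docs/datasets/`:

```python
def stale_pages(existing_files: list[str], slugs: list[str]) -> list[str]:
    """Return the MDX files that no longer correspond to a listed dataset."""
    keep = {f"{slug}.mdx" for slug in slugs} | {"index.mdx"}
    return sorted(
        name for name in existing_files
        if name.endswith(".mdx") and name not in keep
    )
```

To match the hard-fail policy, such a list would only be acted on after every fetch has succeeded, so a network error never leaves the directory half-updated.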
169 changes: 169 additions & 0 deletions docs/datasets/ade20k.mdx
@@ -0,0 +1,169 @@
---
title: "ADE20K"
sidebarTitle: "ADE20K"
description: "Lance-formatted version of the full ADE20K scene parsing benchmark (sourced from 1aurent/ADE20K) — 27,574 scene images with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline."
---

<Card
  title="View on Hugging Face"
  icon="/static/assets/logo/huggingface-logo.svg"
  href="https://huggingface.co/datasets/lance-format/ade20k-lance"
>
  Source dataset card and downloadable files for `lance-format/ade20k-lance`.
</Card>

Lance-formatted version of the full [ADE20K scene parsing benchmark](https://groups.csail.mit.edu/vision/datasets/ADE20K/) (sourced from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K)) — **27,574 scene images** with semantic and instance segmentation maps, scene labels, and per-object metadata, all stored inline.

## Splits

| Split | Rows |
|-------|------|
| `train.lance` | 25,574 |
| `validation.lance` | 2,000 |

## Schema

| Column | Type | Notes |
|---|---|---|
| `id` | `int64` | Row index within split |
| `image` | `large_binary` | Inline JPEG bytes |
| `segmentation` | `large_binary` | Inline PNG bytes — semantic segmentation map (RGB encoding per ADE20K spec) |
| `instance` | `large_binary?` | Inline PNG bytes — instance map; null if not provided |
| `filename` | `string` | ADE20K relative filename |
| `scene` | `list<string>` | Scene labels (e.g. `["bathroom"]`) |
| `object_names` | `list<string>` | Names of all annotated objects (one entry per polygon) |
| `objects_present` | `list<string>` | Deduped object names — feeds the `LABEL_LIST` index |
| `num_objects` | `int32` | Number of annotated objects |
| `image_emb` | `fixed_size_list<float32, 512>` | OpenCLIP `ViT-B-32` image embedding (cosine-normalized) |
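
The `segmentation` PNG stores class information in its colour channels. The sketch below assumes the maps follow the encoding used by the loader in the official ADE20K toolkit (class index from the R and G channels, instance information in B); that this Lance copy preserves the encoding unchanged is an assumption, and `decode_ade20k_pixel` is a hypothetical helper name.

```python
def decode_ade20k_pixel(r: int, g: int, b: int) -> tuple[int, int]:
    """Decode one RGB pixel of an ADE20K segmentation map.

    Assumes the ADE20K toolkit convention: class = (R // 10) * 256 + G.
    The B channel is returned as-is; the toolkit remaps its distinct
    values to consecutive instance ids.
    """
    class_index = (r // 10) * 256 + g
    return class_index, b
```

With Pillow, a pixel can be read via `Image.open(io.BytesIO(row["segmentation"])).convert("RGB").getpixel((x, y))` and passed to the helper as `decode_ade20k_pixel(*pixel)`.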

## Pre-built indices

- `IVF_PQ` on `image_emb` — `metric=cosine`
- `BTREE` on `num_objects`
- `LABEL_LIST` on `objects_present` — supports `array_has_any` / `array_has_all`
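
The two label predicates differ only in whether every listed label or at least one must be present. A small helper, sketched here under the assumption that labels are plain strings, builds either filter in the form used by the examples on this page; `array_has_filter` and `sample_rows` are hypothetical names.

```python
def array_has_filter(column: str, labels: list[str], match_all: bool = True) -> str:
    """Build an array_has_all / array_has_any filter expression."""
    fn = "array_has_all" if match_all else "array_has_any"
    # Double any single quotes so a label cannot break out of the SQL literal.
    quoted = ", ".join("'" + label.replace("'", "''") + "'" for label in labels)
    return f"{fn}({column}, [{quoted}])"


def sample_rows(limit: int = 5) -> list[dict]:
    """Fetch rows that contain a bed or a window (needs network access)."""
    import lance

    ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
    return ds.scanner(
        filter=array_has_filter("objects_present", ["bed", "window"], match_all=False),
        columns=["filename", "scene"],
        limit=limit,
    ).to_table().to_pylist()
```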

## Quick start

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
```

## Load with LanceDB

These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
tbl = db.open_table("validation")
print(f"LanceDB table opened with {len(tbl)} scene images")
```

## Read an image with its segmentation

```python
import io
import lance
from PIL import Image

ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
row = ds.take([0], columns=["image", "segmentation", "scene", "objects_present"]).to_pylist()[0]

Image.open(io.BytesIO(row["image"])).save("img.jpg")
Image.open(io.BytesIO(row["segmentation"])).save("seg.png")
print("scene:", row["scene"])
print("objects:", row["objects_present"][:10])
```

## Filter by scene / objects

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")

# Scenes containing both a bed and a window.
rows = ds.scanner(
    filter="array_has_all(objects_present, ['bed', 'window'])",
    columns=["filename", "scene"],
    limit=10,
).to_table().to_pylist()
```

### Filter with LanceDB

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
tbl = db.open_table("validation")

rows = (
    tbl.search()
    .where("array_has_all(objects_present, ['bed', 'window'])")
    .select(["filename", "scene"])
    .limit(10)
    .to_list()
)
```

## Visual similarity search

```python
import lance
import pyarrow as pa

ds = lance.dataset("hf://datasets/lance-format/ade20k-lance/data/validation.lance")
emb_field = ds.schema.field("image_emb")
ref = ds.take([0], columns=["image_emb"]).to_pylist()[0]["image_emb"]
query = pa.array([ref], type=emb_field.type)

neighbors = ds.scanner(
    nearest={"column": "image_emb", "q": query[0], "k": 5},
    columns=["filename", "scene"],
).to_table().to_pylist()
```

### LanceDB visual similarity search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/ade20k-lance/data")
tbl = db.open_table("validation")

ref = tbl.search().limit(1).select(["image_emb"]).to_list()[0]
query_embedding = ref["image_emb"]

results = (
    tbl.search(query_embedding)
    .metric("cosine")
    .select(["filename", "scene"])
    .limit(5)
    .to_list()
)
```

## Why Lance?

- One dataset for images + segmentation + instance + scene + objects + embeddings + indices — no folder of paired files.
- On-disk vector and label-list indices live next to the data, so search works on local copies and on the Hub.
- Schema evolution: add columns (panoptic ids, fresh embeddings, model predictions) without rewriting the data.

## Source & license

Converted from [`1aurent/ADE20K`](https://huggingface.co/datasets/1aurent/ADE20K). ADE20K is released under the [BSD 3-Clause license](https://github.com/CSAILVision/ADE20K/blob/main/LICENSE) by the MIT CSAIL Computer Vision group.

## Citation

```bibtex
@inproceedings{zhou2017scene,
title={Scene Parsing through ADE20K Dataset},
author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2017}
}
```
114 changes: 114 additions & 0 deletions docs/datasets/chartqa.mdx
@@ -0,0 +1,114 @@
---
title: "ChartQA"
sidebarTitle: "ChartQA"
description: "Lance-formatted version of ChartQA — VQA over scientific and business charts that combine logical and visual reasoning — sourced from lmms-lab/ChartQA."
---

<Card
  title="View on Hugging Face"
  icon="/static/assets/logo/huggingface-logo.svg"
  href="https://huggingface.co/datasets/lance-format/chartqa-lance"
>
  Source dataset card and downloadable files for `lance-format/chartqa-lance`.
</Card>

Lance-formatted version of [ChartQA](https://github.com/vis-nlp/ChartQA) — VQA over scientific and business charts that combine logical and visual reasoning — sourced from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA).

## Splits

| Split | Rows |
|-------|------|
| `test.lance` | 2,500 |

> The `lmms-lab/ChartQA` redistribution exposes test only. Train and validation live in the original release (https://github.com/vis-nlp/ChartQA); add them via `chartqa/dataprep.py --splits` once a parquet mirror is identified.

## Schema

| Column | Type | Notes |
|---|---|---|
| `id` | `int64` | Row index |
| `image` | `large_binary` | Inline chart image bytes |
| `image_id` / `question_id` | `string?` | (Source does not assign explicit ids — null for now) |
| `question` | `string` | Natural-language question |
| `answers` | `list<string>` | Reference answers (typically a single string) |
| `answer` | `string` | First answer — used as canonical |
| `type` | `string?` | Question type (`human` vs `augmented`) |
| `image_emb` | `fixed_size_list<float32, 512>` | CLIP image embedding (cosine-normalized) |
| `question_emb` | `fixed_size_list<float32, 512>` | CLIP text embedding of the question |

## Pre-built indices

- `IVF_PQ` on `image_emb` and `question_emb` — `metric=cosine`
- `INVERTED` (FTS) on `question` and `answer`
- `BITMAP` on `type`

## Quick start

```python
import lance

ds = lance.dataset("hf://datasets/lance-format/chartqa-lance/data/test.lance")
print(ds.count_rows(), ds.schema.names, ds.list_indices())
```

## Load with LanceDB

These tables can also be consumed by [LanceDB](https://lancedb.github.io/lancedb/), the multimodal lakehouse and embedded search library built on top of Lance, for simplified vector search and other queries.

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")
print(f"LanceDB table opened with {len(tbl)} chart-question pairs")
```

### LanceDB vector search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

ref = tbl.search().limit(1).select(["question_emb", "question"]).to_list()[0]
query_embedding = ref["question_emb"]

results = (
    tbl.search(query_embedding, vector_column_name="question_emb")
    .metric("cosine")
    .select(["question", "answer"])
    .limit(5)
    .to_list()
)
```

### LanceDB full-text search

```python
import lancedb

db = lancedb.connect("hf://datasets/lance-format/chartqa-lance/data")
tbl = db.open_table("test")

results = (
    tbl.search("percentage")
    .select(["question", "answer"])
    .limit(10)
    .to_list()
)
```

## Source & license

Converted from [`lmms-lab/ChartQA`](https://huggingface.co/datasets/lmms-lab/ChartQA). The original ChartQA dataset is released under the GNU GPL-3.0 license by Masry et al.

## Citation

```bibtex
@inproceedings{masry2022chartqa,
title={ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning},
author={Masry, Ahmed and Long, Do Xuan and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
year={2022}
}
```