Tracking issues of iceberg-rust v0.3.0

# Iceberg-rust 0.3.0

The main objective of 0.3.0 is to have a working read path (non-exhaustive list):

- [x] **Scan API** Added by @liurenjie1024 in https://github.com/apache/iceberg-rust/pull/129
- [x] **Predicate pushdown into the Parquet reader** Worked on by @viirya in https://github.com/apache/iceberg-rust/pull/295
- [x] **Parquet projection into Arrow streams** Worked on by @viirya in https://github.com/apache/iceberg-rust/pull/245 (some limitations remain, see the PR)
- [x] **Manifest pruning using the `field_summary`**: Skipping data at the highest level by pruning away manifests:
  - [x] **Transforms** added by @marvinlanhenke in https://github.com/apache/iceberg-rust/pull/309
  - [x] **ManifestEvaluator** added by @sdd in https://github.com/apache/iceberg-rust/pull/322
    - [x] **Implement TODOs** Some of the expressions still need to be [implemented](https://github.com/apache/iceberg-rust/blob/aba620900e99423bbd3fed969618e67e58a03a7b/crates/iceberg/src/expr/visitors/manifest_evaluator.rs#L173). Issue in https://github.com/apache/iceberg-rust/issues/350
    - [x] **Tests** port the [test-suite from Python](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/tests/expressions/test_visitors.py#L862-L1477) to Rust
  - [x] **Filter in `TableScan`** in flight by @sdd in https://github.com/apache/iceberg-rust/pull/323
- [x] Skipping manifest-entries within a manifest based on the `102: partition` struct
  - [x] **Accessors** added by @sdd in https://github.com/apache/iceberg-rust/pull/317
  - [x] **Projection** added by @marvinlanhenke in https://github.com/apache/iceberg-rust/pull/309
  - [x] **ExpressionEvaluator** Implement the evaluator worked on by @marvinlanhenke: https://github.com/apache/iceberg-rust/issues/358
  - [x] **Bind `partition-spec`** Bind the partition-spec schema to the [`102: partition` struct and evaluate it](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/expressions/visitors.py#L461-L534).
  - [ ] **Filter in `TableScan`**
- [x] Skip data-files using the metrics evaluator
  - [x] **InclusiveMetricsEvaluator** worked on by @sdd in https://github.com/apache/iceberg-rust/pull/347
  - [x] **InclusiveProjection** added by @sdd in https://github.com/apache/iceberg-rust/pull/322
    - [x] Refactored in https://github.com/apache/iceberg-rust/pull/360
    - [x] Refactored in https://github.com/apache/iceberg-rust/pull/362
  - [ ] **Filter in `TableScan`**
- [x] **Datafusion** Integration with [Apache Datafusion](https://datafusion.apache.org/) to add SQL support: https://github.com/apache/iceberg-rust/issues/357
    - [x] **Initial groundwork** in https://github.com/apache/iceberg-rust/pull/324.
- [ ] **Runtime**
  - [ ] **Parallel loading** https://github.com/apache/iceberg-rust/issues/124
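
To make the manifest-pruning step above concrete, here is a minimal sketch of how a field summary can rule out a manifest. `FieldSummary`, `Predicate`, and `might_match` are hypothetical names for illustration, not the iceberg-rust API; bounds are simplified to `i64`, and a manifest with missing bounds is conservatively kept.

```rust
/// Simplified stand-in for a manifest's partition field summary (hypothetical type).
#[derive(Debug, Clone)]
pub struct FieldSummary {
    pub contains_null: bool,
    pub lower_bound: Option<i64>,
    pub upper_bound: Option<i64>,
}

/// A tiny subset of predicates on a single partition field (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    Eq(i64),
    Lt(i64),
    GtEq(i64),
    IsNull,
}

/// Returns true if the manifest MIGHT contain matching rows and must be read.
/// Returning false means the manifest can be skipped entirely.
pub fn might_match(summary: &FieldSummary, pred: Predicate) -> bool {
    match pred {
        Predicate::IsNull => summary.contains_null,
        // Equality can only match if the value lies within [lower, upper].
        Predicate::Eq(v) => match (summary.lower_bound, summary.upper_bound) {
            (Some(lo), Some(hi)) => lo <= v && v <= hi,
            _ => true, // assumption: keep the manifest when bounds are missing
        },
        // `x < v` can only match if the smallest value is below v.
        Predicate::Lt(v) => summary.lower_bound.map_or(true, |lo| lo < v),
        // `x >= v` can only match if the largest value is at least v.
        Predicate::GtEq(v) => summary.upper_bound.map_or(true, |hi| hi >= v),
    }
}

/// Keeps only the summaries (one per manifest here) that might match.
pub fn prune(manifests: &[FieldSummary], pred: Predicate) -> Vec<&FieldSummary> {
    manifests.iter().filter(|m| might_match(m, pred)).collect()
}
```

The same "might match" reasoning carries over to the partition struct and to the column metrics on data files; only the statistics being consulted change.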

Blocking issues:

- [ ] Field-IDs related:
  - [x] https://github.com/apache/iceberg-rust/issues/338
  - [x] https://github.com/apache/iceberg-rust/issues/131
  - [x] https://github.com/apache/iceberg-rust/issues/353
- [x] https://github.com/apache/iceberg-rust/issues/352

Nice to have (related to the query plan optimizations above):

- [ ] **Implement skipping based on sequence number** [skip `DELETE` manifests](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1654-L1668) that contain unrelated delete files.
- [ ] **Add support for more fileio** (https://github.com/apache/iceberg-rust/issues/408)


State of catalog integration:

- [ ] **Catalog support**
  - [x] **REST Catalog**: First stab by @liurenjie1024 in https://github.com/apache/iceberg-rust/pull/78
    - [x] @Fokko will follow up with IT tests
  - [x] **Glue Catalog** Added by @marvinlanhenke in:
    - [x] https://github.com/apache/iceberg-rust/pull/294
    - [x] https://github.com/apache/iceberg-rust/pull/304
    - [x] https://github.com/apache/iceberg-rust/pull/314
  - [ ] **SQL Catalog** Worked on by @JanKaul in https://github.com/apache/iceberg-rust/pull/229
  - [x] **Hive Catalog** Added by @Xuanwo.
    - [x] Do we want similar [IT tests](https://github.com/apache/iceberg-python/pull/207) as in PyIceberg?

For the release after that, I think the commit path is going to be important.

# Iceberg-rust 0.4.0 and beyond

Nice to have for the 0.3.0 release, but not required. Of course, open for debate.

- [ ] **Support for Positional Deletes** Entails [matching the deletes to the datafiles](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1605-L1627) based on the statistics.
- [ ] **Support for Equality Deletes** Entails putting the delete files in the right order to apply them in the right sequence.
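
As a rough illustration of the ordering rule behind both kinds of deletes: per the Iceberg spec, a position delete applies to data files whose data sequence number is less than or equal to the delete file's, while an equality delete applies only to data files with a strictly smaller sequence number. The sketch below uses hypothetical types (`DeleteKind`, `DeleteFile`) and leaves out the partition and statistics matching that a real implementation also needs.

```rust
/// The two kinds of delete files (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeleteKind {
    Positional,
    Equality,
}

/// Minimal description of a delete file (hypothetical type).
#[derive(Debug, Clone)]
pub struct DeleteFile {
    pub kind: DeleteKind,
    pub sequence_number: i64,
    pub path: String,
}

/// Sequence-number check only; partition matching is deliberately omitted.
pub fn delete_applies(kind: DeleteKind, data_seq: i64, delete_seq: i64) -> bool {
    match kind {
        // Position deletes may target files written in the same commit.
        DeleteKind::Positional => data_seq <= delete_seq,
        // Equality deletes only affect data written strictly earlier.
        DeleteKind::Equality => data_seq < delete_seq,
    }
}

/// Selects the delete files that could apply to a data file with `data_seq`.
pub fn deletes_for(data_seq: i64, deletes: &[DeleteFile]) -> Vec<&DeleteFile> {
    deletes
        .iter()
        .filter(|d| delete_applies(d.kind, data_seq, d.sequence_number))
        .collect()
}
```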

## Commit path

The commit path entails writing a new metadata JSON.

- [ ] **Applying updates to the metadata** [Updating the metadata](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L706-L956) is important both for writing a new version of the JSON in case of a non-REST catalog, but also to keep an up-to-date version in memory. It is very much recommended to re-use the [Updates](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2557-L2575)/[Requirement](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2588-L2605) objects provided by the [REST catalog protocol](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).
  - [ ] **REST Catalog** [serialize the updates and requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1364-L1373) into JSON which is [dispatched to the REST catalog](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/rest.py#L675-L706).
  - [ ] **Other catalogs** For the other catalogs, the updates/requirements are not dispatched to the catalog; instead, there are additional steps:
    - [ ] Logic to [validate the requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L453-L455) against the metadata, to detect commit conflicts.
    - [ ] Writing a new version of the [metadata.json](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/__init__.py#L775-L776).
    - [ ] Provide locking mechanisms within the commit ([Glue](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L476-L483), [Hive](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/hive.py#L379), [SQL](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/sql.py#L426), ..)
- [ ] **Update table properties** Sets [properties on the table](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L326). Probably the best place to start, since it doesn't require a complicated API.
- [ ] **Schema evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1809) to update the schema, and produce new metadata.
- [ ] **Partition spec evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L3003) to update the partition spec, and produce new metadata.
- [ ] **Sort order evolution** API to update the sort order, and produce new metadata.
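
For the non-REST catalogs, the requirement validation step could look roughly like the sketch below. The variant names mirror the `assert-table-uuid`, `assert-current-schema-id`, and `assert-ref-snapshot-id` requirements from the REST catalog protocol, but `TableMetadata`, `Requirement`, and `validate_all` as written here are simplified, hypothetical stand-ins and not the iceberg-rust types.

```rust
use std::collections::HashMap;

/// Heavily simplified view of the current table metadata (hypothetical type).
pub struct TableMetadata {
    pub table_uuid: String,
    pub current_schema_id: i32,
    /// Maps a ref name such as "main" to the snapshot ID it points at.
    pub refs: HashMap<String, i64>,
}

/// A small subset of commit requirements (hypothetical enum).
pub enum Requirement {
    AssertTableUuid { uuid: String },
    AssertCurrentSchemaId { current_schema_id: i32 },
    /// `snapshot_id: None` asserts that the ref does not exist yet.
    AssertRefSnapshotId { ref_name: String, snapshot_id: Option<i64> },
}

impl Requirement {
    /// Checks one requirement against the latest metadata; Err means a conflict.
    pub fn validate(&self, base: &TableMetadata) -> Result<(), String> {
        match self {
            Requirement::AssertTableUuid { uuid } => {
                if &base.table_uuid == uuid {
                    Ok(())
                } else {
                    Err(format!("table UUID changed: expected {uuid}"))
                }
            }
            Requirement::AssertCurrentSchemaId { current_schema_id } => {
                if base.current_schema_id == *current_schema_id {
                    Ok(())
                } else {
                    Err(format!("current schema ID changed: expected {current_schema_id}"))
                }
            }
            Requirement::AssertRefSnapshotId { ref_name, snapshot_id } => {
                let actual = base.refs.get(ref_name).copied();
                if actual == *snapshot_id {
                    Ok(())
                } else {
                    Err(format!("ref {ref_name} moved: expected {snapshot_id:?}, found {actual:?}"))
                }
            }
        }
    }
}

/// Validates every requirement, stopping at the first conflict.
pub fn validate_all(reqs: &[Requirement], base: &TableMetadata) -> Result<(), String> {
    reqs.iter().try_for_each(|r| r.validate(base))
}
```

Only when all requirements hold would the catalog apply the updates, write the new metadata JSON, and swap the pointer under its lock.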

## Metadata tables

Metadata tables are used to [inspect the table](https://iceberg.apache.org/docs/1.5.0/spark-ddl/). Having these tables also allows easy implementation of the [maintenance procedures](https://iceberg.apache.org/docs/1.5.0/spark-procedures/) since you can easily list all the snapshots, and expire the ones that are older than a certain threshold.
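
As a small illustration of why a snapshots table makes maintenance easy, the sketch below picks expiry candidates from a list of snapshot rows. `SnapshotRow` and `snapshots_to_expire` are hypothetical names; a real procedure would also have to retain snapshots referenced by branches and tags, which this simplification ignores.

```rust
/// One row of a snapshots metadata table, reduced to two columns (hypothetical type).
#[derive(Debug, Clone)]
pub struct SnapshotRow {
    pub snapshot_id: i64,
    /// Commit timestamp in milliseconds since the Unix epoch.
    pub committed_at_ms: i64,
}

/// Returns the IDs of snapshots committed before `older_than_ms`,
/// never including the table's current snapshot.
pub fn snapshots_to_expire(
    rows: &[SnapshotRow],
    current_snapshot_id: i64,
    older_than_ms: i64,
) -> Vec<i64> {
    rows.iter()
        .filter(|r| r.committed_at_ms < older_than_ms)
        .filter(|r| r.snapshot_id != current_snapshot_id)
        .map(|r| r.snapshot_id)
        .collect()
}
```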

## Write support

Most of the work in write support is around generating the correct Iceberg metadata. Some decisions can be made, for example first supporting only FastAppends, and only V2 metadata.

It is common to have multiple snapshots in a single commit to the catalog. For example, an overwrite operation of a partition can be a delete + append operation. This makes the implementation easier since you can separate the problems and tackle them one by one. It also helps the roadmap, since these operations can be developed in parallel.

- [ ] **Commit semantics**
  - [ ] **MergeAppend** appends new manifest list entries to existing manifest files. Reduces the amount of metadata produced, but takes some more time to commit since existing metadata has to be rewritten, and retries are also more costly.
  - [ ] **FastAppend** Generates a new manifest per commit, which allows fast commits, but generates more metadata in the long run. PR by @ZENOTME in https://github.com/apache/iceberg-rust/pull/349
- [ ] **Snapshot generation** manipulation of data within a table is done by [appending snapshots](https://iceberg.apache.org/spec/#snapshots) to the metadata JSON.
  - [ ] **APPEND** Only data files were added and no files were removed.
  - [ ] **REPLACE** Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
  - [ ] **OVERWRITE** Data and delete files were added and removed in a logical overwrite operation.
  - [ ] **DELETE** Data files were removed and their contents logically deleted and/or delete files were added to delete rows.
- [ ] **Add files** to add existing Parquet files to a table. Issue in https://github.com/apache/iceberg-rust/issues/345
  - [ ] [**Name mapping**](https://iceberg.apache.org/spec/#column-projection) in case the files don't have field-IDs set.
- [ ] **Summary generation** Part of the snapshot that indicates what's in the snapshot.
- [ ] **Metrics collection** There are two situations:
  - [ ] **Collect metrics when writing** This is done with the Java API, where during writing the upper and lower bounds are tracked and the number of null and NaN records is counted.
  - [ ] **Collect metrics from footer** When an existing file is added, the footer of the Parquet file is opened to reconstruct all the metrics needed for Iceberg.
- [ ] **Deletes** This mainly relies on strict projection to check if the data files cannot match with the predicate.
  - [ ] **Strict projection** needs to be added to the [transforms](https://github.com/apache/iceberg-python/pull/539).
  - [ ] **Strict Metrics Evaluator** to determine if the predicate [cannot match](https://github.com/apache/iceberg-python/pull/518).
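
The "collect metrics when writing" case can be sketched as a small accumulator that is updated for every value handed to the writer. `ColumnMetrics` is a hypothetical type restricted to a single `f64` column; it follows the spec's convention that the value count includes nulls and NaNs, and it keeps NaN out of the bounds.

```rust
/// Running metrics for one floating-point column (hypothetical type).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ColumnMetrics {
    /// Total number of values seen, including nulls and NaNs.
    pub value_count: u64,
    pub null_count: u64,
    pub nan_count: u64,
    pub lower_bound: Option<f64>,
    pub upper_bound: Option<f64>,
}

impl ColumnMetrics {
    /// Records one value as it is written; `None` represents a null.
    pub fn observe(&mut self, value: Option<f64>) {
        self.value_count += 1;
        match value {
            None => self.null_count += 1,
            // NaN is counted separately and excluded from the bounds.
            Some(v) if v.is_nan() => self.nan_count += 1,
            Some(v) => {
                self.lower_bound = Some(self.lower_bound.map_or(v, |lo| lo.min(v)));
                self.upper_bound = Some(self.upper_bound.map_or(v, |hi| hi.max(v)));
            }
        }
    }
}
```

Collecting from the footer would fill the same fields from the Parquet column statistics instead of from individual values.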

## Future topics

- [ ] **Python bindings**
- [ ] **WASM** to run Iceberg-rust in the browser

## Contribute

If you want to contribute to the upcoming milestone, feel free to comment on this issue. If there is anything unclear or missing, feel free to reach out here as well 👍 
