Skip to content

Write Deletion Vectors#2822

Closed
rambleraptor wants to merge 6 commits intoapache:mainfrom
rambleraptor:deletion_vector_write
Closed

Write Deletion Vectors#2822
rambleraptor wants to merge 6 commits intoapache:mainfrom
rambleraptor:deletion_vector_write

Conversation

@rambleraptor
Copy link
Copy Markdown
Contributor

Part of #2261

Rationale for this change

This adds a PuffinWriter for writing deletion vectors.

Right now, it's just the writer class + some round trip tests (where we read + write the same file) to sanity check that the PuffinWriter works as expected. Writing Puffin files is very complex, so I wanted to make sure we all agreed on the writing semantics before using this elsewhere.

Let me know your thoughts on this (or if it's too granular)

Are these changes tested?

Unit tests included

Are there any user-facing changes?

Comment thread pyiceberg/table/puffin.py
Comment thread tests/table/test_puffin.py Outdated
Comment thread pyiceberg/table/puffin.py Outdated
Comment thread pyiceberg/table/puffin.py
@geruh geruh mentioned this pull request Dec 10, 2025
14 tasks
Comment thread pyiceberg/table/puffin.py Outdated
bitmaps: dict[int, BitMap] = {}
cardinality = 0
for pos in positions:
cardinality += 1
Copy link
Copy Markdown
Member

@geruh geruh Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cardinality could be incorrect with same positions passed in we can probably use the pyroaring stats to get this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I made the proper change, let me know if you're thinking differently.

@glesperance
Copy link
Copy Markdown

glesperance commented Dec 11, 2025

Hey @rambleraptor, I was working on a DV implementation before discovering this PR. Since review is already underway, I'd rather contribute here than duplicate effort.

I've added a Spark interoperability test: glesperance/iceberg-python@c25fe312

This verifies pyiceberg can read Spark-written DVs. Combined with your existing round-trip tests, this confirms format compatibility... ie if the same reader handles both, Spark can read ours too.

This may be redundant with your existing .bin fixture tests, though I believe those test the raw bitmap format rather than full Puffin DVs with the Java wrapper (length + magic + CRC). Let me know if I'm wrong on that.

Happy to PR to your fork if you think it's pertinent -or- feel free to cherry pick the commit as you see fit.

Verify pyiceberg's PuffinFile reader can parse deletion vectors written
by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.
@rambleraptor
Copy link
Copy Markdown
Contributor Author

@glesperance Thanks so much! I patched in your commit and I'll push it up along with my changes. Your name should appear in the commit log + PR. Let me know if you don't see it.

@rambleraptor
Copy link
Copy Markdown
Contributor Author

PR comments have been addressed.

@geruh it looks like your work on DeleteFileIndexes will be very useful for determing offsets + lengths on the blobs!

@rambleraptor rambleraptor requested review from ebyhr and geruh December 12, 2025 00:46
@glesperance
Copy link
Copy Markdown

glesperance commented Dec 16, 2025

Really excited to see this moving forward.

@rambleraptor Thanks for the opportunity to contribute and for handling the updates. On full DV support, have you started on the delete/manifest writers for v3 and the MOR logic?

I’ve got a working PoC with some tests but it’ll certainly need more polish before it's PR-ready.
I also need to rebase it on top of #2822 (this) and #2180

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 18, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants