support revision/branches and transformation can return None to skip …#769
support revision/branches and transformation can return None to skip …#769mpeex wants to merge 3 commits intoLightning-AI:mainfrom
Conversation
for more information, see https://pre-commit.ci
| item = self.transform(item) | ||
|
|
||
| return item | ||
| return item if item else self.__next__() |
There was a problem hiding this comment.
It makes more sense to have this in __next__ method, and not in __getitem__.
|
Hi @mpeex, can you share more details on when If this is your specific use case, one option can be to override the class MyDS(StreamingDataset):
def __next__(self, index):
item = super().__next__(self)
return item if item else self.__next__()
ds = MyDs(...)Also, can you write some tests verifying the behaviour for |
|
hi @deependujha thanks for reviewing. The usecase is to filter samples while streaming using the transform function, for ex: returning None in any of the I will write some tests to verify the behaviour for HF dataset revision/branches. thanks for your time |
|
I think that skipping samples, whether that's via |
|
considering @philgzl comment, what is than the correct way to filter out samples if using |
@mpeex — since this appears to be specific to your filtering use case, one option is to use the approach suggested by @deependujha and override the
To fully avoid these caveats, I’d recommend filtering during the |
|
@mpeex, would you be open to opening a separate PR that only contains the HuggingFace revision/branch support? Let me know if that works for you, and thanks again for working through this 👍 |
|
Having a separate PR for HF revision/branches makes more sense. |
…sample
Before submitting
What does this PR do?
Fixes # (issue).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun? oh ye
Make sure you had fun coding 🙃