GH-3011: Deny further writes after InternalParquetRecordWriter is aborted #3450

wgtmac merged 3 commits into apache:master
Conversation
Hi @LuciferYang, since we already check `aborted` in `close()`, do we also need to check it in `write()`?
Hi @Jiayi-Wang-db, thanks for the question! You're right that `close()` already checks the aborted state. The concern here is more about silent data loss. Consider this scenario: a write fails and marks the writer aborted, the caller catches the exception and keeps calling `write()`, every call is accepted without error, and then `close()` skips the flush because the writer is aborted.

The user may not realize any data was lost until they read the file later and find records missing. By throwing an `IOException` on the first `write()` after the abort, the failure surfaces immediately instead.
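The silent-loss scenario can be modeled with a small standalone sketch (plain Java, no Parquet dependencies; `ModelWriter` and the "poison" record are illustrative stand-ins, not Parquet code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Minimal standalone model of the pre-fix behavior: after a failed write
// the writer is marked aborted, later write() calls are still accepted
// silently, and close() drops everything that was buffered.
public class SilentLossDemo {
  static class ModelWriter {
    final List<String> buffer = new ArrayList<>();
    final List<String> persisted = new ArrayList<>();
    boolean aborted = false;

    void write(String record) throws IOException {
      buffer.add(record);
      if ("poison".equals(record)) { // stand-in for e.g. an OOM during page flush
        aborted = true;
        throw new IOException("flush failed");
      }
    }

    void close() {
      if (!aborted) {
        persisted.addAll(buffer); // flush only happens when not aborted
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ModelWriter w = new ModelWriter();
    try {
      w.write("a");
      w.write("poison"); // fails and marks the writer aborted
    } catch (IOException e) {
      // the caller swallows the error and keeps going
    }
    w.write("b"); // accepted without complaint
    w.write("c");
    w.close();
    System.out.println("persisted records: " + w.persisted.size());
  }
}
```

Every record, including the ones written before the failure, ends up dropped, and nothing in the write path ever told the caller.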
Hi @LuciferYang, thanks for the clarification. Yes, it seems like the caller could silently swallow the exception.
I agree with @Jiayi-Wang-db that once any exception is reported by the writer, users should not continue to call `write()`.
Hi @wgtmac, agree: calling `write()` after an abort is the caller's mistake. Right now, though, the writer quietly accepts all writes after abort, and `close()` then drops everything without a word.

That said, if you think this isn't worth the change, happy to close this PR and the issue. Let me know.
wgtmac left a comment
Thanks @LuciferYang for the quick response! I think overall the fix in this PR is benign, so I'm fine with it.
Thank you @wgtmac ~
Rationale for this change
When a write fails (e.g. an OOM during page flush), `InternalParquetRecordWriter` sets `aborted = true` and re-throws. But if the caller catches the exception and keeps calling `write()`, nothing stops them: all calls go through without error. Then `close()` sees `aborted` and skips the flush, so all those records are just gone. The caller won't know until they read the file back and find data missing.

Granted, writing after an error is the caller's fault. But silently discarding data makes it much harder to track down what went wrong.
`java.io.FileOutputStream` handles a similar situation the same way: `write()` after `close()` is also misuse, but it still throws `IOException("Stream Closed")` instead of quietly accepting bytes that will never be persisted.

What changes are included in this PR?
Check `aborted` at the top of `InternalParquetRecordWriter.write()`. If aborted, throw `IOException` right away.

Are these changes tested?
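The tested behavior can be reproduced with a standalone sketch (`GuardedWriter` and its exception messages are illustrative; the real class is `InternalParquetRecordWriter`):

```java
import java.io.IOException;

// Sketch of the behavior this PR's test exercises: after a first failure
// marks the writer aborted, any later write() fails fast, while close()
// on an aborted writer stays quiet.
public class WriteAfterAbortDemo {
  static class GuardedWriter {
    boolean aborted = false;

    void write(String record) throws IOException {
      if (aborted) {
        // the new fail-fast guard at the top of write()
        throw new IOException("The writer has been aborted");
      }
      if ("boom".equals(record)) { // simulated flush failure
        aborted = true;
        throw new IOException("flush failed");
      }
    }

    void close() {
      // an aborted writer closes without throwing
    }
  }

  public static void main(String[] args) {
    GuardedWriter w = new GuardedWriter();
    try { w.write("boom"); } catch (IOException e) { /* first failure */ }
    try {
      w.write("next");
      System.out.println("write accepted");
    } catch (IOException e) {
      System.out.println("write rejected: " + e.getMessage());
    }
    w.close();
    System.out.println("close ok");
  }
}
```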
Added `testWriteAfterAbortShouldThrow` in `TestParquetWriterError`:

- `write()` after abort → `IOException` with expected message
- `close()` on an aborted writer → no exception

Existing tests in `parquet-hadoop` still pass.

Are there any user-facing changes?
If you used to catch a write exception and keep writing to the same `ParquetWriter`, you'll now get an `IOException` on the next `write()`. Fix: discard the writer and create a new one.

Closes #3011
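The discard-and-recreate fix can be sketched as follows (standalone Java; `StubWriter` is a hypothetical stand-in for `ParquetWriter` so the sketch runs without Parquet on the classpath):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Recovery pattern after this change: once a write fails, close and
// discard the aborted writer, then start over with a fresh one.
public class RecreateWriterDemo {
  static class StubWriter {
    boolean aborted = false;
    final List<String> records = new ArrayList<>();

    void write(String record) throws IOException {
      if (aborted) throw new IOException("writer aborted"); // fail fast, as in the fix
      if ("bad".equals(record)) {
        aborted = true;
        throw new IOException("write failed");
      }
      records.add(record);
    }

    void close() { }
  }

  public static void main(String[] args) throws IOException {
    StubWriter writer = new StubWriter();
    try {
      writer.write("bad");
    } catch (IOException e) {
      writer.close();            // drop the aborted writer...
      writer = new StubWriter(); // ...and replace it with a new one
    }
    writer.write("good");        // succeeds on the fresh writer
    System.out.println("records written: " + writer.records.size());
  }
}
```

Continuing with the old writer instead of replacing it would make the second `write()` throw, which is exactly the new fail-fast behavior.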