diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md
new file mode 100644
index 00000000..9a92b974
--- /dev/null
+++ b/docs/guides/DataframesViaArrow.md
@@ -0,0 +1,110 @@
+# Working with DataFrames via Arrow
+
+ParquetSharp now provides Arrow-based APIs for reading Parquet data and working with .NET `DataFrame` objects. Using Arrow can improve performance and reduce unnecessary memory copies. **However, there are limitations**, described below.
+
+## Prerequisites
+
+You'll need these packages:
+
+```xml
+<ItemGroup>
+  <PackageReference Include="ParquetSharp" />
+  <PackageReference Include="Microsoft.Data.Analysis" />
+  <PackageReference Include="Apache.Arrow" />
+</ItemGroup>
+```
+
+## Reading a Single Batch from Parquet
+
+Arrow integration works reliably for reading a single batch. Here's how to read one batch and convert it to a DataFrame:
+
+```csharp
+using ParquetSharp.Arrow;
+using Microsoft.Data.Analysis;
+using Apache.Arrow;
+
+using var fileReader = new FileReader("sample.parquet");
+using var batchReader = fileReader.GetRecordBatchReader();
+
+var batch = await batchReader.ReadNextRecordBatchAsync();
+if (batch != null)
+{
+    using (batch)
+    {
+        var df = DataFrame.FromArrowRecordBatch(batch).Clone();
+        Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}");
+        Console.WriteLine(df.Head(5));
+    }
+}
+```
+
+This works reliably for all standard column types.
+
+## Reading All Batches Separately
+
+For files with multiple batches, each batch can be converted into a DataFrame individually.
+
+**Note**: Combining multiple batches using `Append()` is unreliable, particularly with string columns.
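+
+For reference, the fragile merge pattern that this note warns about would look something like this (a hypothetical sketch: `first` and `second` stand for DataFrames built from two separate record batches; `DataFrame.Append` is the Microsoft.Data.Analysis row-append API):
+
+```csharp
+// Hypothetical sketch of the pattern this guide advises against.
+// `first` and `second` are DataFrames created with DataFrame.FromArrowRecordBatch.
+var combined = first.Clone();
+combined = combined.Append(second.Rows, inPlace: false);
+// String columns in `combined` may be corrupted after this call.
+```
+
+The per-batch loop shown next sidesteps this by keeping each DataFrame separate.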
+
+```csharp
+using var fileReader = new FileReader("sample.parquet");
+using var batchReader = fileReader.GetRecordBatchReader();
+
+var dataFrames = new List<DataFrame>();
+RecordBatch batch;
+
+while ((batch = await batchReader.ReadNextRecordBatchAsync()) != null)
+{
+    using (batch)
+    {
+        var df = DataFrame.FromArrowRecordBatch(batch).Clone();
+        dataFrames.Add(df);
+    }
+}
+
+Console.WriteLine($"Read {dataFrames.Count} batch(es)");
+foreach (var df in dataFrames)
+{
+    Console.WriteLine("\nDataFrame Batch:");
+    Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}");
+    Console.WriteLine(df.Head(5));
+}
+```
+
+## Key Notes
+
+- **Clone to avoid disposal issues:** Each DataFrame should be cloned so that it remains valid after its batch is disposed.
+- **Do not rely on merging Arrow DataFrames:** Appending and combining multiple batches is unreliable, particularly with string columns.
+
+## Writing DataFrames to Parquet
+
+- `ToArrowRecordBatches()` is not reliable for string columns.
+- For safe writing, continue using ParquetSharp.DataFrame:
+
+```csharp
+using var reader = new ParquetSharp.ParquetFileReader("input.parquet");
+var df = reader.ToDataFrame();
+df.ToParquet("output.parquet");
+```
+
+## When to Use Arrow vs ParquetSharp.DataFrame
+
+| Task | Arrow API | ParquetSharp.DataFrame |
+|------|-----------|------------------------|
+| **Reading** Parquet to DataFrame | ✅ Recommended - faster, less memory copying | ✅ Works - simple one-line API |
+| **Writing** DataFrame to Parquet | ❌ Unreliable - fails with string columns | ✅ Recommended - reliable for all column types |
+| **String columns** | ⚠️ Read-only support | ✅ Full read/write support |
+| **Merging batches** | ❌ `Append()` is unreliable | ✅ Works reliably |
+| **Performance** | ⚠️ Faster for reads only | ⚠️ Slower but more reliable |
+| **Use case** | Large file reads, streaming | Writing, string data, combining data |
+
+## Key Takeaways
+
+- **Arrow + `FromArrowRecordBatch()`** is safe and faster for reading Parquet files into DataFrames.
+- **ParquetSharp.DataFrame is more reliable** for writing DataFrames back to Parquet.
+- `ToArrowRecordBatches()` and `Append()` are unreliable for writing or merging batches.
+- **Writing and combining DataFrames** still requires `ParquetSharp.DataFrame`.
+
+## See Also
+
+For more details, check out:
+
+- [ParquetSharp Arrow API Documentation](https://g-research.github.io/ParquetSharp/guides/Arrow.html)
+- [DataFrame.FromArrowRecordBatch Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.fromarrowrecordbatch?view=ml-dotnet-preview)
+- [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview)
diff --git a/docs/index.md b/docs/index.md
index 684f0bce..1492033d 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -162,6 +162,7 @@ For more detailed information on how to use ParquetSharp, see the following guid
 * [Reading Parquet files](guides/Reading.md)
 * [Working with nested data](guides/Nested.md)
 * [Reading and writing Arrow data](guides/Arrow.md) — how to read and write data using the [Apache Arrow format](https://arrow.apache.org/)
+* [Working with DataFrames via Arrow](guides/DataframesViaArrow.md)
 * [Row-oriented API](guides/RowOriented.md) — a higher level API that abstracts away the column-oriented nature of Parquet files
 * [Custom types](guides/TypeFactories.md) — how to customize the mapping between .NET and Parquet types, including using the `DateOnly` and `TimeOnly` types added in .NET 6.