From 3454ebc96866f4b248b2a0b4e64358c28f215e82 Mon Sep 17 00:00:00 2001
From: Fortunate Omonuwa
Date: Mon, 17 Nov 2025 16:01:15 +0100
Subject: [PATCH 1/8] docs: add documentation on how to read and write via arrow

---
 docs/guides/DataframesViaArrow.md | 121 ++++++++++++++++++++++++++++++
 docs/index.md                     |   1 +
 2 files changed, 122 insertions(+)
 create mode 100644 docs/guides/DataframesViaArrow.md

diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md
new file mode 100644
index 00000000..52754f12
--- /dev/null
+++ b/docs/guides/DataframesViaArrow.md
@@ -0,0 +1,121 @@
+
+## Working with DataFrames via Arrow
+
+ParquetSharp now provides built-in Arrow support, offering a more efficient way to work with .NET `DataFrame` objects. By using Arrow as the intermediate format, data can move from Parquet to a DataFrame with minimal overhead and without unnecessary conversions.
+
+### Prerequisites
+
+You'll need these packages:
+```xml
+<PackageReference Include="ParquetSharp" />
+<PackageReference Include="Microsoft.Data.Analysis" />
+```
+
+### Reading Parquet Files to DataFrames
+
+Here's how to read a Parquet file into a DataFrame using Arrow:
+
+```csharp
+using ParquetSharp.Arrow;
+using Microsoft.Data.Analysis;
+using Apache.Arrow;
+
+using var fileReader = new FileReader("data.parquet");
+using var batchReader = fileReader.GetRecordBatchReader();
+
+RecordBatch batch = await batchReader.ReadNextRecordBatchAsync();
+if (batch != null)
+{
+    using (batch)
+    {
+        var dataFrame = DataFrame.FromArrowRecordBatch(batch);
+        Console.WriteLine($"Rows: {dataFrame.Rows.Count}, Columns: {dataFrame.Columns.Count}");
+        Console.WriteLine(dataFrame.Head(5));
+    }
+}
+```
+
+This example reads a single Arrow RecordBatch from the Parquet file and converts it directly into a DataFrame. After conversion, the DataFrame can be inspected or used as needed.
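+
+Before reading any batches, the file's Arrow schema can be inspected directly (a short sketch of the `Schema` property exposed by the ParquetSharp Arrow `FileReader`; the file name is illustrative):
+
+```csharp
+using var fileReader = new FileReader("data.parquet");
+
+// The Arrow schema is available without reading any row data
+var schema = fileReader.Schema;
+foreach (var field in schema.FieldsList)
+{
+    Console.WriteLine($"{field.Name}: {field.DataType.Name}");
+}
+```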
+
+If the file contains multiple batches, they can be read and merged into one DataFrame:
+
+```csharp
+using var fileReader = new FileReader("data.parquet");
+using var batchReader = fileReader.GetRecordBatchReader();
+
+DataFrame combinedDataFrame = null;
+RecordBatch batch;
+
+while ((batch = await batchReader.ReadNextRecordBatchAsync()) != null)
+{
+    using (batch)
+    {
+        var df = DataFrame.FromArrowRecordBatch(batch);
+
+        if (combinedDataFrame == null)
+        {
+            combinedDataFrame = df;
+        }
+        else
+        {
+            combinedDataFrame = combinedDataFrame.Append(df.Rows);
+        }
+    }
+}
+
+var summary = combinedDataFrame.Description();
+Console.WriteLine(summary);
+```
+
+This approach processes the file batch-by-batch. Each batch is converted into a DataFrame, and the individual DataFrames are appended together, producing a single combined dataset.
+
+### Writing DataFrames to Parquet Files
+
+To write a DataFrame to Parquet via Arrow, it is first converted into Arrow RecordBatch objects. The first batch provides the schema required to initialize the writer. All batches are then written sequentially to the output file.
+
+```csharp
+using System.Linq;
+using ParquetSharp.Arrow;
+using Microsoft.Data.Analysis;
+
+// Materialize the batches so the sequence is only enumerated once
+var recordBatches = dataFrame.ToArrowRecordBatches().ToList();
+var firstBatch = recordBatches.FirstOrDefault();
+if (firstBatch == null)
+{
+    return;
+}
+
+using var writer = new FileWriter("output.parquet", firstBatch.Schema);
+
+foreach (var batch in recordBatches)
+{
+    writer.WriteRecordBatch(batch);
+}
+
+writer.Close();
+```
+
+### When to Use Arrow vs ParquetSharp.DataFrame
+
+**Use the Arrow approach when:**
+
+- You want higher performance and reduced memory copying
+- You are working with large or streaming datasets
+- You prefer compatibility with the broader Arrow ecosystem
+
+**You might still use ParquetSharp.DataFrame if:**
+
+- You need the simple one-line API: `parquetReader.ToDataFrame()`
+- You're working with small files where performance doesn't matter
+- You have existing code that already uses it
+
+### Performance Notes
+
+The Arrow approach is faster because DataFrames internally use Arrow's memory layout. When you use ParquetSharp.DataFrame, the data is converted from Parquet → .NET types → DataFrame, but with the Arrow API it goes directly from Parquet → Arrow → DataFrame with zero-copy operations where possible.
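+
+When only part of a file is needed, the Arrow reader can also be limited to specific row groups and columns, so data that never reaches the DataFrame is not decoded (a sketch based on the optional parameters of `GetRecordBatchReader` described in the Arrow guide; the indices here are illustrative):
+
+```csharp
+using var fileReader = new FileReader("data.parquet");
+
+// Only decode row group 0 and the first two columns
+using var batchReader = fileReader.GetRecordBatchReader(
+    rowGroups: new[] { 0 },
+    columns: new[] { 0, 1 });
+```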
+
+## See Also
+
+For more details, check out:
+- [ParquetSharp Arrow API Documentation](https://g-research.github.io/ParquetSharp/guides/Arrow.html)
+- [DataFrame.FromArrowRecordBatch Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.fromarrowrecordbatch?view=ml-dotnet-preview)
+- [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview)
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 684f0bce..1492033d 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -162,6 +162,7 @@ For more detailed information on how to use ParquetSharp, see the following guid
 * [Reading Parquet files](guides/Reading.md)
 * [Working with nested data](guides/Nested.md)
 * [Reading and writing Arrow data](guides/Arrow.md) — how to read and write data using the [Apache Arrow format](https://arrow.apache.org/)
+* [Working with DataFrames via Arrow](guides/DataframesViaArrow.md)
 * [Row-oriented API](guides/RowOriented.md) — a higher level API that abstracts away the column-oriented nature of Parquet files
 * [Custom types](guides/TypeFactories.md) — how to customize the mapping between .NET and Parquet types, including using the `DateOnly` and `TimeOnly` types added in .NET 6.

From 747c7984a7cb6aeba01407af392a30cfc845b772 Mon Sep 17 00:00:00 2001
From: Fortunate Omonuwa
Date: Wed, 19 Nov 2025 16:07:27 +0100
Subject: [PATCH 2/8] docs: update docs

---
 docs/guides/DataframesViaArrow.md | 100 ++++++++++++++----------
 1 file changed, 45 insertions(+), 55 deletions(-)

diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md
index 52754f12..e1a35a64 100644
--- a/docs/guides/DataframesViaArrow.md
+++ b/docs/guides/DataframesViaArrow.md
@@ -1,8 +1,7 @@
 
 ## Working with DataFrames via Arrow
 
-ParquetSharp now provides built-in Arrow support, offering a more efficient way to work with .NET `DataFrame` objects.
By using Arrow as the intermediate format, data can move from Parquet to a DataFrame with minimal overhead and without unnecessary conversions. - +ParquetSharp now provides Arrow-based APIs for reading and working with `.NET DataFrame objects`. Using Arrow can improve performance and reduce unnecessary memory copies. **However, there are limitations**. ### Prerequisites @@ -12,106 +11,97 @@ You'll need these packages: ``` -### Reading Parquet Files to DataFrames +### Reading a Single Batch from Parquet -Here's how to read a Parquet file into a DataFrame using Arrow: +Arrow integration works reliably for reading a single batch. Here's how to read one batch and convert it to a DataFrame: ```csharp using ParquetSharp.Arrow; using Microsoft.Data.Analysis; using Apache.Arrow; -using var fileReader = new FileReader("data.parquet"); +using var fileReader = new FileReader("sample.parquet"); using var batchReader = fileReader.GetRecordBatchReader(); -RecordBatch batch = await batchReader.ReadNextRecordBatchAsync(); +var batch = await batchReader.ReadNextRecordBatchAsync(); if (batch != null) { using (batch) { - var dataFrame = DataFrame.FromArrowRecordBatch(batch); - Console.WriteLine($"Rows: {dataFrame.Rows.Count}, Columns: {dataFrame.Columns.Count}"); - Console.WriteLine(dataFrame.Head(5)); + var df = DataFrame.FromArrowRecordBatch(batch).Clone(); + Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}"); + Console.WriteLine(df.Head(5)); } } ``` -This example reads a single Arrow RecordBatch from the Parquet file and converts it directly into a DataFrame. After conversion, the DataFrame can be inspected or used as needed. +This works reliably for all standard DataFrames. + -If the file contains multiple batches, they can be read and merged into one DataFrame: +### Reading All Batches Separately +For files with multiple batches, each batch can be converted into a DataFrame individually. 
+**Note**: Do not try to merge batches into a single DataFrame using Append; it is unreliable, especially with string columns. ```csharp -using var fileReader = new FileReader("data.parquet"); +using var fileReader = new FileReader("sample.parquet"); using var batchReader = fileReader.GetRecordBatchReader(); -DataFrame combinedDataFrame = null; +var dataFrames = new List(); RecordBatch batch; while ((batch = await batchReader.ReadNextRecordBatchAsync()) != null) { using (batch) { - var df = DataFrame.FromArrowRecordBatch(batch); - - if (combinedDataFrame == null) - { - combinedDataFrame = df; - } - else - { - combinedDataFrame = combinedDataFrame.Append(df.Rows); - } + var df = DataFrame.FromArrowRecordBatch(batch).Clone(); + dataFrames.Add(df); } } -var summary = combinedDataFrame.Description(); -Console.WriteLine(summary); +Console.WriteLine($"Read {dataFrames.Count} batch(es)"); +foreach (var df in dataFrames) +{ + Console.WriteLine("\nDataFrame Batch:"); + Console.WriteLine($"Rows: {df.Rows.Count}, Columns: {df.Columns.Count}"); + Console.WriteLine(df.Head(5)); +} ``` -This approach processes the file batch-by-batch. Each batch is converted into a DataFrame, and the individual DataFrames are appended together, producing a single combined dataset. +### Key Notes -### Writing DataFrames to Parquet Files +- **Clone to avoid disposal issues:** Each DataFrame should be cloned to remain valid after the batch is disposed. -To write a DataFrame to Parquet via Arrow, it is first converted into Arrow RecordBatch objects. The first batch provides the schema required to initialize the writer. All batches are then written sequentially to the output file. +- **Do not rely on merging Arrow DataFrames:** Append and combining multiple batches is unreliable, particularly with string columns. 
-```csharp -using ParquetSharp.Arrow; -using Microsoft.Data.Analysis; +### Writing DataFrames to Parquet -var recordBatches = dataFrame.ToArrowRecordBatches(); -var firstBatch = recordBatches.FirstOrDefault(); -if (firstBatch == null) -{ - return; -} - -using var writer = new FileWriter("output.parquet", firstBatch.Schema); - -foreach (var batch in recordBatches) -{ - writer.WriteRecordBatch(batch); -} - -writer.Close(); +- ToArrowRecordBatches() is not reliable for string columns. +- For safe writing, continue using ParquetSharp.DataFrame: + +```csharp +using var reader = new ParquetSharp.ParquetReader("input.parquet"); +var df = reader.ToDataFrame(); ``` ### When to Use Arrow vs ParquetSharp.DataFrame **Use the Arrow approach when:** -- You want higher performance and reduced memory copying -- You are working with large or streaming datasets -- You prefer compatibility with the broader Arrow ecosystem +- Reading Parquet data into DataFrames and you want a more efficient way to do this. +- You want minimal memory copies and higher read performance. -**You might still use ParquetSharp.DataFrame if:** +**Use ParquetSharp.DataFrame when:** -- You need the simple one-line API: `parquetReader.ToDataFrame()` -- You're working with small files where performance doesn't matter -- You have existing code that already uses it +- Writing DataFrames back to Parquet reliably. +- Your DataFrames include string columns. +- Merging multiple batches into a single DataFrame. -### Performance Notes +### Key Takeaways -The Arrow approach is faster because DataFrames internally use Arrow's memory layout. When you use ParquetSharp.DataFrame, the data gets converted from Parquet → .NET types → DataFrame, but with the Arrow API it goes directly from Parquet → Arrow → DataFrame with zero-copy operations where possible. +- **Arrow + FromArrowRecordBatch()** is safe and faster for reading Parquet files into DataFrames. 
+- **ParquetSharp.DataFrame is more reliable** for writing DataFrames back to Parquet.
+- `ToArrowRecordBatches()` and `Append()` are unreliable for writing or merging batches.
+- **Writing and combining DataFrames** still requires `ParquetSharp.DataFrame`.
 
 ## See Also
 
 For more details, check out:
 - [ParquetSharp Arrow API Documentation](https://g-research.github.io/ParquetSharp/guides/Arrow.html)
 - [DataFrame.FromArrowRecordBatch Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.fromarrowrecordbatch?view=ml-dotnet-preview)
 - [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview)

From f5735f657f556f533cc5d04eab73b2223a87fcf9 Mon Sep 17 00:00:00 2001
From: Fortunate Omonuwa
Date: Wed, 26 Nov 2025 17:18:16 +0100
Subject: [PATCH 3/8] update docs

---
 docs/guides/DataframesViaArrow.md | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md
index e1a35a64..1d9529b5 100644
--- a/docs/guides/DataframesViaArrow.md
+++ b/docs/guides/DataframesViaArrow.md
@@ -85,16 +85,14 @@ var df = reader.ToDataFrame();
 ```
 
 ### When to Use Arrow vs ParquetSharp.DataFrame
 
-**Use the Arrow approach when:**
-
-- Reading Parquet data into DataFrames and you want a more efficient way to do this.
-- You want minimal memory copies and higher read performance.
-
-**Use ParquetSharp.DataFrame when:**
-
-- Writing DataFrames back to Parquet reliably.
-- Your DataFrames include string columns.
-- Merging multiple batches into a single DataFrame.
+| Task | Arrow API | ParquetSharp.DataFrame | +|------|-----------|------------------------| +| **Reading** Parquet to DataFrame | ✅ Recommended - Faster, less memory copying | ✅ Works - Simple one-line API | +| **Writing** DataFrame to Parquet | ❌ Unreliable - Fails with string columns | ✅ Recommended - Reliable for all column types | +| **String columns** | ⚠️ Read-only support | ✅ Full read/write support | +| **Merging batches** | ❌ `Append()` is unreliable | ✅ Works reliably | +| **Performance** | ⚠️ Faster for reads only | ⚠️ Slower but more reliable | +| **Use case** | Large file reads, streaming | Writing, string data, combining data | ### Key Takeaways From c461b7ded371254c8be34249c8dcf69a2d6bfa55 Mon Sep 17 00:00:00 2001 From: Fortunate Omonuwa Date: Wed, 26 Nov 2025 17:32:11 +0100 Subject: [PATCH 4/8] update packages needed on docs --- docs/guides/DataframesViaArrow.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md index 1d9529b5..b830a47c 100644 --- a/docs/guides/DataframesViaArrow.md +++ b/docs/guides/DataframesViaArrow.md @@ -7,8 +7,9 @@ ParquetSharp now provides Arrow-based APIs for reading and working with `.NET Da You'll need these packages: ```xml - - + + + ``` ### Reading a Single Batch from Parquet From ba7662d6339ce1e8b082f458fac003989cc8be79 Mon Sep 17 00:00:00 2001 From: Fortunate Date: Mon, 8 Dec 2025 16:42:40 +0100 Subject: [PATCH 5/8] Update DataframesViaArrow.md --- docs/guides/DataframesViaArrow.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md index b830a47c..febcafaa 100644 --- a/docs/guides/DataframesViaArrow.md +++ b/docs/guides/DataframesViaArrow.md @@ -41,7 +41,8 @@ This works reliably for all standard DataFrames. ### Reading All Batches Separately For files with multiple batches, each batch can be converted into a DataFrame individually. 
-**Note**: Do not try to merge batches into a single DataFrame using Append; it is unreliable, especially with string columns. + +**Note**: Combining multiple batches using `Append()` is unreliable... Particularly with sting columns. ```csharp using var fileReader = new FileReader("sample.parquet"); @@ -107,4 +108,4 @@ var df = reader.ToDataFrame(); For more details, check out: - [ParquetSharp Arrow API Documentation](https://g-research.github.io/ParquetSharp/guides/Arrow.html) - [DataFrame.FromArrowRecordBatch Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.fromarrowrecordbatch?view=ml-dotnet-preview) -- [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview) \ No newline at end of file +- [DataFrame.ToArrowRecordBatches Method](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe.toarrowrecordbatches?view=ml-dotnet-preview) From b12ca5c0afc302324c0be01b391d2ae1761f694a Mon Sep 17 00:00:00 2001 From: Fortunate Date: Mon, 8 Dec 2025 16:51:32 +0100 Subject: [PATCH 6/8] Fix duplicate header in DataframesViaArrow guide Removed duplicate section header for clarity. --- docs/guides/DataframesViaArrow.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md index febcafaa..74ca9d62 100644 --- a/docs/guides/DataframesViaArrow.md +++ b/docs/guides/DataframesViaArrow.md @@ -1,5 +1,4 @@ - -## Working with DataFrames via Arrow +# Working with DataFrames via Arrow ParquetSharp now provides Arrow-based APIs for reading and working with `.NET DataFrame objects`. Using Arrow can improve performance and reduce unnecessary memory copies. **However, there are limitations**. 
From b8d26812214d556b56d83403c73427384d0d5d8b Mon Sep 17 00:00:00 2001 From: Fortunate Date: Mon, 8 Dec 2025 16:53:35 +0100 Subject: [PATCH 7/8] fix: update dataframesviaarrow.md --- docs/guides/DataframesViaArrow.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md index 74ca9d62..e18f77b3 100644 --- a/docs/guides/DataframesViaArrow.md +++ b/docs/guides/DataframesViaArrow.md @@ -2,7 +2,7 @@ ParquetSharp now provides Arrow-based APIs for reading and working with `.NET DataFrame objects`. Using Arrow can improve performance and reduce unnecessary memory copies. **However, there are limitations**. -### Prerequisites +## Prerequisites You'll need these packages: ```xml @@ -11,7 +11,7 @@ You'll need these packages: ``` -### Reading a Single Batch from Parquet +## Reading a Single Batch from Parquet Arrow integration works reliably for reading a single batch. Here's how to read one batch and convert it to a DataFrame: @@ -38,7 +38,7 @@ if (batch != null) This works reliably for all standard DataFrames. -### Reading All Batches Separately +## Reading All Batches Separately For files with multiple batches, each batch can be converted into a DataFrame individually. **Note**: Combining multiple batches using `Append()` is unreliable... Particularly with sting columns. @@ -68,13 +68,13 @@ foreach (var df in dataFrames) } ``` -### Key Notes +## Key Notes - **Clone to avoid disposal issues:** Each DataFrame should be cloned to remain valid after the batch is disposed. - **Do not rely on merging Arrow DataFrames:** Append and combining multiple batches is unreliable, particularly with string columns. -### Writing DataFrames to Parquet +## Writing DataFrames to Parquet - ToArrowRecordBatches() is not reliable for string columns. 
 - For safe writing, continue using ParquetSharp.DataFrame:
 
 ```csharp
 using var reader = new ParquetSharp.ParquetReader("input.parquet");
 var df = reader.ToDataFrame();
 ```
 
-### When to Use Arrow vs ParquetSharp.DataFrame
+## When to Use Arrow vs ParquetSharp.DataFrame
 
 | Task | Arrow API | ParquetSharp.DataFrame |
 |------|-----------|------------------------|
 | **Reading** Parquet to DataFrame | ✅ Recommended - Faster, less memory copying | ✅ Works - Simple one-line API |
 | **Writing** DataFrame to Parquet | ❌ Unreliable - Fails with string columns | ✅ Recommended - Reliable for all column types |
 | **String columns** | ⚠️ Read-only support | ✅ Full read/write support |
 | **Merging batches** | ❌ `Append()` is unreliable | ✅ Works reliably |
 | **Performance** | ⚠️ Faster for reads only | ⚠️ Slower but more reliable |
 | **Use case** | Large file reads, streaming | Writing, string data, combining data |
 
-### Key Takeaways
+## Key Takeaways
 
 - **Arrow + FromArrowRecordBatch()** is safe and faster for reading Parquet files into DataFrames.
 - **ParquetSharp.DataFrame is more reliable** for writing DataFrames back to Parquet.

From 29da0b7373d140e954c48247ac2ee65a8cc42a4c Mon Sep 17 00:00:00 2001
From: Fortunate
Date: Tue, 9 Dec 2025 16:33:20 +0100
Subject: [PATCH 8/8] Fix typo in DataframesViaArrow.md

---
 docs/guides/DataframesViaArrow.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/DataframesViaArrow.md b/docs/guides/DataframesViaArrow.md
index e18f77b3..9a92b974 100644
--- a/docs/guides/DataframesViaArrow.md
+++ b/docs/guides/DataframesViaArrow.md
@@ -41,7 +41,7 @@ This works reliably for all standard DataFrames.
 ## Reading All Batches Separately
 For files with multiple batches, each batch can be converted into a DataFrame individually.
 
-**Note**: Combining multiple batches using `Append()` is unreliable... Particularly with sting columns.
+**Note**: Combining multiple batches using `Append()` is unreliable, particularly with string columns.
 ```csharp
 using var fileReader = new FileReader("sample.parquet");