Inefficient Parquet Conversion with columnify compared to pyarrow

I am working on converting JSONL log files to Parquet format to improve log search capabilities. 
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the [s3-plugin](https://github.com/fluent/fluent-plugin-s3), which uses the `columnify` tool for conversion.

In my quest to find the most efficient conversion method, I conducted tests using two different approaches:
1. I created a custom Python script utilizing the `pandas` and `pyarrow` libraries for JSONL to Parquet conversion.
2. I used the `columnify` tool for the same purpose.

I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:
```
{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}
```
For both methods, I generated GZIP-compressed JSON and Parquet files. The image below shows the resulting Parquet files, including these two:
 - `main_file.log.gz.parquet` (101KB), generated by the Python script (pandas + pyarrow)
 - `main_file1.columnify.parquet` (8.7MB), generated by `columnify`

<img width="798" alt="image" src="https://github.com/reproio/columnify/assets/46885684/930f4670-5c49-47a7-bc22-eb6891c37c04">

As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.

Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (utilizing parquet-go):

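In pyarrow:

 - Default `row_group_size`: 1024 × 1024 rows (maximum of 64 × 1024 × 1024 rows). Note that pyarrow counts this setting in rows, not bytes.
 - Default page size (`data_page_size`): 1MB

In columnify (parquet-go):

 - Default `row_group_size`: 128MB
 - Default `page_size`: 8KB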

So, I adjusted the page size for columnify to 1MB (`-parquetPageSize 1048576`), which reduced the file size from 8.7MB to 438KB. However, changing the row group size option did not reduce the size any further.

I'm seeking help in understanding why the columnify-generated Parquet file (438KB) remains larger than the one generated by the Python script using pyarrow (101KB). Is this due to a limitation in the parquet-go library, or am I missing something in my configuration?

I would appreciate any insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.

---
LINKS
 - [pyarrow doc ref. for page_size and row_group_size](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table)
 - [pyarrow default row group size value](https://github.com/apache/arrow/blob/main/python/pyarrow/_parquet.pyx#L45)
 - [pyarrow default page_size](https://github.com/apache/arrow/blob/main/python/pyarrow/parquet/core.py#L798)
 - [parquet-go row_group_size and page_size](https://github.com/xitongsys/parquet-go/blob/master/writer/writer.go#L63)



