I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the s3-plugin, which uses the columnify tool for conversion.
In my quest to find the most efficient conversion method, I conducted tests using two different approaches:
- I created a custom Python script utilizing the `pandas` and `pyarrow` libraries for JSONL to Parquet conversion (a minimal sketch follows this list).
- I used the `columnify` tool for the same purpose.
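For reference, here is a minimal sketch of the kind of conversion script I used (file names are placeholders; the real script may differ in details):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the JSONL file (one JSON object per line) into a DataFrame
df = pd.read_json("main_file.jsonl", lines=True)

# Convert to an Arrow table and write it as Parquet with pyarrow defaults
table = pa.Table.from_pandas(df)
pq.write_table(table, "main_file.log.gz.parquet")
```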
I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:
{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}
For both methods, I generated GZIP-compressed JSON and Parquet files. The resulting Parquet files were:

- `main_file.log.gz.parquet` (101KB), generated by the Python script (pandas + pyarrow)
- `main_file1.columnify.parquet` (8.7MB), generated by `columnify`
As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.
Upon further investigation, I discovered that the default `row_group_size` and `page_size` settings differ between pyarrow (used in the Python script) and columnify (which uses parquet-go):

In pyarrow:
- Default `row_group_size`: 1MB (maximum of 64MB)
- Default `page_size`: 1MB

In columnify (parquet-go):
- Default `row_group_size`: 128MB
- Default `page_size`: 8KB
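To rule out a misunderstanding on my side, this is how I believe those knobs map onto the pyarrow API (the sample table and file name are made up; the values shown are just the documented defaults):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"level": ["info", "info"], "apiName": ["a", "b"]})

# In pyarrow, row_group_size is a row count, while data_page_size is in bytes.
pq.write_table(
    table,
    "out.parquet",
    row_group_size=1024 * 1024,   # max rows per row group
    data_page_size=1024 * 1024,   # target data page size in bytes
)
```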
So, I adjusted the `page_size` for columnify to 1MB (`-parquetPageSize 1048576`), which reduced the file size from 8.7MB to 438KB. However, modifying the `row_group_size` option did not result in further size reduction.
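For reference, the adjusted invocation looked roughly like this (the schema file and input name are placeholders; `-parquetPageSize` is the only flag I changed from the defaults):

```sh
columnify \
  -schemaType avro \
  -schemaFile logs.avsc \
  -recordType jsonl \
  -parquetPageSize 1048576 \
  main_file.jsonl > main_file1.columnify.parquet
```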
I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one generated by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?

I would appreciate any insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.
Links:
- pyarrow doc ref. for `page_size` and `row_group_size`
- pyarrow default row group size value
- pyarrow default `page_size`
- parquet-go `row_group_size` and `page_size`