In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for large-scale data processing. The Machine Learning team asked me to clean a dataset containing information about orders placed last year, which they plan to use to build a demand forecasting model. To support this, they shared their requirements for the desired output table format.
An analyst shared a parquet file called orders_data.parquet to clean and preprocess.
The dataset schema and the detailed cleaning requirements are described on the DataCamp project page linked in the note at the end of this write-up.
My step-by-step approach throughout the project was:
- Read and loaded the raw orders data.
- Processed and standardized the order date column.
- Cleaned the product information columns for consistency.
- Extracted state details from addresses and counted unique states.
- Prepared and exported the final cleaned dataset for analysis (a PySpark sketch of these steps follows this list).
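As a quick illustration, here is a minimal PySpark sketch of these steps. The column names (`order_date`, `product`, `purchase_address`), the address format, and the specific cleaning rules are assumptions made for illustration; the actual schema and requirements come from the project brief.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("orders_cleaning").getOrCreate()

# Read and load the raw orders data
orders = spark.read.parquet("orders_data.parquet")

# Standardize the order date column
# (assumed: the raw column is a timestamp; time-based features are derived
#  from it before it is truncated to a plain date)
orders = orders.withColumn("order_date", F.to_date(F.col("order_date")))

# Clean product information columns for consistency
# (hypothetical rule: lowercase and trim the product name)
orders = orders.withColumn("product", F.trim(F.lower(F.col("product"))))

# Extract the state from the purchase address and count unique states
# (assumed address format: "street, city, ST zip")
orders = orders.withColumn(
    "purchase_state",
    F.split(F.split(F.col("purchase_address"), ", ").getItem(2), " ").getItem(0),
)
n_states = orders.select("purchase_state").distinct().count()

# Prepare and export the final cleaned dataset
orders.write.mode("overwrite").parquet("orders_data_clean.parquet")
```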
- Before any transformation, the total count for `orders_data.parquet` is 185,950 records.
- Created a `time_of_day` column.
- After removing the rows for orders made at night, the total count is 176,762 records (a PySpark sketch of this step follows the list).
- If you want to see my full cleaning, transformation, and data processing approach, you can see my .py file here: Data Processing File.
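To show how the `time_of_day` column and the night-time filter might be implemented, here is a minimal sketch. It assumes the full order timestamp is still available in `order_date` (i.e. this runs before that column is truncated to a plain date), and the hour boundaries used for "night" are an assumption rather than something taken from the project brief.

```python
import pyspark.sql.functions as F

# Assumed: `orders` still carries the full order timestamp in `order_date`
hour = F.hour(F.col("order_date"))

# Bucket the hour into a time_of_day label; the boundaries are assumptions
orders = orders.withColumn(
    "time_of_day",
    F.when((hour >= 0) & (hour <= 5), "night")
     .when((hour >= 6) & (hour <= 11), "morning")
     .when((hour >= 12) & (hour <= 17), "afternoon")
     .otherwise("evening"),
)
print(orders.count())  # 185,950 records before filtering

# Remove orders placed at night
orders = orders.filter(F.col("time_of_day") != "night")
print(orders.count())  # 176,762 records after the night-time filter
```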
Note
I completed this project on DataCamp. If you want to try it yourself, you can visit the site here: DataCamp PySpark Project.




