In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for large-scale data processing. The Machine Learning team asked me to clean a dataset containing information about orders placed last year, which they plan to use to build a demand forecasting model. To support this, they shared their requirements for the desired output table format.
An analyst shared a parquet file called orders_data.parquet to clean and preprocess.
The dataset schema and the detailed cleaning requirements are described on the DataCamp project page linked in the note at the end of this write-up.
My step-by-step approach throughout the project was:
- Read and loaded the raw orders data.
- Processed and standardized the order date column.
- Cleaned the product information columns for consistency.
- Extracted state details from addresses and counted unique states.
- Prepared and exported the final cleaned dataset for analysis (a PySpark sketch of these steps follows this list).
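As a quick illustration, here is a minimal PySpark sketch of these steps. The column names (`order_date`, `product`, `purchase_address`), the address format, and the specific cleaning rules are assumptions made for illustration; the actual schema and requirements come from the project brief.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("orders_cleaning").getOrCreate()

# Read and load the raw orders data
orders = spark.read.parquet("orders_data.parquet")

# Standardize the order date column
# (assumed: the raw column is a timestamp; time-based features are derived
#  from it before it is truncated to a plain date)
orders = orders.withColumn("order_date", F.to_date(F.col("order_date")))

# Clean product information columns for consistency
# (hypothetical rule: lowercase and trim the product name)
orders = orders.withColumn("product", F.trim(F.lower(F.col("product"))))

# Extract the state from the purchase address and count unique states
# (assumed address format: "street, city, ST zip")
orders = orders.withColumn(
    "purchase_state",
    F.split(F.split(F.col("purchase_address"), ", ").getItem(2), " ").getItem(0),
)
n_states = orders.select("purchase_state").distinct().count()

# Prepare and export the final cleaned dataset
orders.write.mode("overwrite").parquet("orders_data_clean.parquet")
```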
- Before any transformation, the total count for `orders_data.parquet` is 185,950 records.
- Created a `time_of_day` column.
- After removing the rows for orders made at night, the total count is 176,762 records (a PySpark sketch of this step follows the list).
- If you want to see my full cleaning, transformation, and data processing approach, you can see my .py file here: Data Processing File.
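To show how the `time_of_day` column and the night-time filter might be implemented, here is a minimal sketch. It assumes the full order timestamp is still available in `order_date` (i.e. this runs before that column is truncated to a plain date), and the hour boundaries used for "night" are an assumption rather than something taken from the project brief.

```python
import pyspark.sql.functions as F

# Assumed: `orders` still carries the full order timestamp in `order_date`
hour = F.hour(F.col("order_date"))

# Bucket the hour into a time_of_day label; the boundaries are assumptions
orders = orders.withColumn(
    "time_of_day",
    F.when((hour >= 0) & (hour <= 5), "night")
     .when((hour >= 6) & (hour <= 11), "morning")
     .when((hour >= 12) & (hour <= 17), "afternoon")
     .otherwise("evening"),
)
print(orders.count())  # 185,950 records before filtering

# Remove orders placed at night
orders = orders.filter(F.col("time_of_day") != "night")
print(orders.count())  # 176,762 records after the night-time filter
```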
Note
I completed this project on DataCamp. If you want to try it yourself, you can visit the site here: DataCamp PySpark Project.




