Cleaned and Processed an E-Commerce Orders Dataset using PySpark.

muhammadrauhan/Project-using-PySpark

Cleaning an E-Commerce Orders Dataset with PySpark

[Image: data check]

💡 About

In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for data processing. A peer Machine Learning team asked me to clean a dataset containing information about orders made last year. They plan to use the cleaned data to build a demand forecasting model, so they shared their requirements for the desired output table format.

An analyst shared a Parquet file called orders_data.parquet for me to clean and preprocess.

You can see the dataset schema below along with the cleaning requirements:

[Image: dataset schema and cleaning requirements]

🧩 Project Approach

My step-by-step approach throughout the project was:

  • Read and loaded the raw orders data.
  • Processed and standardized the order date column.
  • Cleaned product information columns for consistency.
  • Extracted state details from addresses and counted unique states.
  • Prepared and exported the final cleaned dataset for analysis.
  • Loaded the Parquet file into a PySpark DataFrame:

    [Image: data-head]

    The raw orders_data.parquet file contains 185,950 records before any transformation is applied.

  • Processed the order_date column:

    Created a time_of_day column.

    [Image: column]
    • Removed the rows for orders placed at night:

      [Image: rem-row]

      After removing the orders placed at night, 176,762 records remain.


To see the rest of my cleaning, transformation, and data processing steps, check my .py file here: Data Processing File.




Note

I completed this project on DataCamp. If you want to try it yourself, you can find it on the site as the DataCamp PySpark Project.
