Big Data Project - Apache Spark and Java
1. Apache Spark introduction
2. Getting Started with Spark
3. Spark Dataframe basic operations
4. Spark Dataframe advanced operations
5. Spark SQL and other functionalities
6. Big data batching application
7. Deploy and cluster execution
8. Monitoring and performance fundamentals
Apache Spark introduction
Enter MapReduce
Spark arrives
Core components and architecture
Spark and the batch data processing model
Distributed processing model
Getting Started with Spark and Java
Spring Boot CLI application
Spark Dataframe basic operations
Transformation and action
Transformation (I): Map and Filter
Transformation (II): FlatMap and Distinct
Action (I): Count, Take and Collect
Action (II): Reduce and Aggregation (Max, Min, Mean)
Deep dive: Internals of Spark execution
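The key idea behind this chapter's split between transformations (Map, Filter, FlatMap, Distinct) and actions (Count, Take, Collect, Reduce) is laziness: transformations only describe a computation, and nothing runs until an action is invoked. As a rough local analogy only (this is `java.util.stream`, not the Spark API), intermediate stream operations behave like transformations and terminal operations like actions:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyVsEager {
    // Counts how many source elements actually flow through the pipeline.
    static final AtomicInteger touched = new AtomicInteger();

    static long runPipeline() {
        // Intermediate operations are lazy, like Spark transformations:
        // building this pipeline touches no data at all.
        Stream<Integer> pipeline = List.of(1, 2, 2, 3, 4).stream()
                .peek(x -> touched.incrementAndGet()) // records each element processed
                .map(x -> x * 10)      // ~ map transformation
                .filter(x -> x >= 20)  // ~ filter transformation
                .distinct();           // ~ distinct transformation

        // At this point touched.get() is still 0: nothing has executed yet.
        // The terminal operation (~ a Spark action such as count) forces
        // the whole pipeline to run over the data.
        return pipeline.count();
    }

    public static void main(String[] args) {
        long n = runPipeline();
        System.out.println("count=" + n + ", elements touched=" + touched.get());
    }
}
```

The analogy is deliberately loose: in Spark the pipeline is additionally compiled into a distributed plan of stages and tasks, which is what the "Internals of Spark execution" deep dive covers.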
Spark Dataframe advanced operations
Data partitioning and shuffling
Transformation (III): GroupBy and GroupByKey
Transformation (IV): Join
Transformation (V): Union, UnionByName, UnionAll and DropDuplicates
Sharing data in a cluster: Accumulators and Broadcast variables
UDFs: User-defined functions
Spark SQL and other functionalities
1. CSV
2. JSON Lines
3. JSON
4. Text
5. XML
6. Parquet
7. Delta table (Parquet with a transaction log)
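One reason this list treats JSON Lines and JSON as separate formats: a JSON Lines file holds one complete JSON document per line, so a distributed reader can split the file on newline boundaries, whereas a single JSON array must be parsed as a whole. A minimal illustration in plain Java (no Spark; the sample records are made up for the sketch):

```java
import java.util.List;

public class JsonLinesSplit {
    // JSON Lines: every line is a self-contained record, so the file can be
    // partitioned at newline offsets without parsing the whole document.
    static final String JSONL = """
            {"id":1,"name":"a"}
            {"id":2,"name":"b"}
            {"id":3,"name":"c"}""";

    static List<String> records(String jsonl) {
        return jsonl.lines().toList();
    }

    public static void main(String[] args) {
        System.out.println(records(JSONL).size() + " independent records");
        // A plain JSON array ( [ {...}, {...} ] ) has no such property: a reader
        // landing mid-file cannot find a record boundary without a full parse,
        // which is why single large JSON files split poorly across executors.
    }
}
```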
Big data batching application
1. The application architecture ecosystem
Management and scheduling tier
Logging and monitoring tier
2. Cloud architecture - AWS
Deploy and cluster execution
Monitoring and performance fundamentals