Hands-on Data Engineering in Palantir Foundry - Jyotin Padhi
1️⃣ Introduction to Big Data & PySpark
What is Big Data?
Hadoop ecosystem overview
Spark vs Hadoop MapReduce
Installation & environment setup
Introduction to PySpark architecture
2️⃣ PySpark Core Concepts
RDDs (Resilient Distributed Datasets)
Transformations & actions
Lazy evaluation
RDD persistence & optimization
3️⃣ PySpark DataFrames & SQL
DataFrame creation & operations
Schema definition
Importing CSV, JSON, Parquet, ORC
Spark SQL basics
SQL queries on large datasets
Window functions
4️⃣ Data Processing & ETL with PySpark
Data cleaning
Handling nulls & duplicates
Joins & aggregations
User-defined functions (UDFs)
File formats & partitioning
ETL pipelines with PySpark
5️⃣ Big Data Analytics with PySpark
Exploratory data analysis
Distributed computing principles
Performance optimization techniques
Caching & checkpointing
Cluster management basics
6️⃣ PySpark MLlib (Basics)
Basic ML algorithms with PySpark
Feature engineering in Spark
Pipelines & model evaluation
7️⃣ Real-Time & Batch Processing (Optional Module)
Introduction to Spark Streaming
Structured Streaming concepts
Batch processing workflows
8️⃣ Hands-on Projects
ETL pipeline for large datasets
Analytics dashboard-ready dataset creation
Big Data business case implementation

