Cloudera Data Scientist Training

Using scenarios and datasets from a fictional technology company, students discover insights to support critical business decisions and develop data products to transform the business.

The material is presented through a sequence of brief lectures, interactive demonstrations, extensive hands-on exercises, and discussions. The Apache Spark™ demonstrations and exercises are conducted in Python (with PySpark) and R (with sparklyr) using the Cloudera Data Science Workbench (CDSW) environment.


    1:1 Coaching

    24/7 Support

    Cloud Labs

    High Success Rate

    Globally Renowned Trainer

    Real-time code analysis and feedback

    Course Description

    This four-day workshop covers enterprise data science and machine learning using Apache Spark in Cloudera Data Science Workbench (CDSW). Participants use Spark SQL to load, explore, cleanse, join, and analyze data and Spark MLlib to specify, train, evaluate, tune, and deploy machine learning pipelines.

    They dive into the foundations of the Spark architecture and execution model necessary to effectively configure, monitor, and tune their Spark applications. Participants also learn how Spark integrates with key components of the Cloudera platform such as HDFS, YARN, Hive, Impala, and Hue as well as their favourite Python or R packages.

    Learning Objectives

    • How to use Apache Spark to run data science and machine learning workflows at scale
    • How to use Spark SQL and DataFrames to work with structured data
    • How to use MLlib, Spark’s machine learning library
    • How to use PySpark, Spark’s Python API
    • How to use sparklyr, a dplyr-compatible R interface to Spark
    • How to use Cloudera Data Science Workbench (CDSW)
    • How to use other Cloudera platform components including HDFS, Hive, Impala, and Hue

    Certification Curriculum

    Module 1
    Data Science Overview
    • What Data Scientists Do
    • What Process Data Scientists Use
    • What Tools Data Scientists Use
    Module 2
    Cloudera Data Science Workbench (CDSW)
    • Introduction to Cloudera Data Science Workbench
    • How Cloudera Data Science Workbench Works
    • How to Use Cloudera Data Science Workbench
    • Entering Code
    • Getting Help
    • Accessing the Linux Command Line
    • Working with Python Packages
    • Formatting Session Output
    Module 6
    Case Study
    • DuoCar
    • How DuoCar Works
    • DuoCar Datasets
    • DuoCar Business Goals
    • DuoCar Data Science Platform
    • DuoCar Cloudera EDH Cluster
    • HDFS
    • Apache Spark
    • Apache Hive
    • Apache Impala
    • Hue
    • YARN
    • DuoCar Cluster Architecture
    Module 7
    Apache Spark
    • Apache Spark
    • How Spark Works
    • The Spark Stack
    • Spark SQL
    • DataFrames
    • File Formats in Apache Spark
    • Text File Formats
    • Parquet File Format
    Module 8
    Summarizing and Grouping DataFrames
    • Summarizing Data with Aggregate Functions
    • Grouping Data
    • Pivoting Data
    Module 9
    Window Functions
    • Introduction to Window Functions
    • Creating a Window Specification
    • Aggregating over a Window Specification
    Module 10
    Exploring DataFrames
    • Possible Workflows for Big Data
    • Exploring a Single Variable
    • Exploring a Categorical Variable
    • Exploring a Continuous Variable
    • Exploring a Pair of Variables
    • Categorical-Categorical Pair
    • Categorical-Continuous Pair
    • Continuous-Continuous Pair
    Module 11
    Apache Spark Job Execution
    • DataFrame Operations
    • Input Splits
    • Narrow Operations
    • Wide Operations
    • Stages and Tasks
    • Shuffle
    Module 12
    Processing Text and Training and Evaluating Topic Models
    • Introduction to Topic Models
    • Scenario
    • Extracting and Transforming Features
    • Parsing Text Data
    • Removing Common (Stop) Words
    • Counting the Frequency of Words
    • Specifying a Topic Model
    • Training a Topic Model Using Latent Dirichlet Allocation (LDA)
    • Assessing the Topic Model Fit
    • Examining a Topic Model
    • Applying a Topic Model
    Module 13
    Training and Evaluating Recommender Models
    • Introduction to Recommender Models
    • Scenario
    • Preparing Data for a Recommender Model
    • Specifying a Recommender Model
    • Training a Recommender Model Using Alternating Least Squares
    • Examining a Recommender Model
    • Applying a Recommender Model
    • Evaluating a Recommender Model
    • Generating Recommendations
    Spark Interface Languages
    • PySpark
    • Data Science with PySpark
    • sparklyr
    • dplyr and sparklyr
    • Comparison of PySpark and sparklyr
    • How sparklyr Works with dplyr
    • sparklyr DataFrame and MLlib Functions
    • When to Use PySpark and sparklyr
    Module 14
    Running a Spark Application from CDSW
    • Overview
    • Starting a Spark Application
    • Reading Data into a Spark SQL DataFrame
    • Examining the Schema of a DataFrame
    • Computing the Number of Rows and Columns of a DataFrame
    • Examining Rows of a DataFrame
    • Stopping a Spark Application
    Module 15
    Inspecting a Spark SQL DataFrame
    • Overview
    • Inspecting a DataFrame
    • Inspecting a DataFrame Column
    • Inspecting a Primary Key Variable
    • Inspecting a Categorical Variable
    • Inspecting a Numerical Variable
    • Inspecting a Date and Time Variable
    Module 16
    Transforming DataFrames
    • Spark SQL DataFrames
    • Working with Columns
    • Selecting Columns
    • Dropping Columns
    • Specifying Columns
    • Adding Columns
    • Changing the Column Name
    • Changing the Column Type
    Module 17
    Monitoring, Tuning, and Configuring Spark Applications
    • Monitoring Spark Applications
    • Persisting DataFrames
    • Partitioning DataFrames
    • Configuring the Spark Environment
    Module 18
    Machine Learning Overview
    • Machine Learning
    • Underfitting and Overfitting
    • Model Validation
    • Hyperparameters
    • Supervised and Unsupervised Learning
    • Machine Learning Algorithms
    • Machine Learning Libraries
    • Apache Spark MLlib
    Module 19
    Training and Evaluating Regression Models
    • Introduction to Regression Models
    • Scenario
    • Preparing the Regression Data
    • Assembling the Feature Vector
    • Creating a Train and Test Set
    • Specifying a Linear Regression Model
    • Training a Linear Regression Model
    • Examining the Model Parameters
    • Examining Various Model Performance Measures
    • Examining Various Model Diagnostics
    • Applying the Linear Regression Model to the Test Data
    • Evaluating the Linear Regression Model on the Test Data
    • Plotting the Linear Regression Model
    Module 20
    Working with Machine Learning Pipelines
    • Specifying Pipeline Stages
    • Specifying a Pipeline
    • Training a Pipeline Model
    • Querying a Pipeline Model
    • Applying a Pipeline Model
    Module 21
    Deploying Machine Learning Pipelines
    • Saving and Loading Pipelines and Pipeline Models in Python
    • Loading Pipelines and Pipeline Models in Scala
    Module 22
    Working with Rows
    • Working with Rows
    • Ordering Rows
    • Selecting a Fixed Number of Rows
    • Selecting Distinct Rows
    • Filtering Rows
    • Sampling Rows
    • Working with Missing Values
    Module 23
    Transforming DataFrame Columns
    • Spark SQL Data Types
    • Working with Numerical Columns
    • Working with String Columns
    • Working with Date and Timestamp Columns
    • Working with Boolean Columns
    Module 24
    Complex Types
    • Complex Collection Data Types
    • Arrays
    • Maps
    • Structs
    Module 25
    User-Defined Functions
    • User-Defined Functions
    • Defining a Python Function
    • Registering a Python Function as a User-Defined Function
    • Applying a User-Defined Function
    Module 26
    Reading and Writing Data
    • Reading and Writing Data
    • Working with Delimited Text Files
    • Working with Text Files
    • Working with Parquet Files
    • Working with Hive Tables
    • Working with Object Stores
    • Working with pandas DataFrames
    Module 27
    Combining and Splitting DataFrames
    • Joining DataFrames
    • Cross Join
    • Inner Join
    • Left Semi Join
    • Left Anti Join
    • Left Outer Join
    • Right Outer Join
    • Full Outer Join
    • Applying Set Operations to DataFrames
    • Splitting a DataFrame
    Module 28
    Training and Evaluating Classification Models
    • Introduction to Classification Models
    • Scenario
    • Preprocessing the Modeling Data
    • Generating a Label
    • Extracting, Transforming, and Selecting Features
    • Creating Train and Test Sets
    • Specifying a Logistic Regression Model
    • Training the Logistic Regression Model
    • Examining the Logistic Regression Model
    • Evaluating Model Performance on the Test Set
    Module 29
    Tuning Algorithm Hyperparameters Using Grid Search
    • Requirements for Hyperparameter Tuning
    • Specifying the Estimator
    • Specifying the Hyperparameter Grid
    • Specifying the Evaluator
    • Tuning Hyperparameters Using Holdout Cross-Validation
    • Tuning Hyperparameters Using K-Fold Cross-Validation
    Module 30
    Training and Evaluating Clustering Models
    • Introduction to Clustering
    • Scenario
    • Preprocessing the Data
    • Extracting, Transforming, and Selecting Features
    • Specifying a Gaussian Mixture Model
    • Training a Gaussian Mixture Model
    • Examining the Gaussian Mixture Model
    • Plotting the Clusters
    • Exploring the Cluster Profiles
    • Saving and Loading the Gaussian Mixture Model