reading-notes

Machine Learning Intro

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

Chapter 1: Bird’s Eye View

Chapter 2: Exploratory Analysis

The purpose of displaying examples from the dataset is not to perform rigorous analysis. Instead, it’s to get a qualitative “feel” for the dataset.

Plot Numerical Distributions

Here are a few things to look out for:

[figure: numerical distribution plots]
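As a minimal sketch of this step: the summary statistics behind a distribution plot can be computed directly with pandas (the DataFrame and column names below are invented for illustration). `describe()` surfaces the same suspicious minimums, maximums, and outliers a histogram would reveal visually.

```python
import pandas as pd

# Hypothetical dataset; column names are made up for illustration.
df = pd.DataFrame({
    "price": [120, 135, 150, 145, 2000],   # note the suspicious outlier
    "sqft": [900, 1100, 1250, 1200, 1150],
})

# Min/max in the summary reveal impossible values and potential outliers.
summary = df.describe()
print(summary.loc["max", "price"])

# For actual plots: df.hist() draws one histogram per numerical column.
```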

Plot Categorical Distributions

[figure: categorical distribution bar charts]
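A quick sketch of the same idea in pandas (the feature values are invented): the class counts behind a bar chart come straight from `value_counts()`, which also makes sparse classes easy to spot.

```python
import pandas as pd

# Hypothetical categorical feature.
s = pd.Series(["brick", "brick", "wood", "concrete", "brick", "wood"])

# Class counts are the data behind a bar chart of the distribution.
counts = s.value_counts()
print(counts)

# For a plot: counts.plot(kind="bar") draws the bar chart.
```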

Plot Segmentations

Here are a few insights you could draw from the following chart:

[chart: segmentation plot]
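A segmentation like the one charted above can be sketched with a pandas `groupby` (data and column names invented): it cuts a numerical feature by a categorical one, which is the same information a box plot shows visually.

```python
import pandas as pd

# Hypothetical data: segment a numerical feature by a categorical one.
df = pd.DataFrame({
    "property_type": ["house", "house", "apartment", "apartment"],
    "price": [300, 340, 200, 220],
})

# Mean price within each segment.
seg = df.groupby("property_type")["price"].mean()
print(seg)

# A box plot per segment shows the same comparison visually.
```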

Study Correlations
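Correlations between numerical features (and the target) can be computed directly with pandas; here is a toy sketch with invented columns, where `price` is constructed to be perfectly linear in `sqft`.

```python
import pandas as pd

# Hypothetical data; price is exactly 0.2 * sqft here, so corr = 1.0.
df = pd.DataFrame({
    "sqft": [900, 1100, 1300, 1500],
    "price": [180, 220, 260, 300],
    "age": [30, 25, 12, 4],
})

# Pairwise Pearson correlations between all numerical columns.
corr = df.corr()
print(corr.loc["sqft", "price"])

# A heatmap of corr is a common way to visualize the full matrix.
```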

Chapter 3: Data Cleaning

Fix Structural Errors

Structural errors are those that arise during measurement, data transfer, or other types of “poor housekeeping.”

Example:

[figure: structural errors before and after cleaning]
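One common kind of structural error is inconsistent capitalization and typos in a categorical feature. A minimal pandas sketch (the values below are invented): standardize case first, then map known variants onto a canonical label.

```python
import pandas as pd

# Hypothetical feature with inconsistent capitalization and a variant label.
s = pd.Series(["composition", "Composition", "asphalt",
               "Asphalt", "asphalt, Shake"])

# Standardize case, then fix known variants with an explicit mapping.
cleaned = s.str.lower().replace({"asphalt, shake": "asphalt"})
print(cleaned.value_counts())
```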

Handle Missing Data

Missing data is a deceptively tricky issue in applied machine learning.

The two most commonly recommended ways of dealing with missing data are both flawed. They are:

  1. Dropping observations that have missing values
  2. Imputing the missing values based on other observations
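One commonly suggested alternative (a sketch, not necessarily the author's recommendation) is to flag the missingness explicitly so the model can learn from it, and then fill with a neutral value. Column names below are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical feature with missing values.
df = pd.DataFrame({"garage_sqft": [400.0, np.nan, 380.0, np.nan]})

# Flag missingness so the model can use it as a signal...
df["garage_sqft_missing"] = df["garage_sqft"].isnull().astype(int)

# ...then fill with 0 (assuming missing means "no garage" here).
df["garage_sqft"] = df["garage_sqft"].fillna(0)
print(df)
```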

Chapter 4: Feature Engineering

Feature engineering is about creating new input features from your existing ones.

This is often one of the most valuable tasks a data scientist can do to improve model performance.

Infuse Domain Knowledge


For example, if you suspect that prices would be affected during a particular period, you could create an indicator variable for transactions during that period.

Indicator variables are binary variables that can be either 0 or 1.

They “indicate” if an observation meets a certain condition, and they are very useful for isolating key properties.
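A minimal sketch of creating an indicator variable in pandas (the column name and date range are invented, standing in for whatever period you suspect matters):

```python
import pandas as pd

# Hypothetical transactions; the affected period is assumed for illustration.
df = pd.DataFrame({"tx_year": [2005, 2011, 2012, 2015]})

# 1 if the sale happened during the assumed period, else 0.
df["during_period"] = df["tx_year"].between(2010, 2013).astype(int)
print(df["during_period"].tolist())  # [0, 1, 1, 0]
```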

Create Interaction Features

Combine Sparse Classes

Sparse classes (in categorical features) are those that have very few total observations. They can be problematic for certain machine learning algorithms, causing models to be overfit.
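A sketch of combining sparse classes with pandas (values and the count threshold are invented): classes below a minimum count get lumped into a single catch-all label.

```python
import pandas as pd

s = pd.Series(["wood", "wood", "brick", "brick", "brick", "metal", "stone"])

# Lump classes with fewer than 2 observations into "other".
counts = s.value_counts()
sparse = counts[counts < 2].index
combined = s.replace(dict.fromkeys(sparse, "other"))
print(combined.value_counts())
```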

Add Dummy Variables

Dummy variables are a set of binary (0 or 1) variables that each represent a single class from a categorical feature.
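In pandas, dummy variables come from `get_dummies`; a small sketch with an invented feature:

```python
import pandas as pd

df = pd.DataFrame({"exterior": ["brick", "wood", "brick"]})

# One binary column per class of the categorical feature.
dummies = pd.get_dummies(df, columns=["exterior"])
print(list(dummies.columns))  # ['exterior_brick', 'exterior_wood']
```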

Chapter 5: Algorithm Selection

Why Linear Regression is Flawed

Simple linear regression suffers from two major flaws:

  1. It’s prone to overfit with many input features.
  2. It cannot easily express non-linear relationships.

Regularization in Machine Learning

Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients.

Regularized Regression Algos

There are 3 common types of regularized linear regression algorithms:

  1. Lasso Regression

    LASSO stands for Least Absolute Shrinkage and Selection Operator.

  2. Ridge Regression
  3. Elastic-Net

    Elastic-Net is a compromise between Lasso and Ridge.
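All three are available in scikit-learn; here is a toy sketch on synthetic data (penalty strengths chosen arbitrarily) where only the first two features actually drive the target, so Lasso's L1 penalty can zero out the irrelevant coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in this toy target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: can zero out coefficients
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print(np.round(lasso.coef_, 2))
```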

Decision Tree Algos

Decision trees model data as a “tree” of hierarchical branches. They make branches until they reach “leaves” that represent predictions.

[figure: decision tree diagram]

Due to their branching structure, decision trees can easily model nonlinear relationships.
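A quick sketch of that nonlinear flexibility with scikit-learn (toy sine-shaped data that a straight line could not fit well):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy nonlinear relationship.
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X[:, 0])

# A shallow tree approximates the curve with piecewise-constant leaves.
tree_model = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(round(tree_model.score(X, y), 2))
```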

Tree Ensembles

Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:

  1. Bagging, which trains many models on random subsets of the data and averages their predictions (e.g. random forests).
  2. Boosting, which trains models sequentially, each one correcting the errors of the previous ones (e.g. gradient boosting).
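A minimal scikit-learn sketch of the two common tree-ensemble approaches, bagging (random forest) and boosting (gradient boosting), on toy nonlinear data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Bagging: a random forest averages many de-correlated trees.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting: gradient boosting fits trees sequentially on residuals.
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

print(round(rf.score(X, y), 2))
```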

Chapter 6: Model Training

How to Train ML Models

Most of a data scientist’s time goes into the steps leading up to this one:

  1. Exploring the data.
  2. Cleaning the data.
  3. Engineering new features.

Split Dataset

[figure: train/test split of the dataset]

Training sets are used to fit and tune your models. Test sets are put aside as “unseen” data to evaluate your models.

Comparing test vs. training performance allows us to avoid overfitting… If the model performs very well on the training data but poorly on the test data, then it’s overfit.
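The split itself is one call in scikit-learn; a sketch on toy data (an 80/20 split, with a fixed seed so the split is reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the data as an "unseen" test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```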

What are Hyperparameters?

Model parameters are learned attributes that define individual models, such as regression coefficients and decision tree split locations.

Hyperparameters express “higher-level” structural settings for algorithms, such as the strength of the penalty used in regularized regression, or the number of trees to include in a random forest.

What is Cross-Validation?

These are the steps for 10-fold cross-validation:

  1. Split your data into 10 equal parts, or “folds”.
  2. Train your model on 9 folds (e.g. the first 9 folds).
  3. Evaluate it on the 1 remaining “hold-out” fold.
  4. Perform steps (2) and (3) 10 times, each time holding out a different fold.
  5. Average the performance across all 10 hold-out folds.
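The steps above are exactly what `cross_val_score` automates; a sketch on synthetic data (model and penalty chosen arbitrarily), asking for 10 folds and averaging the 10 hold-out scores:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=50)

# 10-fold CV: fit on 9 folds, score on the held-out fold, 10 times.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=10)
print(len(scores), round(scores.mean(), 2))
```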

Fit and Tune Models

At the end of this process, you will have a cross-validated score for each set of hyperparameter values… for each algorithm.

For example:

[figure: cross-validated scores per set of hyperparameter values]
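This fit-and-tune loop is what `GridSearchCV` wraps up: it cross-validates every candidate hyperparameter value and keeps the best one. A sketch on synthetic data (the candidate alphas are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Cross-validated score for each candidate hyperparameter value.
grid = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```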

Why is Machine Learning Important?

It’s used in: