These are notes for the Full Stack Deep Learning bootcamp course; the images and content are from their slides. For more detail, please visit their website.
Introduction
The history of deep learning:
Why Now
Three factors:
- Data
- Compute
- Some new ideas
Setting Up Machine Learning Projects
Lifecycle of a ML Project
Prioritizing Projects & Choosing Goals
A general framework for prioritizing projects:
The Economics of AI
- AI reduces cost of prediction
- Prediction is central for decision making
- Cheap prediction means:
- Prediction will be everywhere
- Even in problems where it was too expensive before
- Implication: Look for projects where cheap prediction will have a huge business impact
From Prediction Machines: The Simple Economics of Artificial Intelligence (Agrawal, Gans, Goldfarb)
Software 2.0
From Andrej Karpathy:
- Software 1.0 = Traditional programs with explicit instructions.
- Software 2.0 = Humans specify goals, and an algorithm searches for a program that works
- 2.0 programmers work with datasets, which get compiled via optimization
- Implication: Look for complicated rule-based software where we can learn the rules instead of programming them
Assessing Feasibility of ML Projects
- ML project costs tend to scale super-linearly in the accuracy requirement
- Product design can reduce need for accuracy
Key points for prioritizing projects
- To find high-impact ML problems, look for complex parts of your pipeline and places where cheap prediction is valuable
- The cost of ML projects is primarily driven by data availability, but your accuracy requirement also plays a big role
Choosing Metrics
Accuracy, precision, and recall:
- Accuracy = Correct / Total = (5+45)/100 = 50%
- Precision = true positives / (true positives + false positives) = 45/(5+45) = 90%
- Recall = true positives / actual YES = 45/(45+45) = 50%
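A minimal sketch of these calculations in Python, using the confusion-matrix counts implied by the example above (45 true positives, 5 false positives, 45 false negatives, 5 true negatives):

```python
# Confusion-matrix counts from the example above (100 examples total)
tp, fp, fn, tn = 45, 5, 45, 5

accuracy = (tp + tn) / (tp + fp + fn + tn)  # (45 + 5) / 100 = 0.50
precision = tp / (tp + fp)                  # 45 / 50 = 0.90
recall = tp / (tp + fn)                     # 45 / 90 = 0.50

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```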
No single metric alone tells you which model is best, so we can combine several metrics into one number:
- Simple average / weighted average
- Threshold n-1 metrics, evaluate the nth
- More complex / domain-specific formula
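A sketch of the "threshold n-1 metrics, evaluate the nth" idea; the recall floor of 0.6 and the model entries are made up for illustration:

```python
def pick_model(models, recall_floor=0.6):
    """Keep only models whose recall clears a threshold, then rank by precision.

    `models` is a list of dicts like {"name": ..., "precision": ..., "recall": ...}.
    The 0.6 recall floor is an arbitrary example value, not a recommendation.
    """
    eligible = [m for m in models if m["recall"] >= recall_floor]
    return max(eligible, key=lambda m: m["precision"]) if eligible else None

best = pick_model([
    {"name": "model_a", "precision": 0.90, "recall": 0.50},
    {"name": "model_b", "precision": 0.80, "recall": 0.70},
])
print(best["name"])  # model_b: model_a fails the recall threshold
```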
For domain-specific metrics like mAP:
The results using the combined metric are shown below:
Key Points for Choosing a Metric
- The real world is messy; you usually care about lots of metrics
- ML systems work best when optimizing a single number
- Pick a formula for combining metrics; this formula can and will change
Choosing Baselines
Where to Look for Baselines
- External Baselines
- Business / engineering requirements
- Published results: make sure comparison is fair
- Internal Baselines
- Scripted baselines
- OpenCV scripts
- Rules-based methods
- Simple ML baselines
- Standard feature-based models
- Linear classifier with hand-engineered features
- Basic neural network model
- Human baselines
How to Create Good Human Baselines
Key Points for Choosing Baselines
- Baselines give you a lower bound on expected model performance
- The tighter the lower bound, the more useful the baseline
Infrastructure & Tooling
Full View of the DL Infrastructure
GPU Comparison Table
Performance
All-in-one Solutions
Data Management
- Most DL applications require lots of labeled data
- RL through self-play and GANs do not, but these approaches are not yet practical
- Publicly available datasets = No competitive advantage
- But can serve as starting point
Roadmap
Data Labeling
- User interfaces
- Sources of labor
- Service companies
Conclusions:
- Outsource to full-service company if you can afford it
- If not, then at least use existing software
- Hiring part-time makes more sense than trying to make crowd-sourcing work
Data Storage
- Building blocks
- Filesystem
- Object Storage
- Database
- “Data Lake”
- What goes where
- Binary data (images, sound files, compressed texts) is stored as objects
- Metadata (labels, user activity) is stored in a database
- If you need features that are not obtainable from the database (e.g., logs), set up a data lake and a process to aggregate the needed data
- At training time, copy the needed data onto the filesystem (local or networked)
Data Versioning
- Level 0: unversioned
- Level 1: versioned via snapshot at training time
- Level 2: versioned as a mix of assets and code
- Level 3: Specialized data versioning solution
Data Workflows
ML Teams
- The AI talent gap
- ML-Related roles
- ML team structures
- The hiring process
Breakdown of Job Function by Role
What Skills are needed for the Roles
Troubleshooting Deep Neural Networks
Why is your model's performance worse than expected?
- Implementation bugs
- Hyperparameter choices
- Data / model fit
- Dataset construction
- Not enough data
- Class imbalance
- Noisy labels
- train/test from different distributions
- etc
Performance can be very sensitive to the learning rate.
Strategy for DL Troubleshooting
Quick Summary
Starting Simple
- Choose a simple architecture
- Use sensible defaults
- Normalize inputs
- Simplify the problem
Choose a Simple Architecture
Architecture Selection:
Dealing with multiple input modalities:
- Map each into a lower dimensional feature space
- Concatenate
- Pass through fully connected layers to output
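A minimal PyTorch sketch of this pattern; the two modalities (an image and a tabular feature vector), the layer sizes, and the class count are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TwoModalityNet(nn.Module):
    """Map each input modality to a low-dimensional embedding, concatenate, then FC layers."""

    def __init__(self, num_tabular_features=16, embed_dim=64, num_classes=10):
        super().__init__()
        # Image branch: small conv encoder -> embed_dim features
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Tabular branch: small MLP -> embed_dim features
        self.tabular_encoder = nn.Sequential(
            nn.Linear(num_tabular_features, embed_dim), nn.ReLU(),
        )
        # Fully connected head over the concatenated embeddings
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, image, tabular):
        fused = torch.cat([self.image_encoder(image), self.tabular_encoder(tabular)], dim=1)
        return self.head(fused)
```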
Use Sensible Defaults
- Optimizer: Adam optimizer with learning rate 3e-4
- Activations: ReLU (FC and Conv models), tanh (LSTMs)
- Initialization: He et al. normal (ReLU), Glorot normal (tanh)
- Regularization: None
- Data normalization: None
Definitions of recommended initializers:
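For reference, the standard definitions and the corresponding calls, in a PyTorch sketch (PyTorch is an assumption; the slide only gives the formulas):

```python
import torch.nn as nn

relu_layer = nn.Linear(256, 128)
tanh_layer = nn.Linear(256, 128)

# He (Kaiming) normal, recommended for ReLU layers:
#   W ~ Normal(0, std^2) with std = sqrt(2 / fan_in)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")

# Glorot (Xavier) normal, recommended for tanh layers:
#   W ~ Normal(0, std^2) with std = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_normal_(tanh_layer.weight)

nn.init.zeros_(relu_layer.bias)
nn.init.zeros_(tanh_layer.bias)
```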
Normalize Inputs
- Subtract the mean and divide by the standard deviation
- For images, it is fine to scale values to [0, 1] or [-0.5, 0.5]. Be careful: make sure your library doesn't already do it for you!
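A minimal NumPy sketch of input normalization; the arrays below are placeholders, and the statistics are computed on the training set only:

```python
import numpy as np

# Placeholder data; substitute your real arrays
x_train = np.random.rand(1000, 20)
x_val = np.random.rand(200, 20)
images = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

# Compute statistics on the training set only, then reuse them for val/test
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0) + 1e-8  # avoid division by zero

x_train_norm = (x_train - train_mean) / train_std
x_val_norm = (x_val - train_mean) / train_std

# For uint8 images it is fine to just scale to [0, 1]
images_norm = images.astype(np.float32) / 255.0
```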
Simplify the Problem
- Start with a small training set (~10,000 examples)
- Use a fixed number of objects, classes, image size, etc
- Create a simpler synthetic training set
Summary for Starting Simple
Implement & Debug
The Five Most Common DL Bugs
General advice for implementing your model:
Get Your Model to Run
Shape mismatch:
Casting issue: a common problem is data not being in float32; the most common causes are:
- Forgot to cast images from uint8 to float32
- Generated data using numpy in float64, forgot to cast to float32
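A sketch of the fixes for these two casting problems, assuming NumPy arrays feeding a float32 model:

```python
import numpy as np

# uint8 images (0-255) -> cast to float32 before feeding the model
images = np.zeros((32, 224, 224, 3), dtype=np.uint8)  # placeholder batch
images = images.astype(np.float32)

# NumPy generates float64 by default; cast down explicitly
noise = np.random.randn(32, 100)  # dtype is float64 by default
noise = noise.astype(np.float32)

assert images.dtype == np.float32 and noise.dtype == np.float32
```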
OOM:
Other common errors:
- Forgot to initialize variables
- Forgot to turn off bias when using batch norm
- “Fetch argument has invalid type”: usually you overwrote one of your ops with an output during training
Overfit a Single Batch
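A minimal PyTorch sketch of this check: hold one batch fixed and train until the loss goes to (near) zero; if it does not, something is wrong with the model, the loss, or the optimization. The model and batch here are placeholders, and a classification loss is assumed:

```python
import torch

def overfit_single_batch(model, batch, steps=500, lr=3e-4):
    """Train on one fixed batch; the loss should drive toward ~0 if the code is correct."""
    x, y = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()  # expect something close to 0
```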
Compare to a Known Result
Summary for Implement & Debug
Evaluate
Bias-variance Decomposition
Test error = irreducible error + bias + variance + distribution shift + val overfitting
The simpler decomposition (without the distribution-shift term) assumes train, val, and test all come from the same distribution; when they differ, use a “train-val” set drawn from the training distribution and a “test-val” set drawn from the test distribution to separate variance, distribution shift, and val overfitting.
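A worked example of reading the decomposition off the error numbers; all percentages below are made up for illustration:

```python
# Hypothetical error rates (%), following the decomposition above
human_error     = 1.0   # proxy for irreducible error
train_error     = 2.0
train_val_error = 5.0   # val set drawn from the training distribution
test_val_error  = 7.0   # val set drawn from the test distribution
test_error      = 8.0

bias               = train_error - human_error         # 1.0 -> under-fitting
variance           = train_val_error - train_error     # 3.0 -> over-fitting
distribution_shift = test_val_error - train_val_error  # 2.0
val_overfitting    = test_error - test_val_error       # 1.0
```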
Handling distribution shift
Improve Model/Data
Prioritizing improvements (i.e., applied bias-variance analysis), in the following order:
- Address under-fitting
- Address over-fitting
- Address distribution shift
- Re-balance datasets(if applicable)
Address Under-fitting (i.e., Reducing Bias)
Address Over-fitting (i.e., Reducing Variance)
Address Distribution Shift
Error analysis for a pedestrian detection problem:
Domain Adaptation
Domain adaptation is a set of techniques for training on a “source” distribution and generalizing to a “target” distribution using only unlabeled or limited labeled data.
When should you consider using it?
- Access to labeled data from test distribution is limited
- Access to relatively similar data is plentiful
Types of domain adaptation:
Re-balance Datasets (If Applicable)
If test-val looks significantly better than test, you have overfit to the val set; this happens with small val sets or lots of hyperparameter tuning. The solution is to recollect val data.
Tune Hyper-parameters
- Coarse-to-fine random searches
- Consider Bayesian hyperparameter optimization solutions as your codebase matures
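A sketch of a coarse-to-fine random search; the hyperparameter ranges and the `train_and_evaluate` function are placeholders:

```python
import math
import random

def sample_config(lr_range=(1e-5, 1e-1), batch_sizes=(32, 64, 128)):
    # Sample learning rate log-uniformly; other hyperparameters uniformly
    log_lo, log_hi = math.log10(lr_range[0]), math.log10(lr_range[1])
    return {
        "lr": 10 ** random.uniform(log_lo, log_hi),
        "batch_size": random.choice(batch_sizes),
    }

# Coarse stage: wide ranges, short training runs. Then narrow the ranges
# around the best configs and repeat with longer runs (the "fine" stage).
results = []
for _ in range(20):
    config = sample_config()
    val_metric = train_and_evaluate(config)  # placeholder training function
    results.append((val_metric, config))

best_metric, best_config = max(results, key=lambda r: r[0])
```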
Manual Hyperparameter Optimization
- For a skilled practitioner, this may require the least computation to get good results
- Requires detailed understanding of the algorithm
- Time-consuming
Grid Search
Random Search
Coarse-to-fine
Bayesian Hyperparameter Optimization
Testing & Deployment
Different tests:
Scoring the Test
Testing / CI
- Unit / Integration Tests
- Tests for individual module functionality and for the whole system
- Continuous Integration
- Tests are run every time new code is pushed to the repository, before updated model is deployed
- SaaS for CI
- CircleCI, Travis, Jenkins, Buildkite
- Containerization (via Docker)
Web Deployment
- REST API
- Serving predictions in response to canonically-formatted HTTP requests
- The web server is running and calling the prediction system
- Options
- Deploying code to VMs, scale by adding instances
- Deploy code as containers, scale via orchestration
- Deploy code as a “serverless function”
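A minimal sketch of the REST-API pattern using Flask (Flask and the `predict_fn` model wrapper are assumptions for illustration; any web framework works the same way):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_fn(features):
    """Placeholder for the real prediction system; returns a dummy score."""
    return {"score": 0.5}

@app.route("/predict", methods=["POST"])
def predict():
    # Canonically formatted request: JSON body with a "features" field
    payload = request.get_json(force=True)
    return jsonify(predict_fn(payload["features"]))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```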
Deploying Code to Cloud Instances
Cons:
- Provisioning can be brittle
- Paying for instances even when not using them (auto-scaling does help)
Deploying containers
Cons:
- Still managing your own servers and paying for uptime, not compute-time
Deploying Code as Serverless Functions
Cons:
- Entire deployment package has to fit within 500MB, <5 min execution, <3GB memory (on AWS Lambda)
- Only CPU execution
Model Serving
Web deployment options specialized for machine learning models
- TensorFlow Serving (Google)
- Model Server for MXNet (Amazon)
- Clipper (Berkeley RISE Lab)
- SaaS solutions like Algorithmia
Note:
- If you are doing CPU inference, you can get away with scaling by launching more servers or going serverless
- If using GPU inference, tools like TF Serving and Clipper become useful through adaptive batching, etc.
Hardware / Mobile
Problems:
- Embedded and mobile devices have little memory and slow/expensive compute
- Have to reduce network size / use tricks / quantize weights
- Mobile deep learning frameworks have fewer features than the full versions
- Have to adjust network architecture
Methods for Compression
- Parameter pruning
- Can remove correlated weights; can add sparsity constraints in training
- Introduce structure to convolution
- Knowledge distillation
- Train a deep net on the data, then train a shallower net to mimic the deep net's outputs
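A PyTorch sketch of a knowledge-distillation loss (temperature-softened KL divergence between teacher and student logits blended with the usual hard-label loss); the temperature and weighting values are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-label loss from the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1 - alpha) * soft
```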
MobileNets
TensorFlow Options
- TensorFlow Lite
- More recent solution; usually smaller, faster, and has fewer dependencies
- Only a limited set of operators, so not all models will work
- TensorFlow Mobile
- Has a fuller (but not full!) set of operators