Exploratory data analysis
EDA allows us to:
- Better understand the data
- Build an intuition about the data
- Generate hypotheses
- Find insights
Visualizations
- Visualization -> idea: Patterns lead to questions
- Idea -> Visualization: Hypothesis testing
With EDA we can:
- Get comfortable with the data
- Find magic features
Do EDA first. Do not immediately dig into modelling.
Build Intuition about the Data
- Get domain knowledge
- It helps to understand the problem more deeply
- Check if the data is intuitive
- And agrees with domain knowledge
- Understand how the data was generated
- As it is crucial to set up a proper validation
Exploring Anonymized Data
Try to Decode the Features
Guess the true meaning of the feature.
Exploring Individual Features: Guessing Types
Helpful functions (see the sketch below):
- df.dtypes
- df.info()
- x.value_counts()
- x.isnull()
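A minimal sketch of using these calls, assuming a pandas DataFrame df loaded from a hypothetical train.csv and an anonymized column x (placeholder names):

    import pandas as pd

    df = pd.read_csv('train.csv')      # hypothetical file name
    x = df['x']                        # an anonymized feature to inspect

    print(df.dtypes)                   # types pandas guessed for each column
    df.info()                          # types plus non-null counts and memory usage
    print(x.value_counts().head(10))   # most frequent values: hints at categorical vs numeric
    print(x.isnull().sum())            # number of missing values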
Visualization Tools
EDA is an art! And visualizations are our art tools.
Exploring Individual Features: Plotting
- Histograms: plt.hist(x)
- Plot: plt.plot(x, '.') (index vs values)
- plt.scatter(range(len(x)), x, c=y) (see the sketch after this list)
- Feature Statistics
- df.describe()
- x.mean()
- x.var()
- Other Tools
- x.value_counts()
- x.isnull()
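A minimal plotting and statistics sketch, assuming the same DataFrame df, a numeric feature x, and a target y (all placeholder names):

    import matplotlib.pyplot as plt

    plt.hist(x, bins=50)                  # value distribution; look for spikes at suspicious values
    plt.show()

    plt.plot(x, '.')                      # index vs. value; reveals sorting or repeated blocks
    plt.show()

    plt.scatter(range(len(x)), x, c=y)    # same plot, colored by target
    plt.show()

    print(df.describe())                  # count, mean, std, min/max, quartiles per column
    print(x.mean(), x.var())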
Exploring Feature Relations
- Pairs
- plt.scatter(feature_1, feature_2)
- pd.plotting.scatter_matrix(df)
- df.corr(), plt.matshow()
- Groups
- Corrplot + clustering
- df.mean().sort_values().plot(style='.') (see the sketch below)
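A sketch of these relation plots, assuming a numeric DataFrame df with columns feature_1 and feature_2 (placeholder names):

    import pandas as pd
    import matplotlib.pyplot as plt

    plt.scatter(df['feature_1'], df['feature_2'])   # pairwise relation between two features
    plt.show()

    pd.plotting.scatter_matrix(df)                  # all pairwise scatter plots at once
    plt.show()

    plt.matshow(df.corr())                          # correlation matrix as an image; blocks hint at feature groups
    plt.show()

    df.mean().sort_values().plot(style='.')         # sorted column means; nearby points often form feature groups
    plt.show()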
Dataset Cleaning and Other Things to Check
Dataset cleaning
- Constant Features: train_df.nunique(axis=0) == 1 (see the sketch below)
- Duplicated Features: train_df.T.drop_duplicates()
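A minimal cleaning sketch, assuming the training DataFrame is called train_df (placeholder name):

    # Drop constant columns (a single unique value across all rows)
    constant_cols = train_df.columns[train_df.nunique(axis=0) == 1]
    train_df = train_df.drop(columns=constant_cols)

    # Drop duplicated columns: transpose, drop duplicate rows, transpose back
    # (simple but memory-hungry on large datasets)
    train_df = train_df.T.drop_duplicates().T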
Other things to check
- Duplicated rows
- Check if same rows have same label
- Find duplicated rows, understand why they are duplicated
- Check if dataset is shuffled
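A sketch of these checks, assuming train_df with a target column named 'target' (hypothetical name):

    import matplotlib.pyplot as plt

    # Duplicated rows: count them and check whether duplicates share the same label
    feature_cols = [c for c in train_df.columns if c != 'target']
    dup_mask = train_df.duplicated(subset=feature_cols, keep=False)
    print('duplicated rows:', dup_mask.sum())
    labels_per_dup = train_df[dup_mask].groupby(feature_cols)['target'].nunique()
    print('duplicates with conflicting labels:', (labels_per_dup > 1).sum())

    # Shuffle check: a trend in the target vs. row index suggests the data is not shuffled
    train_df['target'].rolling(window=100).mean().plot()
    plt.show()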
Validation and Overfitting
Validation
- It helps us evaluate the quality of the model
- It helps us select the model which will perform best on unseen data
Underfitting VS Overfitting
- Underfitting refers to not capturing enough patterns in the data
- Generally, overfitting refers to:
- capturing noise
- capturing patterns which do not generalize to test data
- In competitions, overfitting refers to:
- low model quality on test data, which was unexpected given the validation scores
Validation Strategies
Never use data you train on to measure the quality of your model. The trick is to split all your data into training and validation parts.
- Holdout
- ngroups=1
- sklearn.model_selection.ShuffleSplit()
- Split train data into two parts: partA and partB
- Fit the model on partA, predict for partB.
- Use predictions for partB to estimate model quality. Find hyper-parameters such that the quality on partB is maximized
- K-fold
- ngroups=k
- sklearn.model_selection.KFold
- KFold is similar to Holdout repeated K times
- Split train data into K folds
- Iterate through each fold: retrain the model on all folds except the current fold, predict for the current fold
- Use the predictions to calculate quality on each fold. Find hyper-parameters such that the quality across folds is maximized (see the sketch after this list)
- Leave-one-out (LOO)
- ngroups=len(train)
- sklearn.model_selection.LeaveOneOut
- Iterate over samples: retrain the model on all samples except the current sample, predict for the current sample. You will need to retrain the model N times (where N is the number of samples in the dataset)
- In the end you will get LOO predictions for every sample in the train set and can calculate the loss
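A minimal K-fold sketch, assuming numpy arrays X and y and a stand-in sklearn model (Ridge here is only an example, not part of the original notes):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=1.0)                       # hyper-parameters to tune
        model.fit(X[train_idx], y[train_idx])          # retrain on all folds except the current one
        preds = model.predict(X[val_idx])              # predict for the current fold
        scores.append(mean_squared_error(y[val_idx], preds))
    print('mean CV score:', np.mean(scores))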
Note: once you have found the right hyper-parameters and want to get test predictions, don't forget to retrain your model using all training data.
Stratification
It preserves the same target distribution over different folds (see the sketch below). It's useful for:
- Small datasets
- Unbalanced datasets
- Multiclass datasets
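A sketch of stratified K-fold with sklearn, under the same X/y assumptions (y should be a classification target here):

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(X, y):   # y is passed so every fold keeps the target distribution
        pass                                     # fit and evaluate as in the K-fold sketch above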
Data Splitting Strategies
The validation should always mimic train/test split made by organizers.
The logic of feature generation depends on the data splitting strategy.
Different Approaches to Validation
Validation schemes based on different splitting strategies can differ significantly:
- in the generated features
- in the way the model will rely on those features
- in possible target leaks
Common Methods
- Random, row-wise
- This usually means that the rows are independent of each other
- Time-wise
- Split the data by a single date (see the sketch after this list).
- By ID
- Combined
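A minimal time-wise split sketch, assuming a DataFrame df with a 'date' column and an arbitrary cut-off date (both hypothetical):

    # Everything before the cut-off date is train, everything after is validation,
    # mimicking a test set that comes later in time than the train set
    cutoff = '2016-01-01'
    train_part = df[df['date'] < cutoff]
    valid_part = df[df['date'] >= cutoff]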
Problems Occurring during Validation
Validation Stage
Causes of scores and optimal parameters differing between splits:
- Too little data
- Too diverse and inconsistent data
We should do extensive validation:
- Average scores from different KFold splits (see the sketch below)
- Tune model on one split, evaluate score on the other
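A sketch of averaging scores over several K-fold splits with different random seeds (same X/y and stand-in model assumptions as before):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score

    seed_scores = []
    for seed in range(5):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed)
        scores = cross_val_score(Ridge(), X, y, cv=kf, scoring='neg_mean_squared_error')
        seed_scores.append(scores.mean())
    print('score across seeds: %.4f +/- %.4f' % (np.mean(seed_scores), np.std(seed_scores)))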
Submission Stage
We can observe that:
- LB score is consistently higher/lower than validation score
- LB score is not correlated with validation score at all.
Possible problem:
- Incorrect train/test split
- Too little data in public LB
- Train and test data are from different distributions
Data Leakages
Data Leakages in Time Series
- Split should be done on time
- In real life we don't have information from the future
- In competitions, the first thing to check: is the train/public/private split done by time?
- Even when split by time, features may contain information about future:
- User history in CTR tasks
- Weather
Unexpected Information
- Meta data
- Information in IDs
- Row order
Leaderboard Probing
- Types of LB probing
- Categories tightly connected with “id” are vulnerable to LB probing
Metrics Optimization
If your model is scored with some metric, you get the best results by optimizing exactly that metric.
Different metrics for different problems (see the sketch below).
Regression
- MSE, RMSE
- R-squared
- MAE
- (R)MSPE, MAPE
- (R)MSLE
Classification
- Accuracy, LogLoss, AUC
- Cohen’s (Quadratic weighted) Kappa
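A sketch of computing several of these metrics with sklearn, assuming regression arrays y_true/y_pred and classification arrays y_true_cls/y_pred_cls/y_prob (all placeholder names):

    import numpy as np
    from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                                 accuracy_score, log_loss, roc_auc_score, cohen_kappa_score)

    # Regression
    rmse  = np.sqrt(mean_squared_error(y_true, y_pred))
    mae   = mean_absolute_error(y_true, y_pred)
    r2    = r2_score(y_true, y_pred)
    rmsle = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))   # (R)MSLE via log1p

    # Classification (y_pred_cls: hard labels, y_prob: probability of the positive class)
    acc   = accuracy_score(y_true_cls, y_pred_cls)
    ll    = log_loss(y_true_cls, y_prob)
    auc   = roc_auc_score(y_true_cls, y_prob)
    kappa = cohen_kappa_score(y_true_cls, y_pred_cls, weights='quadratic')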