by chandan singh

a checklist of things that may improve your ml model's performance

data splitting

is there any dependence between data splits (e.g. temporal correlations) that would artificially inflate your accuracy estimates?
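
a minimal sketch of splits that respect such dependence, assuming scikit-learn is available. the toy data and the `groups` array (standing in for e.g. subject ids) are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# toy data: 12 time-ordered samples from 4 hypothetical subjects
X = np.arange(12).reshape(-1, 1)
groups = np.repeat([0, 1, 2, 3], 3)

# temporal dependence: every fold trains on the past and tests on the future
time_splits = list(TimeSeriesSplit(n_splits=3).split(X))
train_precedes_test = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in time_splits
)

# grouped dependence: each subject lands entirely in train or entirely in test
group_splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
no_group_leak = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in group_splits
)

print(train_precedes_test, no_group_leak)
```

a plain shuffled split would mix neighboring time points or the same subject across train and test, which is the kind of leakage the question above is probing for.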

visualizing data

look at histograms of outcomes / key features
see if features can be easily reduced to lower dimensions (e.g. pca, lda)
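
a rough sketch of both checks, assuming numpy and scikit-learn. the data is synthetic (10 features driven by 2 latent factors), so the numbers only illustrate the workflow.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic stand-in data: 10 observed features driven by 2 latent factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
y = latent[:, 0] ** 2  # a skewed outcome

# text histogram of the outcome (a plotting library works just as well)
counts, edges = np.histogram(y, bins=8)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:5.2f}, {right:5.2f}) {'#' * (int(count) // 4)}")

# cumulative share of variance captured by the leading principal components
pca = PCA(n_components=5).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))
```

if the cumulative variance saturates after a few components, the features likely live near a low-dimensional subspace.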

preprocessing

normalize features and output (compute the normalization statistics on the training set only)
balance the data (e.g. random over-/undersampling, random sampling + ensembling, smote)
do feature selection with simple screening (e.g. variance, correlation)
do feature selection using a model (e.g. tree, lasso)
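
the steps above could look roughly like this, assuming numpy and scikit-learn. the data, the lasso penalty `alpha=0.1`, and the 90/10 class ratio are arbitrary choices for illustration, and smote is not shown.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 6))
X[:, 5] = 1.0  # constant feature, carries no information
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)

# simple screening: drop zero-variance features
X_screened = VarianceThreshold(threshold=0.0).fit_transform(X)

# normalize features and output (on real data, fit these on the training set only)
X_scaled = StandardScaler().fit_transform(X_screened)
y_scaled = (y - y.mean()) / y.std()

# model-based selection: keep features whose lasso coefficient is nonzero
lasso = Lasso(alpha=0.1).fit(X_scaled, y_scaled)
selected = np.flatnonzero(lasso.coef_ != 0)
print("selected features:", selected)

# balancing: randomly oversample the minority class of a 90/10 label vector
labels = np.array([0] * 90 + [1] * 10)
minority = np.flatnonzero(labels == 1)
extra = rng.choice(minority, size=80, replace=True)
balanced_idx = np.concatenate([np.arange(len(labels)), extra])
print("class counts after oversampling:", np.bincount(labels[balanced_idx]))
```

resampling, like scaling, is usually applied to the training split only so that the test set keeps its original distribution.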

debugging

can the model achieve zero training error on a single example? (if not, suspect a bug in the model or training code)
how do simple baselines (e.g. linear models, decision trees) perform?
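
a small sketch of both checks, assuming numpy and scikit-learn. the hand-written gradient descent stands in for whatever training loop you are debugging, and the dataset is synthetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# check 1: gradient descent on one example should drive the squared error to ~0
x_one, y_one = rng.normal(size=5), 2.0
w = np.zeros(5)
step = 0.25 / (x_one @ x_one)  # with this step size the error halves each update
for _ in range(50):
    error = w @ x_one - y_one
    w -= step * 2 * error * x_one
single_example_loss = (w @ x_one - y_one) ** 2
print("loss on the single example:", single_example_loss)

# check 2: compare simple baselines with cross-validation
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
models = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

a complex model that cannot beat these baselines, or a training loop that cannot fit one example, points to a bug rather than a modeling problem.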

feature engineering

visualize the outputs of dim reduction / transformations (e.g. pca, sparse coding, nmf) on your features
for correlated features, group them together or extract a more stable feature (e.g. their mean or first principal component)
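
one way to sketch the grouping idea with numpy alone. the data is synthetic, the 0.3 correlation threshold is an arbitrary assumption, and averaging is only one of several ways to combine a group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)

# three noisy measurements of the same underlying signal plus one unrelated feature
X = np.column_stack([
    signal + rng.normal(size=n),
    signal + rng.normal(size=n),
    signal + rng.normal(size=n),
    rng.normal(size=n),
])

# group the features that correlate with feature 0 above the chosen threshold
corr = np.corrcoef(X, rowvar=False)
group = np.flatnonzero(np.abs(corr[0]) > 0.3)

# extract a more stable feature by averaging the group
combined = X[:, group].mean(axis=1)


def corr_with_signal(values):
    """Pearson correlation between a feature and the underlying signal."""
    return np.corrcoef(values, signal)[0, 1]


individual = [corr_with_signal(X[:, j]) for j in group]
combined_corr = corr_with_signal(combined)
print("group:", group)
print("individual:", np.round(individual, 3), "combined:", round(combined_corr, 3))
```

averaging cancels part of the independent noise, so the combined feature tracks the shared signal more closely than any single member of the group.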

modeling

try simple rules to cover most of the cases
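
a hedged sketch of what this might look like, assuming numpy and scikit-learn. the `rule` function, its 0.5 threshold, and the logistic-regression fallback for uncovered cases are all hypothetical choices, not something prescribed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic data: the label follows feature 0 when it is far from zero,
# and feature 1 otherwise
X = rng.normal(size=(1000, 2))
far_from_zero = np.abs(X[:, 0]) > 0.5
y = np.where(far_from_zero, X[:, 0] > 0, X[:, 1] > 0).astype(int)


def rule(X):
    """Hypothetical hand-written rule: returns (covered mask, predictions)."""
    covered = np.abs(X[:, 0]) > 0.5
    prediction = (X[:, 0] > 0).astype(int)
    return covered, prediction


covered, rule_pred = rule(X)
coverage = covered.mean()
rule_accuracy = (rule_pred[covered] == y[covered]).mean()

# fit a fallback model only on the cases the rule does not cover
fallback = LogisticRegression().fit(X[~covered], y[~covered])
final_pred = np.where(covered, rule_pred, fallback.predict(X))
overall_accuracy = (final_pred == y).mean()  # training accuracy, for illustration

print(f"coverage: {coverage:.2f}, rule accuracy: {rule_accuracy:.2f}")
print(f"overall accuracy: {overall_accuracy:.2f}")
```

tracking coverage and accuracy separately shows how much of the problem the rule already solves before any model is involved.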