by chandan singh
a checklist of things that may improve your ml model performance
data splitting
- is there any dependence across data splits (e.g. temporal correlations between train and test points) that would artificially inflate your accuracy estimates?
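
one way to guard against temporal leakage is to split by time rather than at random. the sketch below is a minimal illustration, not a prescribed method: `temporal_split`, `test_fraction`, and `gap` are hypothetical names, and the toy timestamps are made up.

```python
import numpy as np

def temporal_split(timestamps, test_fraction=0.2, gap=0):
    """Return (train_idx, test_idx) with every test sample later than every train sample.

    `gap` drops that many of the latest training samples, leaving a buffer
    between the two splits (an assumption; tune or remove for your data).
    """
    timestamps = np.asarray(timestamps)
    order = np.argsort(timestamps, kind="stable")  # indices from oldest to newest
    n_test = max(1, int(round(len(order) * test_fraction)))
    test_idx = order[-n_test:]
    train_idx = order[: len(order) - n_test - gap]
    return train_idx, test_idx

# toy timestamps in arbitrary order
timestamps = np.array([5, 1, 4, 2, 8, 7, 3, 6, 10, 9])
train_idx, test_idx = temporal_split(timestamps, test_fraction=0.2, gap=1)
print("train times:", np.sort(timestamps[train_idx]))
print("test times:", np.sort(timestamps[test_idx]))
```

if the dependence comes from groups (e.g. several rows per subject) rather than time, the same idea applies: keep every row of a group in a single split.
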
visualizing data
- look at histograms of outcomes / key features
- see if features can be easily reduced to lower dimensions (e.g. pca, lda)
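
a quick way to check whether features reduce to a few dimensions is to look at how much variance the top principal components explain. this is a minimal numpy sketch of pca via the svd; the function name and the synthetic data (2 latent factors mixed into 10 features) are assumptions for illustration only.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # singular values of the centered data; their squares are proportional
    # to the variance along each principal direction
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# synthetic data: 10 observed features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
print(np.round(cumulative, 3))
```

if the cumulative curve saturates after a handful of components, the features are probably easy to compress.
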
preprocessing
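- normalize features and outputs (e.g. zero mean, unit variance), using statistics computed on the training split only
- balance the data (e.g. random over- or undersampling, random sampling + ensembling, smote)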
- do feature selection with simple screening (e.g. variance, correlation)
- do feature selection using a model (e.g. tree, lasso)
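
simple screening can be sketched in a few lines: drop near-constant features, then rank the rest by absolute correlation with the outcome. the function name, thresholds, and toy data below are hypothetical; in practice this should be run on the training split only so the selection does not leak test information.

```python
import numpy as np

def screen_features(X, y, min_variance=1e-8, top_k=5):
    """Return indices of the top_k features by |Pearson correlation| with y,
    after removing features whose variance is at most min_variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.flatnonzero(X.var(axis=0) > min_variance)  # variance screen
    X_centered = X[:, keep] - X[:, keep].mean(axis=0)
    y_centered = y - y.mean()
    corr = (X_centered.T @ y_centered) / (
        np.sqrt((X_centered ** 2).sum(axis=0)) * np.sqrt((y_centered ** 2).sum())
    )
    ranked = keep[np.argsort(-np.abs(corr))]  # strongest correlation first
    return ranked[:top_k]

# toy data: only features 0 and 2 drive the outcome; feature 4 is constant
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 4] = 1.0
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=300)

selected = screen_features(X, y, top_k=2)
print("selected features:", selected)
```

correlation screening only sees marginal, linear relationships, which is one reason to also try model-based selection (e.g. tree, lasso).
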
debugging
- can the model achieve zero training error on a single example?
- how do simple baselines (e.g. linear models, decision trees) perform?
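
both checks above can be sketched with a tiny linear model trained by gradient descent. this is an illustrative stand-in, not any particular model: `fit_linear_gd`, the learning rate, and the synthetic data are all assumptions.

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.1, n_steps=500):
    """Fit y ~ X @ w + b by plain gradient descent on mean squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        residual = X @ w + b - y
        w -= lr * (X.T @ residual) / len(y)
        b -= lr * residual.mean()
    return w, b

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

# check 1: the training loop should drive the error on one example to ~0
x_single = np.array([[0.5, -1.0, 2.0]])
y_single = np.array([3.0])
w, b = fit_linear_gd(x_single, y_single)
single_example_error = mse(x_single @ w + b, y_single)

# check 2: compare a trivial baseline (predict the training mean)
# with a linear baseline on held-out data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

mean_baseline_mse = mse(np.full(len(y_test), y_train.mean()), y_test)
w, b = fit_linear_gd(X_train, y_train)
linear_baseline_mse = mse(X_test @ w + b, y_test)

print("single-example error:", single_example_error)
print("mean baseline mse:", mean_baseline_mse)
print("linear baseline mse:", linear_baseline_mse)
```

if a complex model cannot fit a single example, the bug is likely in the training loop or data pipeline; if it cannot beat these baselines, the extra complexity is not yet paying off.
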
feature engineering
- visualize the outputs of dim reduction / transformations (e.g. pca, sparse coding, nmf) on your features
- for correlated features, group them together or extract a single, more stable feature (e.g. their average)
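
one possible way to handle correlated features is to link any pair whose absolute correlation exceeds a threshold, then replace each resulting group with the average of its standardized members. the sketch below assumes no constant features; the function names and the 0.9 threshold are hypothetical choices.

```python
import numpy as np

def group_correlated(X, threshold=0.9):
    """Group feature indices connected by |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)
    n = corr.shape[0]
    parent = list(range(n))  # union-find forest over features

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def merge_groups(X, groups):
    """Replace each group by the mean of its z-scored, sign-aligned members."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(X, rowvar=False)
    merged = []
    for g in groups:
        signs = np.sign(corr[g[0], g])  # flip members anti-correlated with the first
        merged.append((Z[:, g] * signs).mean(axis=1))
    return np.column_stack(merged)

# toy data: features 0, 1, 2 are noisy copies of one signal; feature 3 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base,
    base + 0.05 * rng.normal(size=500),
    -base + 0.05 * rng.normal(size=500),
    rng.normal(size=500),
])

groups = group_correlated(X, threshold=0.9)
X_merged = merge_groups(X, groups)
print(groups, X_merged.shape)
```

averaging is only one choice of "stable feature"; the first principal component of each group would be another.
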
modeling
- try simple rules to cover most of the cases
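
to judge whether a simple rule is worth keeping, it helps to measure how many cases it covers and how accurate it is on those cases. the helper below is a hypothetical sketch on made-up toy data, with a hand-picked threshold rule.

```python
import numpy as np

def rule_report(rule_mask, y, predicted_label):
    """Return (coverage, accuracy) for a rule that predicts `predicted_label`
    wherever `rule_mask` is True."""
    rule_mask = np.asarray(rule_mask, dtype=bool)
    y = np.asarray(y)
    coverage = float(rule_mask.mean())  # fraction of cases the rule fires on
    if rule_mask.any():
        accuracy = float((y[rule_mask] == predicted_label).mean())
    else:
        accuracy = float("nan")  # rule never fires
    return coverage, accuracy

# toy data: one feature x and a binary label y
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

coverage, accuracy = rule_report(x > 0.5, y, predicted_label=1)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

cases the rules do not cover can then be handed to a more flexible model.
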