How to debug your machine learning system


Building a first machine learning system is often difficult for a newcomer. So, I want to write down some notes from my ML classes to give me some cues for constructing and debugging such a system.

General advice for debugging a learning algorithm

  • Get more training examples (does not always help)
  • Try smaller sets of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing/increasing the regularization parameter λ
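Several of these knobs can be explored numerically. Below is a minimal sketch on toy data, using closed-form ridge regression as a stand-in for your learning algorithm (the data, polynomial degree, and λ values are all made up for illustration), showing how the regularization strength moves training and cross-validation error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: cubic ground truth plus noise (illustration only).
X = rng.uniform(-2, 2, size=(60, 1))
y = X[:, 0] ** 3 + rng.normal(scale=0.5, size=60)

# Split into train and cross-validation sets.
X_train, y_train = X[:40], y[:40]
X_cv, y_cv = X[40:], y[40:]

def poly_features(X, degree):
    """Expand a single feature into polynomial features up to `degree`."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

def ridge_fit(A, y, lam):
    """Closed-form ridge regression: minimizes ||A w - y||^2 + lam * ||w||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Compare train vs cross-validation error for a few regularization strengths.
for lam in (0.0, 1.0, 100.0):
    A_train = poly_features(X_train, degree=8)
    A_cv = poly_features(X_cv, degree=8)
    w = ridge_fit(A_train, y_train, lam)
    train_err = np.mean((A_train @ w - y_train) ** 2)
    cv_err = np.mean((A_cv @ w - y_cv) ** 2)
    print(f"lambda={lam:>6}: train MSE={train_err:.3f}, cv MSE={cv_err:.3f}")
```

Training error always worsens (or stays the same) as λ grows, while cross-validation error often improves first and then worsens; plotting both against λ is a quick way to see which regime you are in.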

Debugging your spam classifier

General advice

  • Collect lots of data
  • Develop sophisticated features based on email routing information (from email header)
  • Develop sophisticated features for the message body, features about punctuation
  • Develop a sophisticated algorithm to detect misspellings

It is difficult to tell which of the options will be most helpful.

Error analysis

Recommended approach

  • Start with a simple algorithm that you can implement quickly, then test it on your cross-validation data

  • Plot learning curves to decide if more data, more features, etc. are likely to help

  • Error analysis: Manually examine the examples (in the cross-validation set) on which your algorithm made errors. See if you spot any systematic trend in the type of examples it makes errors on.

Error analysis may suggest which change is likely to improve performance, but often the only way to know is to try it and see whether it works.
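A common way to run error analysis is to hand-label each misclassified cross-validation example with an error category and then tally the counts; the categories and counts below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical labels for misclassified emails found in the cross-validation
# set; in practice you inspect each error by hand and tag it yourself.
misclassified = [
    "pharma spam", "phishing", "pharma spam", "deliberate misspellings",
    "phishing", "pharma spam", "unusual routing", "pharma spam",
]

# Count how often each error category occurs to spot systematic trends.
counts = Counter(misclassified)
for category, n in counts.most_common():
    print(f"{category}: {n} of {len(misclassified)} errors")
```

If one category dominates the tally, effort spent on that category (e.g. better routing features) has the largest possible payoff; a rare category caps how much any fix for it can help.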

Error metrics for skewed classes

  • Accuracy = (true positives + true negatives) / (total examples)
  • Precision = (true positives) / (true positives + false positives)
  • Recall (or Sensitivity) = (true positives) / (true positives + false negatives)
  • F1 score (F score) = $\frac{2PR}{P+R} = \frac{2}{1/P+1/R}$
  • Specificity = (true negatives) / (true negatives + false positives)
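These definitions fit in one small helper. The confusion-matrix counts below are made up to show the skewed-class problem: accuracy looks excellent while precision and recall tell a different story.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical skewed-class example: 1000 examples, only 15 true positives.
m = classification_metrics(tp=10, fp=5, fn=5, tn=980)
print(m)  # accuracy 0.99, but precision and recall are only ~0.67
```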

Trading off precision and recall

Use a single-number evaluation metric: a dev set plus a single real-number metric makes it quick to compare algorithms.

Algorithm   US    China   India   Other   Average
A           3%    7%      5%      9%      6%
B           5%    6%      5%      10%     6.5%
C           2%    3%      4%      5%      3.5%
D           5%    8%      7%      2%      5.5%
E           4%    5%      2%      4%      3.75%
F           7%    11%     8%      12%     9.5%
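Collapsing the per-region error rates from the table above into a single average makes the comparison mechanical:

```python
# Error rates per region (US, China, India, Other) for each algorithm,
# taken from the table above.
errors = {
    "A": [0.03, 0.07, 0.05, 0.09],
    "B": [0.05, 0.06, 0.05, 0.10],
    "C": [0.02, 0.03, 0.04, 0.05],
    "D": [0.05, 0.08, 0.07, 0.02],
    "E": [0.04, 0.05, 0.02, 0.04],
    "F": [0.07, 0.11, 0.08, 0.12],
}

# Collapse the four regional errors into one number per algorithm.
average = {name: sum(e) / len(e) for name, e in errors.items()}
best = min(average, key=average.get)
print(best, average[best])  # C has the lowest average error (3.5%)
```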

Recall measures what fraction of the actual positive examples the classifier correctly identifies; low recall means many positives are wrongly rejected.

Precision measures what fraction of the positive predictions are actually correct.
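The trade-off between the two comes from the decision threshold: raising it makes positive predictions more conservative, which raises precision and lowers recall. A small sketch with made-up scores and labels:

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Sweep the threshold: higher threshold -> higher precision, lower recall.
for t in (0.25, 0.5, 0.85):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```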

Satisficing and optimizing metrics

  • e.g. maximize accuracy subject to running time ≤ 100 ms
  • With N metrics: optimize 1, satisfice the other N−1
  • Example: wake-words/trigger words

Classifier   Accuracy   Running time
A            90%        80 ms
B            92%        95 ms
C            95%        1500 ms
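The selection rule for the table above can be sketched directly — filter by the satisficing metric, then maximize the optimizing one:

```python
# Accuracy and running time (ms) for each classifier, from the table above.
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

# Satisficing metric: running time must be <= 100 ms.
feasible = {k: v for k, v in classifiers.items() if v[1] <= 100}

# Optimizing metric: among the feasible classifiers, maximize accuracy.
best = max(feasible, key=lambda k: feasible[k][0])
print(best)  # B: highest accuracy among classifiers meeting the constraint
```

C has the best raw accuracy but fails the satisficing constraint, so it is never considered; this is exactly why the constraint is stated separately instead of being folded into one score.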

When to change dev/test sets and metrics

Metric: classification error
Algorithm A: 3% error, but it lets pornographic images through (the metric + dev set prefer A)
Algorithm B: 5% error (you/your users prefer B)
To penalize the unacceptable errors, switch to a weighted error: $$\text{Error} = \frac{1}{m_{\text{dev}}} \sum^{m_{\text{dev}}}_{i=1} w^{(i)}\, \mathbb{1}\{ y^{(i)}_{\text{predict}} \neq y^{(i)} \}$$

$$w^{(i)} =
\begin{cases}
1 & \text{if } x^{(i)} \text{ is non-porn} \\
10 & \text{if } x^{(i)} \text{ is porn}
\end{cases}$$


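As a small sketch of the weighted-error idea, with dev-set counts made up for illustration:

```python
# Hypothetical dev-set results: (is_porn, prediction_wrong) pairs.
examples = ([(False, False)] * 90 + [(False, True)] * 3
            + [(True, True)] * 2 + [(True, False)] * 5)

def weighted_error(examples, porn_weight=10):
    """Weighted error as in the formula above: porn mistakes count 10x."""
    total = sum(porn_weight if is_porn else 1
                for is_porn, wrong in examples if wrong)
    return total / len(examples)

plain_error = sum(wrong for _, wrong in examples) / len(examples)
print(f"unweighted error: {plain_error:.3f}")              # 0.050
print(f"weighted error:   {weighted_error(examples):.3f}")  # 0.230
```

Two porn mistakes out of 100 examples barely register in the unweighted error, but dominate the weighted one — which is the behavior you want if such errors are unacceptable.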
  • Place the target: first define a metric that captures what you care about
  • Aim/shoot at the target: then tune the system to do well on that metric

If doing well on your metric + dev /test set does not correspond to doing well on your application, change your metric or dev/test set.

Data for machine learning

  • Use a learning algorithm with many parameters (e.g. a neural network with many hidden layers) — low bias
  • Use a relatively large training set — low variance
  • Choose a dev set and test set to reflect data you expect to get in the future and consider it essential to do well on.

Set your test set to be big enough to give high confidence in the overall performance of your system. In some applications it may even be acceptable to have no separate test set (train + dev only), though this is riskier.
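For instance, with a hypothetical pool of 10,000 examples, a 98/1/1 split (the ratio is an assumption for illustration) still leaves 100 examples each for the dev and test sets:

```python
import random

random.seed(0)

# Hypothetical pool of 10,000 examples; shuffle before splitting so that
# train, dev, and test all reflect the same distribution.
data = list(range(10_000))
random.shuffle(data)

n = len(data)
train = data[: int(0.98 * n)]
dev = data[int(0.98 * n): int(0.99 * n)]
test = data[int(0.99 * n):]
print(len(train), len(dev), len(test))  # 9800 100 100
```

If the data you expect in production differs from this pool, the dev and test sets should instead be drawn from the production-like distribution, per the bullet above.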


[1] Andrew Ng, Machine Learning

[2] Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization