YOLO algorithm

Last updated on:12 days ago

YOLO (You Only Look Once) is a fast object detection algorithm that is widely used in autonomous driving.


YOLO is an image classification and localization algorithm. It divides the image into a grid, takes the midpoint of each object, and assigns the object to the grid cell containing that midpoint.
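The midpoint rule can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `assign_to_grid_cell` is hypothetical, and the midpoint coordinates are assumed to be normalized to the range 0 to 1 relative to the image.

```python
def assign_to_grid_cell(x_mid, y_mid, grid_size=3):
    """Return the (row, col) of the grid cell that contains the midpoint.

    Assumption: x_mid and y_mid are normalized to [0, 1] relative to the
    image width and height.
    """
    # Scale to grid units and truncate; clamp so that a midpoint lying
    # exactly on the right or bottom edge still falls in the last cell.
    col = min(int(x_mid * grid_size), grid_size - 1)
    row = min(int(y_mid * grid_size), grid_size - 1)
    return row, col


if __name__ == "__main__":
    print(assign_to_grid_cell(0.5, 0.5))      # centre of a 3 x 3 grid
    print(assign_to_grid_cell(0.5, 0.5, 19))  # centre of a 19 x 19 grid
```

Because only the midpoint is used, an object that spans several cells is still assigned to exactly one of them.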

For each grid cell:

$$L(\hat{y}, y) = \begin{cases} (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_8 - y_8)^2, & \text{if } y_1 = 1 \\ (\hat{y}_1 - y_1)^2, & \text{if } y_1 = 0 \end{cases}$$
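The piecewise loss can be written as a short function. This is a sketch under one assumption: the first element of the 8-element label vector ($y_1$, index 0 in the code) indicates whether an object is present. The name `cell_loss` is hypothetical.

```python
def cell_loss(y_hat, y):
    """Squared-error loss for a single grid cell.

    Assumption: y[0] is the object-present flag (y_1 in the formula) and
    the remaining 7 entries hold the box and class values.
    """
    if y[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    # No object: only the object-present component is penalized.
    return (y_hat[0] - y[0]) ** 2


if __name__ == "__main__":
    target = [0, 0, 0, 0, 0, 0, 0, 0]
    prediction = [0.5, 9, 9, 9, 9, 9, 9, 9]
    print(cell_loss(prediction, target))  # the other 7 values are ignored
```

When $y_1 = 0$ the remaining components are "don't care" values, so they are left out of the loss.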

Each object is assigned to only one grid cell. In practice, a finer grid such as $19 \times 19$ may be used, giving a $19 \times 19 \times 8$ output volume.



Advantages of the YOLO algorithm:

  • Outputs precise bounding boxes
  • Outputs much more precise coordinates, which are not dictated by the stride size of a sliding windows classifier
  • It is a single convolutional implementation, so the network runs once over the whole image instead of once per grid cell
  • Runs very fast, so it works even for real-time object detection

Specify the bounding boxes

Intersection over union (IoU)

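IoU is used to evaluate object localization. It measures the overlap between two bounding boxes: the area of their intersection divided by the area of their union. When evaluating a detector, the two boxes are the predicted bounding box and the ground truth box.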
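A minimal sketch of the IoU computation follows. The corner format `(x1, y1, x2, y2)` is an assumption made for this example; the notes do not fix a box representation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Assumption: each box is a tuple (x1, y1, x2, y2) with x1 < x2 and
    y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

The value ranges from 0 for disjoint boxes to 1 for identical boxes.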

Non-max suppression

Discard all boxes with $p_c \le 0.6$.
While there are any remaining boxes:

  • Pick the box with the largest $p_c$ and output it as a prediction (this assumes a single class, e.g. car detection only)

  • Discard any remaining box with IoU $\ge 0.5$ with the box output in the previous step
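The loop above could be sketched as follows for a single class. The function names, the corner box format `(x1, y1, x2, y2)`, and the choice to return indices are assumptions of this example; the thresholds 0.6 and 0.5 are the ones used in these notes.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes (assumed corner format)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Step 1: discard all boxes with p_c <= score_threshold.
    remaining = [i for i, s in enumerate(scores) if s > score_threshold]
    remaining.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while remaining:
        # Step 2: output the remaining box with the largest p_c.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: discard boxes whose IoU with that box is >= the threshold.
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
    scores = [0.9, 0.8, 0.7, 0.5]
    print(non_max_suppression(boxes, scores))
```

In this usage example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the last box is dropped by the score threshold before the loop starts.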

Outputting the non-max suppressed outputs

  • For each grid cell, get 2 predicted bounding boxes (one per anchor box)
  • Get rid of low-probability predictions
  • For each class (pedestrian, car, motorcycle), run non-max suppression independently to generate the final predictions
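The per-class step can be sketched by grouping the predictions by class before suppressing. This is an illustration only: the function name `detect`, the `(box, score, class_name)` tuple layout, the corner box format, and the default thresholds are all assumptions of this example.

```python
from collections import defaultdict


def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes (assumed corner format)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(predictions, score_threshold=0.6, iou_threshold=0.5):
    """predictions: list of (box, score, class_name) tuples."""
    # Get rid of low-probability predictions and group the rest by class.
    by_class = defaultdict(list)
    for box, score, cls in predictions:
        if score > score_threshold:
            by_class[cls].append((score, box))

    final = []
    for cls, items in by_class.items():
        # Run non-max suppression independently within each class.
        items.sort(key=lambda item: item[0], reverse=True)
        while items:
            score, box = items.pop(0)
            final.append((box, score, cls))
            items = [it for it in items if _iou(box, it[1]) < iou_threshold]
    return final


if __name__ == "__main__":
    preds = [
        ((0, 0, 10, 10), 0.9, "car"),
        ((1, 1, 11, 11), 0.8, "car"),
        ((0, 0, 10, 10), 0.85, "pedestrian"),
        ((50, 50, 60, 60), 0.3, "motorcycle"),
    ]
    for detection in detect(preds):
        print(detection)
```

Because suppression is done per class, the pedestrian box survives even though it coincides with the car box.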


6. Suppose you run non-max suppression on the predicted boxes above. The parameters you use for non-max suppression are that boxes with probability $\le 0.4$ are discarded, and the IoU threshold for deciding if two boxes overlap is 0.5. How many boxes will remain after non-max suppression?

Answer: 5

This is because the boxes for tree 0.74 and tree 0.46 do not overlap, so neither one suppresses the other.

Anchor boxes

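Each object in a training image is assigned to the grid cell that contains the object’s midpoint and, within that cell, to the anchor box with the highest IoU with the object’s shape. With 2 anchor boxes and 8 values per box, the output volume for a $3 \times 3$ grid is:
$$3 \times 3 \times 16 = 3 \times 3 \times 2 \times 8$$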
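The assignment and the resulting 16 values per cell can be sketched as below. Several details are assumptions made for illustration and are not stated in these notes: the anchor shapes, the per-box layout `[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]`, the use of image-normalized coordinates, and the function names.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by shape only (as if sharing a centre)."""
    (wa, ha), (wb, hb) = wh_a, wh_b
    inter = min(wa, wb) * min(ha, hb)
    return inter / (wa * ha + wb * hb - inter)


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape has the highest IoU with the box."""
    return max(range(len(anchors)), key=lambda k: shape_iou(box_wh, anchors[k]))


def encode_labels(objects, anchors, grid_size=3, num_classes=3):
    """Build a grid_size x grid_size x (len(anchors) * 8) label volume.

    objects: list of (x_mid, y_mid, width, height, class_index), all
    normalized to the image (an assumption of this sketch).
    """
    per_box = 5 + num_classes  # p_c + 4 box values + one-hot class = 8
    grid = [[[0.0] * (len(anchors) * per_box) for _ in range(grid_size)]
            for _ in range(grid_size)]
    for x_mid, y_mid, w, h, cls in objects:
        # Grid cell containing the midpoint.
        row = min(int(y_mid * grid_size), grid_size - 1)
        col = min(int(x_mid * grid_size), grid_size - 1)
        # Anchor box with the highest IoU for this object's shape.
        k = best_anchor((w, h), anchors)
        one_hot = [1.0 if c == cls else 0.0 for c in range(num_classes)]
        start = k * per_box
        grid[row][col][start:start + per_box] = [1.0, x_mid, y_mid, h, w] + one_hot
    return grid


if __name__ == "__main__":
    anchors = [(0.2, 0.6), (0.6, 0.2)]  # hypothetical tall and wide anchors
    objects = [(0.5, 0.5, 0.15, 0.5, 0),   # tall object, class 0
               (0.5, 0.5, 0.5, 0.25, 1)]   # wide object, class 1
    labels = encode_labels(objects, anchors)
    print(len(labels), len(labels[0]), len(labels[0][0]))
    print(labels[1][1])
```

Here both objects share the centre cell, but the tall one fills the first 8 values and the wide one the last 8, which is how anchor boxes let one cell describe two objects.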

Region proposals (optional): R-CNN

R-CNN stands for regions with convolutional neural networks.
It runs a segmentation algorithm to propose regions that might contain objects.
About 2000 regions are proposed per image, so it is quite slow.
R-CNN: Propose regions. Classify the proposed regions one at a time. Output a label and a bounding box.
Fast R-CNN: Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions.
Faster R-CNN: Use a convolutional network to propose regions (still slower than the YOLO algorithm).


[1] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).

[2] Deeplearning.ai, Convolutional Neural Networks