A Potential “Gotcha” When Building Your Own YOLOv3

If you are interested in computer vision, you have probably thought about, attempted or accomplished building your own object detection network.  It’s a good challenge once you are comfortable with a deep learning library and have wrapped your head around classification models.

One great choice for your first attempt at building an object detection model is YOLOv3.  It works pretty well, its components aren’t too complicated, and there are tons of great resources on it.  On top of that, several YOLOv3 implementations are already floating around on GitHub if you hit a roadblock.

However, one part of building a YOLOv3 model that can catch you off guard, particularly if you have never worked with object detection models before, is handling anchors.  Anchors themselves are not a super complicated concept.  What might throw you off, and what definitely threw me off initially, is that your ground truth values are likely just a list of bounding boxes, while YOLOv3 outputs a prediction for every prior, that is, for every anchor at every grid cell of every feature map.

If you are working in Python, this means that your ground truth probably looks like a list of 4-element lists (or maybe tensors).  Your predictions, on the other hand, are probably an N x M tensor, where N is the total number of priors across all anchors and feature maps, and M is 4 (for the bounding box coordinates) plus 1 (for the objectness score) plus the number of classes.  If you were working with the COCO dataset and its 80 classes, M would be 85.
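To make the size of the prediction tensor concrete, here is a small sketch that counts its rows and columns.  The 416 x 416 input, the strides of 32, 16 and 8, and the 3 anchors per feature map are the usual YOLOv3 defaults rather than anything specific to your model, and `prediction_shape` is just a hypothetical helper name.

```python
def prediction_shape(image_size=416, strides=(32, 16, 8),
                     anchors_per_scale=3, num_classes=80):
    """Return (N, M) for a YOLOv3-style output under the assumed defaults."""
    rows = 0
    for stride in strides:
        grid = image_size // stride                # cells along one side of this feature map
        rows += grid * grid * anchors_per_scale    # one prediction per anchor per cell
    cols = 4 + 1 + num_classes                     # box coordinates + objectness + class scores
    return rows, cols


print(prediction_shape())  # (10647, 85)
```

With these assumed settings, a handful of ground truth boxes has to be compared against more than ten thousand prediction rows.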

This brings up a problem: you have only a handful of ground truth values, but your model output has thousands of rows.  You can’t directly apply a loss function to two such different data formats, so you need to get them into the same format somehow.  This leads to the root of the problem: your targets need to be “anchorized”, or matched to their respective anchors, when training, and your predictions need to be “de-anchorized”, or reverted back into a list of bounding boxes, when running inference.

A Proposed Additional Component for Your Model

The solution to this is fairly straightforward.  You’ll just need another component between your data loader and your model that “anchorizes” the targets.  Likewise, when running inference, you will need to grab only the rows that predict an object and convert them back into a list of bounding boxes.

There are a variety of ways to go about building these components, and I can’t profess to have the perfect answer.  For “anchorizing”, one thing you could consider is finding, for each ground truth box, the prior with the highest IoU score and using that row as the location for the target.  “De-anchorization”, on the other hand, is a bit easier.  YOLOv3 outputs an objectness score, so you could simply grab all rows where objectness exceeds some threshold, for example greater than 0 on raw logits, which corresponds to 0.5 after a sigmoid.
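Here is a minimal sketch of both ideas in plain Python, not a definitive implementation.  It assumes boxes and priors are `(x_min, y_min, x_max, y_max)` tuples in the same coordinate system, and it stores raw box coordinates in the target rows instead of the cell-relative offsets a full YOLOv3 loss would expect.  The names `iou`, `anchorize` and `de_anchorize` are hypothetical, and in a real model you would do this with tensors.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def anchorize(gt_boxes, gt_classes, priors, num_classes):
    """Build an N x (5 + num_classes) target with one row per prior."""
    targets = [[0.0] * (5 + num_classes) for _ in priors]
    for box, cls in zip(gt_boxes, gt_classes):
        # Pick the prior that overlaps this ground truth box the most.
        best = max(range(len(priors)), key=lambda i: iou(box, priors[i]))
        targets[best][0:4] = [float(v) for v in box]  # box coordinates
        targets[best][4] = 1.0                        # objectness: an object lives here
        targets[best][5 + cls] = 1.0                  # one-hot class label
    return targets


def de_anchorize(rows, threshold=0.5):
    """Keep rows whose objectness exceeds the threshold as (box, class_id) pairs."""
    detections = []
    for row in rows:
        if row[4] > threshold:
            class_scores = row[5:]
            class_id = max(range(len(class_scores)), key=class_scores.__getitem__)
            detections.append((row[:4], class_id))
    return detections


# Toy example: four priors tiling a 20 x 20 image, two ground truth boxes.
priors = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20), (10, 10, 20, 20)]
gt_boxes = [(1, 1, 9, 9), (11, 12, 19, 20)]
gt_classes = [0, 2]

targets = anchorize(gt_boxes, gt_classes, priors, num_classes=3)
recovered = de_anchorize(targets)
print(recovered)
```

One thing this sketch glosses over: if two ground truth boxes pick the same prior, the second overwrites the first, so a real anchorizer needs some rule for resolving those collisions.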


In any case, this problem caught me off guard the first time I ran into it, and it is not usually very well documented in research papers.  If you are going to be building a YOLOv3 model, don’t forget that you’ll need to add some sort of component that can handle the conversion of your targets!  If you have additional questions, or you want more information on how to go about building an “anchorizer”, please feel free to reach out in the comments!




