Inference series I [2nd round]: How to use Caffe with AWS’ Deep Learning AMI for Semantic Segmentation.

This is the second part of the article describing my experiment with Semantic Segmentation, so if you just landed on this article, I encourage you to have a look here before continuing.


Even without prior knowledge of semantic segmentation, it’s not difficult to devise a metric if we think of Semantic Segmentation as an extreme case of classification (with some peculiarities) in which every pixel gets a label: classification accuracy could work, that is, how many pixels are correctly classified over the total.
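As a minimal sketch of that idea, assuming the ground truth and the prediction are both available as integer label maps of the same shape (the function name `pixel_accuracy` and the toy arrays are made up for illustration):

```python
import numpy as np


def pixel_accuracy(ground_truth, prediction):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    ground_truth = np.asarray(ground_truth)
    prediction = np.asarray(prediction)
    if ground_truth.shape != prediction.shape:
        raise ValueError("label maps must have the same shape")
    return float(np.mean(ground_truth == prediction))


# Toy 2x3 label maps: 5 of the 6 pixels agree.
gt = np.array([[0, 0, 1], [0, 1, 1]])
pred = np.array([[0, 1, 1], [0, 1, 1]])
print(pixel_accuracy(gt, pred))
```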

If we review some literature, such as “Fully Convolutional Networks for Semantic Segmentation” or “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”, it seems that there are two prevailing metrics for this task: pixel accuracy and IoU.

  • Pixel accuracy: it’s exactly the metric we thought could be useful for this task. Just count how many pixels are correctly classified and divide by the total. It’s a very easy metric to compute, however… is it the best metric we can use? If the contour of our dog (see header figure) is 1 or 2 pixels bigger or smaller, does that mean we are doing worse and should penalize the algorithm? Do you think that if we gave the picture to 2 different people and asked them to mark the contour of the dog, their marks would match perfectly? And is that really a problem? This metric is useful, but it’s also very strict and not totally aligned with human judgment.
  • IoU: the name stands for Intersection over Union. Overlap criterion, Pascal VOC’s Overlap and Jaccard Index are other common names for the same metric. In this case, slight differences in the prediction can yield nearly the same value, which makes it a more flexible metric without jeopardizing its accuracy. This metric is very common in Object Detection tasks, and an example of how it works can be seen in the next figures:


IoU for Object Detection task. Source: www.pyimagesearch.com


For the Semantic Segmentation task this metric follows the same philosophy, but its computation is a bit more complex and involves computing confusion matrices and histograms. I found a couple of Matlab implementations that do not seem very difficult to port to other languages such as Python.
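To give a rough idea of what the confusion-matrix approach looks like, here is a minimal NumPy sketch. It is not a port of the Matlab implementations mentioned above; the function names and the toy arrays are illustrative, and pixels whose ground-truth label falls outside the class range are simply skipped.

```python
import numpy as np


def confusion_matrix(ground_truth, prediction, num_classes):
    """Count pixels per (true class, predicted class) pair using a histogram."""
    gt = np.asarray(ground_truth).astype(np.int64).ravel()
    pred = np.asarray(prediction).astype(np.int64).ravel()
    # Keep only pixels whose true label is a real class.
    valid = (gt >= 0) & (gt < num_classes)
    flat_index = num_classes * gt[valid] + pred[valid]
    counts = np.bincount(flat_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def per_class_iou(hist):
    """Intersection over Union per class; NaN for classes that never appear."""
    intersection = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        return intersection / union


gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
hist = confusion_matrix(gt, pred, num_classes=2)
iou = per_class_iou(hist)
print(hist)
print(iou, np.nanmean(iou))
```

In this toy case class 0 has one correctly labelled pixel out of a union of two, and class 1 has two out of a union of three.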

To be honest, with the time limitations I had for this experiment I couldn’t study the insides of IoU for this task in depth (I added it to my personal study backlog). However, Pixel Accuracy was just fine to test my ideas.

Time to get hands dirty

With the proper benchmark established, it’s time to put boots on the ground. Here on my GitHub you can find all the code I used for this experiment.

The first trouble when I opened AWS’ Deep Learning AMI was: OK, Caffe is installed. But where? How can I choose the version I want? The next logical step was to open a Python terminal and load Caffe, and it seemed to work… but when I ran my code it crashed because I had no GPU (remember from the previous article that I was using an M5.large instance). So… which version of Caffe is being used when I type import caffe? I dug a bit on the Internet and came across the __file__ attribute, which shows where each module is located. Here is a quick example I ran on my laptop:
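A minimal sketch of that check, using the standard-library `json` module as a stand-in because Caffe may not be installed where you try it (the helper name `module_location` is made up for illustration):

```python
import importlib


def module_location(name):
    """Import a module by name and return the file it was loaded from."""
    module = importlib.import_module(name)
    # Built-in modules have no __file__, hence the default.
    return getattr(module, "__file__", None)


# With Caffe installed, module_location("caffe") would show which build gets imported.
print(module_location("json"))
```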

That’s how I confirmed that I had to switch the Caffe version… but I was not quite sure how to achieve it, so it was time to invoke The Wisdom Council (that’s the name I use in private for my colleagues from the DataLab department of BBVA Next Technologies). After a couple of questions, Ramiro and Pelayo pointed me in the right direction and I was able to modify my code to choose exactly the desired version. Here is a gist of how to do it:
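One common way to get this effect, sketched below under assumptions, is to put the Python directory of the desired build at the front of `sys.path` before importing. The directory shown is a hypothetical placeholder, not the real location on the AMI, and this may differ from what the original gist did.

```python
import sys


def prefer_module_dir(directory):
    """Make Python search `directory` first when resolving imports."""
    if directory in sys.path:
        sys.path.remove(directory)
    sys.path.insert(0, directory)


# Hypothetical path: replace it with the directory of the CPU build on your machine.
CAFFE_CPU_PYTHON_DIR = "/path/to/caffe_cpu/python"
prefer_module_dir(CAFFE_CPU_PYTHON_DIR)

# After this, the import resolves against the chosen directory first:
# import caffe
# caffe.set_mode_cpu()
# print(caffe.__file__)  # confirm which build was actually loaded
```

This has to run before the first `import caffe` in the process, since an already imported module is not reloaded.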

The next problems to solve were related to the ground truth and the accuracy.

How to load the ground truth? It seemed pretty straightforward at the beginning: just load it as a typical image with scipy.misc.imread. But when I used the magic %matplotlib notebook in a Jupyter notebook I quickly realized (see next figure) that there was no clear pattern between the pixel values and their expected values. I had to invoke The Wisdom Council again, but this time the problem was beyond the typical problems we deal with and we couldn’t find the solution.


This is what I got when reading the ground truth as a typical image. Look at the mouse cursor and the pixel value located at the bottom right. Here we don’t have single scalars but triplets representing RGB values.


Reading some PASCAL VOC code turned on a little bulb in my mind and pointed me in the right direction: it seemed that PNG files have a special mode that can store pixel values as a combination of a palette and an index to that palette. The problem was that I was reading the already-combined values, but to get the ground truth I needed to split those. While I was not able to recover the palette, I did recover the indexes this way:
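A minimal sketch of reading the indexes with Pillow follows. Using Pillow is an assumption here, since the text does not name the library; the helper names, the tiny in-memory image and its two palette colors are also only illustrative.

```python
import io

import numpy as np
from PIL import Image


def load_label_indexes(source):
    """Return the palette indexes of a paletted ("P" mode) PNG as a 2-D array."""
    with Image.open(source) as img:
        if img.mode != "P":
            raise ValueError("expected a paletted image, got mode %r" % img.mode)
        return np.array(img)


def load_rgb(source):
    """Return the palette-expanded RGB triplets."""
    with Image.open(source) as img:
        return np.array(img.convert("RGB"))


# Build a tiny paletted PNG in memory to stand in for a real annotation file.
labels = np.array([[0, 1], [1, 255]], dtype=np.uint8)
palette = [0] * 768                  # 256 RGB entries, all black by default
palette[3:6] = [128, 0, 0]           # illustrative color for index 1
palette[765:768] = [224, 224, 192]   # illustrative color for index 255
img = Image.fromarray(labels)        # 8-bit grayscale image
img.putpalette(palette)              # attaching a palette turns it into mode "P"
buffer = io.BytesIO()
img.save(buffer, format="PNG")

buffer.seek(0)
indexes = load_label_indexes(buffer)
buffer.seek(0)
rgb = load_rgb(buffer)
print(indexes.shape, rgb.shape)
```

The first array holds one scalar label per pixel, while the second holds an RGB triplet per pixel.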

And this is the ground truth I was expecting to get from the beginning. Now we have values such as 1 (label corresponding to plane) and 255 (ignore label).


Ready to run the experiment!! And the first result was an accuracy of around 0.something% 🙁 I would have expected something very low, 20%? 30%?, due to some typical coding mistakes, but not almost 0!! What was happening? I would never in 100 years have bet that scipy.misc.imresize was the issue.

How does imresize from scipy work? Probably not as everybody expects (pay attention to the recent big red warning). While looking for answers on the Internet, I found that even Karpathy had commented complaining about this behavior. At least now the function is better documented.

After experimenting a bit, I hit on the trick to do the resize keeping the same package and the same function:
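scipy.misc.imresize has since been removed from SciPy, so the sketch below does not reproduce that trick. Instead it shows a plain NumPy nearest-neighbour resize, a swapped-in technique that avoids the usual pitfalls when resizing label maps: it never rescales values to the 0–255 range and never interpolates between class labels. The function name and the toy array are illustrative.

```python
import numpy as np


def resize_labels_nearest(labels, out_height, out_width):
    """Resize a 2-D label map by picking the nearest source pixel.

    No new values are created, so every output pixel is still a valid label.
    """
    labels = np.asarray(labels)
    in_height, in_width = labels.shape
    # Map each output row/column back to a source row/column.
    rows = (np.arange(out_height) * in_height) // out_height
    cols = (np.arange(out_width) * in_width) // out_width
    return labels[rows[:, np.newaxis], cols]


labels = np.array([[0, 1], [15, 255]], dtype=np.uint8)
resized = resize_labels_nearest(labels, 4, 4)
print(resized)
```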

Once the problems were solved, it was time to get the results.

Note: The whole dataset for the semantic segmentation task comprises 1449 images, but for this analysis it was truncated to just the first 100. The inference was done image by image, without leaning on vectorization techniques using mini-batches.
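Per-image timing of this kind can be sketched as follows. `run_once` is a hypothetical stand-in for whatever forwards one image through the network, and the use of the population variance is an assumption, since the text does not say which estimator was used.

```python
import statistics
import time


def time_per_item(run_once, items):
    """Call run_once on each item, one at a time, and record the elapsed seconds."""
    durations = []
    for item in items:
        start = time.perf_counter()
        run_once(item)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Total, mean, variance and median of a list of durations in seconds."""
    return {
        "total": sum(durations),
        "mean": statistics.mean(durations),
        "variance": statistics.pvariance(durations),  # population variance (assumption)
        "median": statistics.median(durations),
    }


# Stand-in workload; a real run would forward each image through the network.
durations = time_per_item(lambda n: sum(range(n)), [10000] * 5)
print(summarize(durations))
```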


  • Time measurements:

  1. Total: 766.160 s
  2. Mean: 7.661 s
  3. Variance: 0.005 s²
  4. Median: 7.654 s
  • Accuracy measurements:

  1. Without using the ignore label (pixel value 255). Mean Pixel Accuracy: 0.685
  2. Using the ignore label. Mean Pixel Accuracy: 0.745
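The two variants can be sketched like this, assuming that "Mean Pixel Accuracy" means the per-image accuracy averaged over the images (an assumption) and that 255 marks pixels to ignore. The function names and the toy arrays are illustrative.

```python
import numpy as np


def masked_pixel_accuracy(ground_truth, prediction, ignore_label=None):
    """Pixel accuracy, optionally skipping pixels marked with ignore_label."""
    gt = np.asarray(ground_truth)
    pred = np.asarray(prediction)
    if ignore_label is None:
        valid = np.ones(gt.shape, dtype=bool)
    else:
        valid = gt != ignore_label
    if not valid.any():
        return float("nan")  # nothing left to evaluate in this image
    return float(np.mean(gt[valid] == pred[valid]))


def mean_pixel_accuracy(pairs, ignore_label=None):
    """Average the per-image accuracy over (ground_truth, prediction) pairs."""
    scores = [masked_pixel_accuracy(g, p, ignore_label) for g, p in pairs]
    return float(np.mean(scores))


# The last column of the ground truth is an "ignore" border.
gt = np.array([[0, 1, 255], [0, 1, 255]])
pred = np.array([[0, 1, 1], [0, 0, 0]])
print(mean_pixel_accuracy([(gt, pred)]))                    # ignored pixels count as errors
print(mean_pixel_accuracy([(gt, pred)], ignore_label=255))  # ignored pixels are skipped
```

Since the network never predicts 255, every ignored pixel counts as an error in the first variant, which is why masking raises the score.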


If you made it this far, my first conclusion is that you learned a bit about semantic segmentation and how to play with it. The second one is that, if you paid enough attention, I’m pretty sure you spotted some mistakes. In that case, I encourage you to send me a comment or ask any question; this way, we both learn something new 🙂

The time measurements are consistent: images of the same size take the same time to be forwarded through a Neural Network; there is no stochasticity there. But almost 8 seconds to process a 500×500 image could be too much from a UX point of view, which is why other hardware such as GPUs or ASIC components (Intel Neural Compute Stick) comes in handy for this task.

The accuracy measurements are very revealing, especially the difference between using the ignore label and not using it. It highlights the need to avoid penalizing with maths what humans with common sense would not penalize, and also the need for human-aligned metrics.

A question for the most curious readers: have you thought about the pre-trained model and the data used to do the inference? Could there be an overlap that affects the accuracy? 😉

More and better in the next article of Inference series. I hope you liked it.

Header image: Pexels

The opinions expressed by the author are entirely his own and do not always represent the opinion of BBVA Next Technologies.
