Inference series I: How to use Caffe with AWS’ Deep Learning AMI for Semantic Segmentation

Believe it or not, it all started with the announcement of the Intel Movidius Neural Compute Stick (NCS). It is a small, pendrive-like device that can run Deep Learning models without a GPU or a powerful machine. We, the Innovation department at BEEVA, were working with both IoT and Deep Learning, and we thought it could be an awesome idea to join both worlds. “Could we add advanced Computer Vision capabilities to a Raspberry Pi?”

The Neural Stick (in blue) compared with other objects. Small and lightweight, it looks like an average pendrive.

Spoiler: while one of the main purposes is to test the Neural Stick device, this article does not reach that part yet. We will get there in another article of this Inference series.

As I had already worked on Image Classification last year, and on Object Detection in another life many years ago when I was a researcher at the university, I wanted to give Semantic Segmentation a try. But what is it exactly? This is the typical case where one picture is worth a thousand words:

Typical applications of Computer Vision. I really love this picture from Facebook AI Research

Semantic segmentation is an extreme case of classification: it’s a per-pixel classification. It means that if your image is of size HxW and you have N categories, your output will be a matrix of size HxWxN: in each pixel you will find an N-dimensional vector with the probabilities of belonging to each class.
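To make the HxWxN shape concrete, here is a minimal sketch in plain Python. The 2x2 "image", the three classes and the raw scores are all made-up values for illustration, and `softmax` and `segment` are hypothetical helper names rather than part of Caffe. Each pixel's scores are turned into an N-dimensional probability vector, and the most likely class is then picked per pixel.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def segment(score_map):
    """Map an H x W x N grid of scores to (probabilities, labels).

    probabilities is H x W x N; labels is H x W and holds, for each pixel,
    the index of its most likely class.
    """
    probs = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[pixel.index(max(pixel)) for pixel in row] for row in probs]
    return probs, labels

# Hypothetical 2 x 2 image with N = 3 classes (0 = background, 1 = car, 2 = dog).
toy_scores = [
    [[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]],
    [[0.3, 2.5, 0.2], [0.1, 0.4, 5.0]],
]
toy_probs, toy_labels = segment(toy_scores)
print(toy_labels)  # [[0, 1], [1, 2]]
```

The `labels` grid is the kind of output you would usually visualize as a colored mask, while the full probability grid is what makes the task so much heavier than classifying the image once.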

As you might expect, this is very computationally expensive. Perfect to stress our new device 🙂 But first, we need to establish a fair benchmark and configure the experiment. Thus, we define:

  • Framework
  • Dataset
  • Model / Architecture
  • Infrastructure
  • Metrics

Benchmark: ins and outs


Framework

The framework chosen for this proof of concept (aka PoC) was Caffe. And I’m sure some people would ask “Caffe? In 2017? Why not Caffe2 or TensorFlow?” Besides my own preferences (I think Caffe is really great), the main reason is that the Neural Stick only worked with Caffe, so there was no other option. Another big point was Caffe’s Model Zoo, a repository that contains many pre-trained models ready to use, with references to their research papers and so on.

Note: It seems that TensorFlow support has been added very recently too, probably motivated by its huge community.


Dataset

There are many typical datasets for Computer Vision tasks, such as MNIST, CIFAR-10 / 100, COCO (previously known as MS-COCO) and the Pascal VOC dataset (whose most widely used version is the 2012 one). Truth be told, a less classical dataset like Cityscapes was very tempting (did I say I love Self-Driving Cars?)


Cityscapes dataset sample

But after settling on the Deep Learning framework (Caffe) and having a look at its Model Zoo, the natural choice was the Pascal VOC 2012 dataset, as there were already pretrained models based on it.

This dataset comes from the Visual Object Classes (VOC) Challenge hosted in 2012 by a research group at the University of Oxford. It’s a very interesting dataset, as it is intended for a variety of visual tasks, such as Image Classification, Object Detection, Semantic Segmentation, Action Classification and Person Layout.

For our purposes of experimenting with Semantic Segmentation, we focus only on the SegmentationClass data split. The following image shows a sample of the data we used in this task. Along with the real image come the ground truths for the Object segmentation task and the Class segmentation task.

When I was running this experiment and discussing it with my colleagues, two questions came up: what does ground truth mean, and what is the difference between Object segmentation and Class segmentation? I don’t really like the Wikipedia definition, so in my own words, ground truth means the annotated data, or the labeled data. Imagine a human looking at a picture and writing a CSV with the word dog. The picture together with the CSV constitutes the ground truth. It is the data for which you already know the result, or the target, before training.

As for the second question, the difference is easy to spot with an example: imagine a scene with two objects of the same class, let’s say a couple of cars. In Class segmentation, both objects will have the same ID, while in Object segmentation each of them will have a different ID (different, of course, from the ID of whatever is considered background). The ground truth of an image for a semantic segmentation task is another image where ALL the pixels have an ID. Can you imagine how tedious this manual labeling process could be?
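The two-cars example can be written down as a pair of tiny ground truth masks. This is only an illustrative sketch: the 3x6 grids, the ID values and the `foreground_ids` helper are invented here and do not reflect the actual IDs or file format used by Pascal VOC.

```python
BACKGROUND = 0  # hypothetical ID for background pixels

# Class segmentation: both cars share the same (made-up) class ID, 1.
class_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

# Object segmentation: each car gets its own ID (1 and 2).
object_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

def foreground_ids(mask, background=BACKGROUND):
    """Return the sorted set of non-background IDs present in a mask."""
    return sorted({pixel for row in mask for pixel in row} - {background})

print(foreground_ids(class_mask))   # [1]
print(foreground_ids(object_mask))  # [1, 2]
```

Both masks cover exactly the same pixels; only the IDs assigned to them differ, and every pixel, background included, carries one.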


Examples of the Semantic Segmentation data provided in Pascal.


Model / Architecture

For this task, a standard Fully Convolutional Network (FCN) architecture does the job.

Image from “Fully Convolutional Networks for Semantic Segmentation”. PAMI 2016


While there are more advanced architectures nowadays (the Computer Vision field evolves extremely fast), for instance the U-Net architecture, or FCN in combination with Conditional Random Fields (CRF), once again the presence of a pretrained model ready to download and use was the key. One could argue that with this decision we will not get the best accuracy achievable nowadays… and that is totally true. But the focus is to compare relative values between the CPU and the Neural Stick.

How much faster is the Neural Stick than running my network on the CPU? How much accuracy do I lose when doing inference with half precision (16 bits) on the Neural Stick instead of single precision (32 bits) on my CPU? Is that loss acceptable, or can my application not afford to lose any accuracy at all, so that I should choose to go slower?
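To get a feel for what half precision does to a number, the sketch below rounds a few values to 16-bit and 32-bit floats using only Python's `struct` module. It illustrates IEEE 754 rounding only, not how the Neural Stick computes internally. The probabilities and timings are made-up values, and `to_half`, `to_single`, `argmax` and `speedup` are hypothetical helper names.

```python
import struct

def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_single(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return values.index(max(values))

def speedup(cpu_seconds, device_seconds):
    """How many times faster the device is than the CPU."""
    return cpu_seconds / device_seconds

# Hypothetical class probabilities for a single pixel.
probs = [0.7012345, 0.2987001, 0.0000654]
half_error = max(abs(p - to_half(p)) for p in probs)
single_error = max(abs(p - to_single(p)) for p in probs)
print(half_error > single_error)                            # True
print(argmax([to_half(p) for p in probs]) == argmax(probs)) # True

# A near tie: both values collapse to 0.5 in half precision,
# so the winning class can change.
near_tie = [0.5001, 0.5002]
print(argmax(near_tie), argmax([to_half(p) for p in near_tie]))  # 1 0

# Made-up timings: 2.0 s per image on CPU versus 0.5 s on the device.
print(speedup(2.0, 0.5))  # 4.0
```

For a pixel with a clear winner, the rounding error is far too small to change the predicted class. It only matters for near ties, which is why the accuracy loss has to be measured over a whole dataset rather than guessed.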


Infrastructure

To ensure reproducibility and more standard conditions than “it works on my laptop”, I decided to run the experiment on an m5.large instance in Amazon Web Services (no GPU) with this AMI: Deep Learning AMI Ubuntu Linux - 2.4_Oct2017 (ami-37bb714d).

Deep Learning AMIs are a series of AMIs (Amazon Machine Images) prepared by Amazon to speed up the development and deployment of Deep Learning algorithms. Just launch one and you will find a machine ready to work with Keras, Caffe, TensorFlow and so on, with GPU or CPU. Besides the Deep Learning frameworks, it comes with NVIDIA drivers, CUDA and cuDNN installed and configured.

The last part of the benchmark, the results and some of the conclusions will be in the next article. I will update this article with the link when the second part is ready.

Update: here it is! The second part.


Main image: Pexels

The opinions expressed by the author are entirely their own and do not always represent the opinion of BBVA Next Technologies.
