
# Training Deep Learning Models on Multi-GPUs

27 Feb 2019

**Deep Learning disruption** is delivering amazing results in several Machine Learning fields: Computer Vision, Speech Recognition, Machine Translation and ranking problems. The latter applies to Recommender Systems that focus on learning to rank instead of predicting ratings.

This is the 2nd part of an article describing my experiment with Semantic Segmentation, so if you just landed on this article, I encourage you to have a look here before continuing.

The fundamentals of current Deep Learning techniques have been known for decades. Artificial Neural Networks date from the 1960s, Backpropagation is over 30 years old, and the 7-layer LeNet Convolutional Neural Network was presented in 1998 by Yann LeCun, current director of AI Research at Facebook. However, Deep Learning's recent success is mainly due to 3 reasons: **more (labeled) data, more computing power**, and some tricks.

Actually, the Deep Learning explosion is **powered by GPUs**! And in terms of accelerating neural networks, GPU is synonymous with **NVIDIA** and **CUDA**. Curiously, GPU development was originally funded by video gamers. And some of the latest advances in Artificial Intelligence are also related to video games.

## The challenge

Training 90 epochs of a ResNet-50 model on the ImageNet dataset can take approximately a week on a single GPU. Even with an optimized implementation and a pretty good GPU like the Pascal Titan X, it takes more than 3 days. And bigger models with bigger datasets usually get better results. This emphasizes the need to **accelerate training** of Deep Learning models. But this is a real challenge!

At BBVA Next Technologies Labs we performed some internal benchmarks testing several **data-parallel** implementations of well-known image classification models, mainly LeNet, AlexNet and ResNet-50. Our scenario included the MNIST and CIFAR-10 datasets, also well known though much smaller than ImageNet, and implementations in TensorFlow, Keras+TensorFlow and MxNet. Detailed conclusions of our benchmark would exceed the limits of a 5-minute post, so I'll summarize our first experiences on this topic.

We performed some tests on a p2.8xlarge instance with 8 NVIDIA Tesla K80 GPUs on AWS, and we share our results on MxNet CIFAR-10 image classification with the ResNet-50 model, following a synchronous data-parallel Stochastic Gradient Descent implementation configured with mini-batches of 256 samples per GPU. Fortunately, the MxNet examples make it easy to achieve this goal.
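The synchronous data-parallel scheme can be sketched in plain Python/NumPy. This is a minimal CPU-only simulation, not our benchmark code: the per-GPU workers are simulated as list entries, the all-reduce is a simple average, and the toy loss and target vector are made up for illustration (real MxNet code instead hands a list of `mx.gpu(i)` contexts to the training module):

```python
import numpy as np

def data_parallel_sgd_step(weights, batch, grad_fn, num_gpus, lr):
    """One synchronous data-parallel SGD step, simulated on CPU.

    The global mini-batch is split evenly across num_gpus workers, each
    worker computes the gradient on its own shard, and the gradients are
    averaged (the "all-reduce") before a single shared weight update.
    """
    shards = np.array_split(batch, num_gpus)
    grads = [grad_fn(weights, shard) for shard in shards]  # one gradient per "GPU"
    avg_grad = np.mean(grads, axis=0)                      # synchronous all-reduce
    return weights - lr * avg_grad

# Toy example: minimize the squared distance to a target vector.
target = np.array([1.0, 2.0, 3.0])
grad_fn = lambda w, shard: 2.0 * (w - target)  # toy gradient, ignores the shard
w = np.zeros(3)
for _ in range(100):
    batch = np.zeros((256, 3))  # stand-in for 256 training samples
    w = data_parallel_sgd_step(w, batch, grad_fn, num_gpus=8, lr=0.1)
# w has converged to the target vector
```

The key point is that every worker sees the same averaged gradient, so all replicas stay in lockstep, which is what makes the scheme equivalent to single-GPU SGD with an 8x larger batch.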

According to the previous benchmark, the p2.8xlarge scaled nearly linearly with the number of GPUs, with 90% efficiency when scaling up from 1 to 8 GPUs.
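The efficiency figure is simply the measured speedup divided by the ideal linear speedup. A tiny helper makes the arithmetic explicit (the 800 s and 111 s epoch times below are illustrative assumptions, not our measured values):

```python
def scaling_efficiency(t_single, t_multi, num_gpus):
    """Efficiency = actual speedup / ideal (linear) speedup."""
    return (t_single / t_multi) / num_gpus

# 90% efficiency on 8 GPUs corresponds to a ~7.2x speedup: e.g. an epoch
# taking 800 s on 1 GPU would finish in roughly 111 s on 8 GPUs.
eff = scaling_efficiency(800.0, 111.1, 8)  # ~0.90
```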

However, we realized larger batch sizes would be required in order to completely saturate the 8 K80 GPUs, at the expense of non-trivial convergence difficulties we would have to deal with to avoid accuracy degradation.

## Our conclusions

Some general recommendations can be made to efficiently scale training of Deep Learning models:

- **Dataset** matters: first of all, CIFAR-10 appears to be a **too small dataset** to use for benchmarking, as the computation required for processing each sample is minimal. This makes it difficult to adequately saturate GPU utilization, so it's hard to exploit the capabilities of modern GPUs with such a small dataset.
- **Data pipeline** matters: even the way data is read from the hard disk can be a bottleneck, as can issues related to image preprocessing. So you probably need Solid State Disks (SSDs) and an efficient pipeline to feed your model.
- **Model** matters: bigger models usually increase both computation cost and network bandwidth requirements. However, the correlation is not the same for all deep learning models. For instance, *AlexNet* requires a relatively small number of operations for a single forward pass, but it has a huge number of network parameters to share, so it's especially bandwidth demanding.
- **Topology** matters: for big models on modern GPUs, the **bandwidth** of the interconnections is actually the main bottleneck. Even within a single node there is a big difference between PCIe and recent NVLink interconnections. Going multi-node, latencies get much worse, so InfiniBand or very high-speed Ethernet connections (>25 Gbit/s) are needed.
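To see why AlexNet is bandwidth-bound, it helps to estimate the gradient traffic per synchronous update. The sketch below uses approximate published parameter counts (~61M for AlexNet, ~25.6M for ResNet-50) and fp32 gradients; the link bandwidth is an illustrative assumption and the result is a lower bound that ignores protocol overhead and the all-reduce algorithm:

```python
def gradient_sync_bytes(num_params, bytes_per_param=4):
    """Bytes of gradient data per worker per synchronous update (fp32)."""
    return num_params * bytes_per_param

def sync_time_seconds(num_params, bandwidth_bytes_per_s):
    """Rough lower bound: one full gradient transfer over the given link."""
    return gradient_sync_bytes(num_params) / bandwidth_bytes_per_s

ALEXNET_PARAMS = 61_000_000    # ~61M parameters (approximate)
RESNET50_PARAMS = 25_600_000   # ~25.6M parameters (approximate)

# AlexNet moves more than twice the gradient traffic of ResNet-50 per
# update, despite needing far fewer FLOPs per forward pass.
alexnet_mb = gradient_sync_bytes(ALEXNET_PARAMS) / 1e6   # ~244 MB
t = sync_time_seconds(ALEXNET_PARAMS, 12e9)              # ~12 GB/s effective PCIe link
```

If compute per step shrinks (small model, small images) while this transfer stays fixed, the interconnect dominates, which is exactly the AlexNet situation described above.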

**Batch size** and **learning rates** matter: in order to reduce communication overhead and effectively use all the computing parallelism GPUs offer, we require large batch sizes. But large batch sizes cause convergence difficulties for Stochastic Gradient Descent, degrading accuracy. A recent paper from Facebook AI Research shows it's possible to practically circumvent this situation by scaling the learning rate linearly with the batch size and ramping it up gradually at the start of training, apart from many other considerations.
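The linear scaling rule from that paper, plus its gradual warmup, is easy to sketch. The 0.1 base learning rate and 256-sample reference batch follow the paper's ResNet-50 recipe; the warmup length below is an arbitrary example value:

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    """Linear scaling rule: multiply the reference LR by the batch-size ratio."""
    return base_lr * batch_size / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    """Ramp the LR linearly from near 0 up to the scaled target, then hold it."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# 8x the batch size -> 8x the learning rate (0.1 * 2048 / 256 = 0.8),
# but reached gradually rather than applied from step 0.
target = scaled_lr(0.1, 2048)
lr_schedule = [warmup_lr(target, s, warmup_steps=100) for s in range(200)]
```

Without the warmup, jumping straight to the large learning rate tends to diverge early in training, which is the accuracy degradation mentioned above.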

Source: https://research.fb.com/publications/accurate-large-minibatch-sgd-training-imagenet-in-1-hour/

Besides, other implementation details also matter, and we will go deeper into this topic in future posts, as the devil is in the details. Anyhow, all benchmarks are wrong, but some are useful. And this post is going to exceed the barrier of 5 minutes of reading 😉

## Acknowledgements

We have to thank Rafael and Francisco from Azken Muga. Thanks to the NVIDIA engineers for providing us valuable feedback about this post and more tips about benchmarking and efficiently training deep neural networks. And thanks to Gorka from Ailoveu.

Main image: Pexels

The opinions expressed by the author are entirely his own and do not always represent the opinion of BBVA Next Technologies.
