Deep Learning is delivering impressive results in several Machine Learning fields: Computer Vision, Speech Recognition, Machine Translation, and ranking problems. The latter includes Recommender Systems that focus on learning to rank instead of predicting ratings.
This is the second part of the article describing my experiment with Semantic Segmentation, so if you have just landed on this article, I encourage you to have a look here before continuing.
The fundamentals of current Deep Learning techniques have been known for decades. Artificial Neural Networks date from the 60s, Backpropagation is 30 years old, and the 7-layer LeNet Convolutional Neural Network was presented in 1998 by Yann LeCun, current director of AI Research at Facebook. However, the recent success of Deep Learning is mainly due to 3 reasons: more (labeled) data, more computing power, and some tricks.
In fact, the Deep Learning explosion is powered by GPUs! And when it comes to accelerating neural networks, GPU is synonymous with NVIDIA and CUDA. Curiously, GPU development was originally driven by video gamers. And some of the latest advances in Artificial Intelligence are also related to video games.
Training the ResNet-50 model for 90 epochs on the ImageNet dataset can take approximately a week on a single GPU. Even with an optimized implementation and a pretty good GPU like the Pascal Titan X, it takes more than 3 days. And bigger models with bigger datasets usually get better results. This emphasizes the need to accelerate the training of Deep Learning models. But this is a real challenge!
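To get a feeling for these numbers, here is a small back-of-the-envelope sketch that converts a training time into the sustained throughput it implies. The ImageNet training set size (about 1.28 million images) is a well-known figure rather than something from our benchmark, and the function name is just illustrative.

```python
# Back-of-the-envelope: what throughput do the training times above imply?
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k training set size (assumption, not from our benchmark)
EPOCHS = 90


def implied_throughput(days):
    """Images per second needed to finish EPOCHS epochs in `days` days."""
    seconds = days * 24 * 3600
    return EPOCHS * IMAGENET_TRAIN_IMAGES / seconds


for days in (7, 3):
    print(f"{days} days -> about {implied_throughput(days):.0f} images/s")
```

So a week of training corresponds to roughly 190 images per second, and 3 days to roughly 445 images per second. Any speedup has to come from raising that sustained rate.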
At BBVA Next Technologies Labs we ran some internal benchmarks testing several data-parallel implementations of well-known image classification models, mainly LeNet, AlexNet and ResNet50. Our scenario included the MNIST and CIFAR10 datasets, also well known though much smaller than ImageNet, and implementations on TensorFlow, Keras+TensorFlow and MXNet. The detailed conclusions of our benchmark would exceed the limits of a 5-minute post, so I’ll summarize our first experiences with this topic.
We ran some tests on a p2.8xlarge instance with 8 NVIDIA Tesla K80 GPUs on AWS, and here we share our results for CIFAR10 image classification with the ResNet50 model on MXNet. We followed a synchronous data-parallel Stochastic Gradient Descent implementation configured with mini-batches of 256 samples per GPU. Fortunately, the MXNet examples make it easy to achieve this goal.
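To make the idea of synchronous data-parallel SGD more concrete, here is a minimal sketch in plain Python. It is not the MXNet implementation we benchmarked: the "GPUs" are simulated sequentially, the model is a toy linear regression, and all names and hyperparameters (other than 8 workers and 256 samples per worker) are illustrative assumptions.

```python
import random


def grad_mse(w, b, shard):
    """Gradient of the mean squared error of y ~ w*x + b over one shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def sync_data_parallel_sgd(data, num_workers=8, per_worker_batch=256,
                           lr=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # every worker holds the same copy of the parameters
    global_batch = num_workers * per_worker_batch
    for _ in range(steps):
        batch = rng.sample(data, global_batch)
        # Split the global batch into one equal shard per simulated worker.
        shards = [batch[i * per_worker_batch:(i + 1) * per_worker_batch]
                  for i in range(num_workers)]
        # Each worker computes gradients on its own shard (in parallel on real hardware).
        grads = [grad_mse(w, b, shard) for shard in shards]
        # Synchronization point: average the gradients across workers
        # (an all-reduce or parameter server would do this in a real system).
        gw = sum(g[0] for g in grads) / num_workers
        gb = sum(g[1] for g in grads) / num_workers
        # Everyone applies the same update, so the replicas stay identical.
        w -= lr * gw
        b -= lr * gb
    return w, b


# Toy data: y = 3x + 1 plus a little noise (hypothetical, for illustration only).
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(4096)]
data = [(x, 3 * x + 1 + data_rng.gauss(0, 0.01)) for x in xs]

w, b = sync_data_parallel_sgd(data)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Because the shards have equal size, averaging the per-worker gradients gives the same result as computing the gradient over the whole global batch of 8 x 256 samples. That is why adding GPUs in this scheme effectively means training with a larger batch.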
According to this benchmark, the p2.8xlarge instance scaled nearly linearly with the number of GPUs, with 90% efficiency when scaling up from 1 to 8 GPUs.
On the other hand, we realized that larger batch sizes would be required to completely saturate the 8 K80 GPUs, but at the expense of non-trivial convergence difficulties that we would have to deal with to avoid accuracy degradation.
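As a quick illustration of what the 90% scaling efficiency mentioned above means, here is a small sketch. It assumes the usual definition of scaling efficiency (speedup divided by the number of GPUs); the function names are illustrative, and no measured throughputs are used, only the 90% and 8 GPUs from our test.

```python
def scaling_efficiency(throughput_1, throughput_n, n):
    """Speedup over a single GPU divided by the number of GPUs (1.0 = perfectly linear)."""
    speedup = throughput_n / throughput_1
    return speedup / n


def speedup_from_efficiency(efficiency, n):
    """Inverse view: the speedup implied by a given efficiency on n GPUs."""
    return efficiency * n


speedup = speedup_from_efficiency(0.90, 8)
print(f"90% efficiency on 8 GPUs is a {speedup:.1f}x speedup over 1 GPU")
```

In other words, 8 GPUs were doing the work of a bit more than 7 ideal ones, with the rest lost to communication and synchronization overhead.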
Some general recommendations can be made to efficiently scale training of Deep Learning models:
- Dataset matters: first of all, CIFAR10, with its small images, appears to be too small a dataset to use for benchmarking. The computation required to process each sample is minimal, which makes it difficult to saturate the GPUs. So it’s hard to exploit the possibilities of modern GPUs with such a small dataset.
- Data pipeline matters: even reading data from the hard disk can be a bottleneck, and so can image preprocessing. So you probably need Solid State Disks (SSD) and an efficient pipeline to feed your model.
- Model matters: bigger models usually increase both the computation cost and the network bandwidth requirements. However, the correlation is not the same for all deep learning models. For instance, AlexNet requires a relatively small number of operations for a single forward pass, but it has a huge number of network parameters to share. So it’s especially bandwidth demanding.
- Topology matters: for big models on modern GPUs, the bandwidth of the interconnections is actually the main bottleneck. Even within a single node, there is a big difference between PCIe and the more recent NVLink interconnections. And when going multi-node, latencies get much worse, so InfiniBand or very high speed Ethernet connections (>25 Gbit/s) are needed.
- Batch size and learning rates matter: in order to reduce communication overhead and effectively use all the computing parallelism GPUs offer, we require large batch sizes. But large batch sizes cause convergence difficulties for Stochastic Gradient Descent, degrading accuracy. A recent paper from Facebook AI Research shows that it’s possible to work around this in practice by scaling the learning rate linearly with the batch size and increasing it progressively at the start of training, apart from many other considerations.
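The last recommendation can be sketched in a few lines. This is only an illustration of the idea (scale the learning rate with the batch size and ramp it up gradually), not the exact recipe from the paper or from our benchmark: the reference learning rate of 0.1 for a batch of 256, the 5 warmup epochs and the function names are assumptions, and the later learning rate decay is left out.

```python
def scaled_lr(base_lr, base_batch, global_batch):
    """Linear scaling: the learning rate grows in proportion to the global batch size."""
    return base_lr * global_batch / base_batch


def lr_at_epoch(epoch, base_lr, target_lr, warmup_epochs):
    """Gradual warmup: ramp linearly from base_lr to target_lr, then hold it."""
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


BASE_LR = 0.1           # assumed reference learning rate for a batch of 256
BASE_BATCH = 256
GLOBAL_BATCH = 8 * 256  # 8 GPUs x 256 samples per GPU
WARMUP_EPOCHS = 5       # assumed warmup length

target = scaled_lr(BASE_LR, BASE_BATCH, GLOBAL_BATCH)
for epoch in range(8):
    lr = lr_at_epoch(epoch, BASE_LR, target, WARMUP_EPOCHS)
    print(f"epoch {epoch}: lr = {lr:.2f}")
```

With 8 times the reference batch size, the target learning rate becomes 8 times the reference one. Starting directly at that value tends to be unstable early in training, which is why the sketch ramps it up over the first few epochs.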
Besides, other implementation details also matter, and we will go deeper into this topic in future posts, as the devil is in the details. Anyhow, all benchmarks are wrong, but some are useful. And this post is about to exceed the 5-minute reading barrier 😉
We have to thank Rafael and Francisco from Azken Muga. Thanks to the NVIDIA engineers for providing us with valuable feedback about this post and more tips on benchmarking and efficiently training deep neural networks. Thanks also to Gorka from Ailoveu.
Main image: Pexels