To The Stars

BY MATT WOOD 


Introducing P2: Better AI in Less Time

Last night, we introduced a new instance family to Amazon EC2: P2. There is a lot of firepower in these new instances for floating point performance: 16 GK210 GPUs (on 8 Tesla K80s) with a combined 192 GB of GPU memory, 39,936 parallel processing cores, 70 teraflops of single precision floating point performance, over 23 teraflops of double precision floating point performance, and GPUDirect technology for higher bandwidth, lower latency peer-to-peer communication between GPUs.

That last part is important, as it paves the way for dramatic improvements in artificial intelligence, where large, data-intensive workloads play such an important role in training. These improvements allow larger data sets to be used in practice, improving the quality and scope of intelligent apps.

Scaling AI with Multiple GPUs

For example, the more images and metadata you can use to train your computer vision system, the better that system will be able to see and understand the world around it. GPUs are tuned for fast operations on multi-dimensional vectors (and they're very good at it, by design), but in some cases all those GPU cores can process data faster than they can access it.
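
You can see this behavior for yourself. Below is a minimal sketch (ours, not benchmark code from this post) that times a large matrix multiply on the CPU and on a GPU with MXNet NDArrays; MXNet operations are asynchronous, so we wait for all pending work to finish before reading the clock.

    # Rough CPU vs. GPU timing of a dense matrix multiply with MXNet.
    import time
    import mxnet as mx

    def time_dot(ctx, n=4096, repeat=5):
        a = mx.nd.ones((n, n), ctx=ctx)
        b = mx.nd.ones((n, n), ctx=ctx)
        mx.nd.dot(a, b).wait_to_read()   # warm up and synchronize
        start = time.time()
        for _ in range(repeat):
            mx.nd.dot(a, b)              # enqueued asynchronously
        mx.nd.waitall()                  # block until all pending work finishes
        return (time.time() - start) / repeat

    print('CPU: %.3fs per multiply' % time_dot(mx.cpu()))
    print('GPU: %.3fs per multiply' % time_dot(mx.gpu(0)))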

Physically accessing or copying data around can become a rate-limiting step when scaling to larger numbers of cores or larger data sets. This can become an intractable issue when training new artificial intelligence models: long training times make it harder to iterate on algorithm development, or to retrain and improve existing models with fresh (and increasingly large) datasets.

The P2 instances address this issue by improving the throughput of data to the cores on the GPUs (through higher bandwidth, lower latency, and smart CUDA memory management). The better that works, the more data we can practically use to develop novel applications and improve existing models.
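
As a concrete (if simplified) illustration, here is a short sketch of a direct device-to-device copy in MXNet. Whether the transfer actually travels peer-to-peer over GPUDirect depends on the driver and GPU topology, but the calling code is the same either way.

    # Copy an array that lives on GPU 0 directly to GPU 1.
    import mxnet as mx

    x = mx.nd.ones((1024, 1024), ctx=mx.gpu(0))   # data resident on GPU 0
    y = x.copyto(mx.gpu(1))                       # device-to-device copy
    y.wait_to_read()                              # synchronize before use
    print(y.context)                              # gpu(1)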

Let's take a look at the impact this has on artificial intelligence. We trained several image classification convolutional networks (using MXNet and DeepMark) on single g2.8xlarge and p2.8xlarge instances; the numbers below show images processed per second for a single iteration (forward, backward, and update). We presented as large a batch as memory would allow on the G2, and dramatically ramped up the data volume for the P2, with a batch size four times larger.

                  G2 (images/sec)   P2 (images/sec)   Speedup
    ResNet-152               84.9             311.8     3.67x
    AlexNet                 927.5            4052.6     4.37x
    VGG                      24.4             126.5     5.20x

These are sizable gains (the results for AlexNet and VGG are better because they require more GPU bandwidth than ResNet): we were able to increase the number of images processed per second by between 3.67x and 5.2x on a P2. That translates to better models with less training time across a broad set of AI use cases. In fact, we could have pushed this further by increasing the number of images in each batch, but in practice larger batch sizes are often undesirable, as they can affect the convergence of the algorithm.
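
For the curious, here is a rough sketch of how an images-per-second figure like those above can be measured: time a single iteration (forward, backward, and update) over one batch. The tiny fully connected network and the batch size are stand-ins, not the actual benchmark model.

    # Time one training iteration and report images/sec (illustrative only).
    import time
    import mxnet as mx

    batch_size = 128                                  # hypothetical batch size
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=1000)
    net = mx.sym.SoftmaxOutput(data=net, name='softmax')

    mod = mx.mod.Module(net, context=mx.gpu(0))
    mod.bind(data_shapes=[('data', (batch_size, 3 * 224 * 224))],
             label_shapes=[('softmax_label', (batch_size,))])
    mod.init_params()
    mod.init_optimizer(optimizer='sgd')

    batch = mx.io.DataBatch(
        data=[mx.nd.ones((batch_size, 3 * 224 * 224), ctx=mx.gpu(0))],
        label=[mx.nd.zeros((batch_size,), ctx=mx.gpu(0))])

    mod.forward(batch, is_train=True)                 # warm-up iteration
    mod.backward(); mod.update(); mx.nd.waitall()

    start = time.time()
    mod.forward(batch, is_train=True)                 # forward
    mod.backward()                                    # backward
    mod.update()                                      # update
    mx.nd.waitall()                                   # wait for async GPU work
    print('images/sec: %.1f' % (batch_size / (time.time() - start)))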

We can illustrate the scaling characteristics of the P2 GPU cores by running the ResNet-152 example on different numbers of GPUs. In an ideal world, performance would increase linearly with the number of GPUs. ResNet, built again using MXNet, comes close to that ideal: scaling to 16 GPUs gives a 14.4x overall speedup, or roughly 90% of the ideal 16x.
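
Spreading work across those GPUs takes very little code in MXNet: pass a list of contexts and each batch is split across the devices, with gradients aggregated on every update. A minimal sketch (again with a stand-in network rather than the real ResNet-152) looks like this:

    # Data-parallel training across all 16 GPUs on a single instance.
    import mxnet as mx

    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=1000)  # stand-in network
    net = mx.sym.SoftmaxOutput(data=net, name='softmax')

    ctx = [mx.gpu(i) for i in range(16)]    # 16 GK210 GPUs on a p2.16xlarge
    mod = mx.mod.Module(net, context=ctx)   # same Module API, many devices
    # mod.fit(train_iter, optimizer='sgd', num_epoch=10)  # train_iter: any DataIter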

Available Today

With more cores, improved memory bandwidth, lower latency, and better memory management, P2 is an ideal platform for building AI models. With up to 20 Gbps of network connectivity, workloads can scale even further by using multiple GPUs across multiple instances.
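
To sketch what that multi-instance case can look like per worker: MXNet ships a distributed key-value store that synchronizes gradients over the network (which is where that 20 Gbps matters). The workers themselves are normally started with MXNet's cluster launch tooling, which we don't show here, so treat this as an outline rather than a recipe.

    # Per-worker outline for distributed, synchronous training.
    import mxnet as mx

    kv = mx.kvstore.create('dist_sync')     # parameter server, synchronous updates
    ctx = [mx.gpu(i) for i in range(8)]     # 8 GPUs on each p2.8xlarge worker
    # mod = mx.mod.Module(net, context=ctx) # net: your training symbol
    # mod.fit(train_iter, kvstore=kv, optimizer='sgd', num_epoch=10)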

P2 instances are available for all customers in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) Regions as On-Demand Instances, Spot Instances, Reserved Instances, or Dedicated Hosts.


Epilogue

A note on P2 and CPUs: we've talked about GPUs in this post, but since these workloads often also have a strong CPU component, P2 instances feature up to 732 GB of host memory and up to 64 vCPUs on custom Intel Xeon E5-2686 v4 (Broadwell) processors.
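
One place those vCPUs and all that host memory earn their keep is the input pipeline, which decodes and augments images on the CPU while the GPUs train. A small sketch (file name, shapes, and thread count are placeholders):

    # Feed the GPUs from a CPU-side image pipeline.
    import mxnet as mx

    train_iter = mx.io.ImageRecordIter(
        path_imgrec='train.rec',            # hypothetical RecordIO file
        data_shape=(3, 224, 224),
        batch_size=512,
        shuffle=True,
        preprocess_threads=32)              # keep many vCPUs busy decoding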

A note on G2 instances: these are a great option today for 3D application streaming, video encoding, and other server-side graphics workloads.

Updates

10/3/2016: Clarification on the number of GK210 GPUs (16) and Tesla K80s (8).
