Lambda: The Infinite Core CPU
I've written before about how AWS Lambda is proving to be a powerful tool for data-intensive workloads (such as model selection in machine learning). Parallelization with Lambda is as easy as executing as many functions as you need to cover the full depth and breadth of your dataset, in real time as it grows. It's like having a CPU with virtually infinite cores.
Although each Lambda function has a relatively small footprint (1.5 GB of RAM, 300 seconds of execution time), the real power comes from pulling them together into a large, parallel system. They are the lionbots, coming together to form Voltron.
Eric Jonas, a postdoctoral researcher at the legendary AMPLab, gave a great illustration of this, pulling 25 TFLOPS of performance using plain old Python functions, with over 60 GB/sec read and 50 GB/sec write to S3. That's close to in-memory speeds, with nearly linear scaling of read and write throughput to S3.
Graphs from Eric's blog, showing peak and effective performance on Lambda using PyWren, and execution times per function, with thanks for letting me re-use them here.
The open source framework, PyWren, effectively implements the map step of map/reduce on Lambda; the full code is available and well worth checking out.
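To make the "map" idea concrete, here is a minimal local sketch. It uses Python's standard `concurrent.futures` module as a stand-in for remote execution; it is not PyWren's API, and `parallel_map` and `square` are hypothetical names chosen for illustration. On Lambda, each call would run as its own function invocation rather than as a local thread.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, max_workers=8):
    """Apply func to every item in parallel; results keep input order.

    Local stand-in only: the threads here play the role that separate
    Lambda invocations would play in a framework such as PyWren.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(func, items))


def square(x):
    # Hypothetical plain old Python function to fan out.
    return x * x


if __name__ == "__main__":
    print(parallel_map(square, range(10)))
```

The point of the shape is that the function being mapped knows nothing about how it is distributed; only the executor changes when the work moves from local threads to remote functions.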
Map/Reduce With Lambda
We're also making a new reference architecture for map/reduce on Lambda available today. Nicknamed 'BigLambda', this reference architecture takes a mapper and a reducer and scales them against data in S3, for ad-hoc analysis at scale with zero administration.
We ran some simple benchmarks using the Big Data Benchmark, on 126.8 GB of data across 775M rows. All times are in seconds.
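As a rough illustration of the mapper/reducer split, the sketch below sums values per key over several chunks of data. It runs locally and is not the BigLambda code: `mapper`, `reducer`, `map_reduce`, and the `key,value` line format are all assumptions made for this example. In a Lambda setting, each chunk would typically be an S3 object handled by its own mapper invocation.

```python
from collections import Counter
from functools import reduce


def mapper(chunk):
    """Return partial sums per key for one chunk of 'key,value' lines."""
    partial = Counter()
    for line in chunk:
        key, value = line.split(",")
        partial[key] += float(value)
    return partial


def reducer(left, right):
    """Merge two partial results by adding values key by key."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds values for matching keys
    return merged


def map_reduce(chunks):
    # Map phase: one independent call per chunk (parallelizable).
    partials = [mapper(chunk) for chunk in chunks]
    # Reduce phase: fold the partial results into a single result.
    return dict(reduce(reducer, partials, Counter()))


if __name__ == "__main__":
    chunks = [["a,1.0", "b,2.0"], ["a,3.0", "c,4.0"]]
    print(map_reduce(chunks))
```

Because the reducer is associative, partial results can be merged in any grouping, which is what lets the map phase scale out freely.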
| Technology | Scan 1a | Scan 1b | Aggregate 2a |
| --- | --- | --- | --- |
| Amazon Redshift (HDD) | 2.49 | 2.61 | 25.46 |
| Impala - Disk - 1.2.3 | 12.015 | 12.015 | 113.72 |
| Impala - Mem - 1.2.3 | 2.17 | 3.01 | 84.35 |
| Shark - Disk - 0.8.1 | 6.6 | 7 | 151.4 |
| Shark - Mem - 0.8.1 | 1.7 | 1.8 | 83.7 |
| Hive - 0.12 YARN | 50.49 | 59.93 | 730.62 |
| Tez - 0.2.0 | 28.22 | 36.35 | 377.48 |
| "BigLambda" | 39 | 47 | 200 |
As you can see for these queries, the reference implementation performs reasonably well; it's nowhere near Redshift performance for the same queries, but for the price it really can't be beat today:
- Scan 1a: $0.00477
- Scan 1b: $0.0055
- Aggregate 2a: $0.1129
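For a feel of where per-query costs like these come from, here is a hedged back-of-the-envelope estimator. It assumes Lambda's published pricing model of a charge per GB-second of compute plus a charge per request; the default prices below ($0.00001667 per GB-second, $0.20 per million requests) are assumptions that should be checked against current pricing, and the function ignores billing-increment rounding and any free tier. The workload in the usage example is hypothetical and does not reproduce the benchmark figures above.

```python
def estimate_lambda_cost(invocations, seconds_each, memory_gb,
                         price_per_gb_second=0.00001667,
                         price_per_million_requests=0.20):
    """Estimate the cost in dollars of a batch of Lambda invocations.

    Prices are assumed defaults, not authoritative values.
    """
    # Compute charge: total GB-seconds consumed across all invocations.
    compute = invocations * seconds_each * memory_gb * price_per_gb_second
    # Request charge: flat fee per invocation.
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


if __name__ == "__main__":
    # Hypothetical workload: 100 functions, 10 seconds each, 1.5 GB of RAM.
    print(f"${estimate_lambda_cost(100, 10, 1.5):.6f}")
```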
The reference architecture is available on GitHub.
An Idea Whose Time Has Come
For further reading, there are other examples of using Lambda for serverless map/reduce: FireEye wrote up a nice post on the AWS Big Data blog. Sungard also have an interesting piece on using Lambda with big data.
Take the reference implementation for a spin today.
Pro tip: By default, Lambda allows 100 concurrent functions to execute, a safety limit to prevent runaway scripts. You can request that this limit be raised if you need more.
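If you stay on the default limit, one simple client-side approach is to submit work in waves no larger than the limit. The sketch below is only an illustration: `waves_needed` and `batches` are hypothetical helpers, and the default of 100 is taken from the limit mentioned above.

```python
import math


def waves_needed(total_tasks, concurrency_limit=100):
    """Number of sequential waves needed to run all tasks under the limit."""
    return math.ceil(total_tasks / concurrency_limit)


def batches(items, concurrency_limit=100):
    """Yield successive slices of items, each no larger than the limit."""
    items = list(items)
    for start in range(0, len(items), concurrency_limit):
        yield items[start:start + concurrency_limit]


if __name__ == "__main__":
    tasks = range(250)
    print(waves_needed(len(tasks)))
    print([len(batch) for batch in batches(tasks)])
```

With 250 tasks and a limit of 100, the work splits into three waves, so the wall-clock time is roughly three times the duration of a single function.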