Spark

16th June, 2015

Earlier this week I posted about how the cloud can help remove the constraints of working with data productively. When it comes to big data tools or techniques, there are three variables that impact productivity, that is, the ability to get real work done efficiently: how quickly infrastructure can be provisioned, the breadth of resources available, and how well the tools support iteration.

Pairing infrastructure services that aim to meet these requirements with software designed from the outset to support (and in many cases, accelerate) the iterative nature of building applications yields something greater than the sum of its parts in terms of actually getting real work done.

Enter Spark

Apache Spark is one such tool. If you’re unfamiliar, Spark uses a mixture of in-memory data storage (so-called resilient distributed datasets, or RDDs), graph-based execution, and a programming model designed to be easy to use. The result is a highly productive environment in which data engineers and scientists can crunch data at scale (in some cases, 10x to 100x faster than Hadoop MapReduce).
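
To make that programming model concrete, here is a minimal PySpark sketch of working with an RDD; the bucket path, app name, and “ERROR” filter are illustrative assumptions, not anything from the announcement. It loads a dataset once, caches it in memory, and runs two computations over the same cached data, which is where much of the speedup over MapReduce comes from in iterative work.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# An RDD: a partitioned, fault-tolerant collection that Spark can hold
# in memory across the cluster. The S3 path here is hypothetical.
lines = sc.textFile("s3://my-bucket/logs/*.txt").cache()

# First pass over the data: count error lines.
errors = lines.filter(lambda line: "ERROR" in line)
print(errors.count())

# Second pass reuses the cached data rather than re-reading it from disk.
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(10))

sc.stop()
```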

Today, at the Spark Summit in San Francisco, it was a pleasure to announce that we’re coupling the speed of provisioning and broad resource mix of Amazon EMR with the iterative-friendly programming model of Apache Spark. More on the AWS blog.
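
For the curious, here is a hedged sketch of what spinning up a Spark cluster on EMR can look like from Python with boto3; the cluster name, instance types, and release label are assumptions, and the EMR documentation is the authority on current values.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-cluster",              # hypothetical cluster name
    ReleaseLabel="emr-4.0.0",          # assumed release label; check the docs
    Applications=[{"Name": "Spark"}],  # ask EMR to install Spark on the cluster
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```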

Spark has already been put into production on EMR by customers such as Yelp, the Washington Post, and Hearst, and I’m excited to see how better support in the console and the EMR APIs helps bring Spark to a broader audience.