12th June, 2015
From television to David Bowie, many new or emerging phenomena go through a transition from fledgling to mature, from an insider's tip to a household name. Technology adoption follows a similar trend, and a good indicator of that transition is a well-established and pervasive answer to the question 'what is X?'.
We saw this quickly with cloud computing, as early adopters, analysts and vendors discussed and contributed to an evolving and changeable definition which sought to bring some shape and color to a related but disparate collection of technologies. The definition that seems to have stuck for the cloud, at least for now, focuses on infrastructure, platforms and software delivered as a service.
But, much like David Bowie’s metamorphosis from Ziggy Stardust to the Thin White Duke, technologies shift and change over time, and so it’s worth revisiting their syntax and definitions.
Big Data, like cloud computing, is a collection of technologies and tools which were popularized, adopted and accelerated by a specific opportunity (some might even say requirement) for developers and businesses to ask questions of increasingly complex data. The adopted, pervasive definition of big data centered on the 'Vs'. I've seen as many as five of these referenced, but the most common are velocity, variety and volume (funny how these definitions seem to come in threes).
These three characteristics, originally intended to define the qualities of the data itself, have become synonymous with the challenges organizations faced when attempting to ask key questions of data which was being freshly generated, or which already resided inside their organization. Velocity, volume and variety weren't celebrated as characteristics which would help answer increasingly complex problems; instead, they were to be feared (a perspective some vendors deliberately propagated in an effort to sell their wares).
In an environment where data center walls can't move, and where procurement and provisioning take months, you can understand why these factors start to loom large and contribute to defining a set of problems. Instead of asking the question you would like answered, you end up having to scope your question around what your available resources can support. This is boxed-in thinking: “what can I ask given that I only have 100 cores and 10TB?”, “how long can I leave this running before I need the answer?”, or, more insidiously on shared resources, “what scope of resources will get me off the queue and onto the cluster soonest?”. Yikes.
Constraints give way to creativity
I believe that cloud computing has come to be recognized as a key enabling foundation for ‘big data’ primarily because the velocity, volume and variety of data cease to be challenges (and in some cases, blockers) to working with that data. The resources required for data of any scale, at any volume and of virtually any complexity are immediately available, and as a result the constraints of boxed-in thinking evaporate; creative analysis, data exploration, reporting, data preparation, transformation and visualization all become quicker and easier.
What next for Big Data?
As a result, we have entered a more mature era of ‘big data’ which is unbounded by many of the original complexities of scale or throughput, where the focus is on working productively with data. Instead of a focus on the data, today we are focused on working backwards from the answers we're looking for, and on fitting together a collection of tools, techniques and best practices to let us ask the right questions.
The ability to work productively with data is the defining characteristic of ‘big data’ today.
The ability to quickly develop, evaluate, adopt and scale new tools is an important part of productivity in big data. It’s tempting to think of data analytics as a linear timeline of generation, collection, storage, computation and collaboration, and at a high level, that’s correct. Take a step closer, though, and you’re more likely to find a collection of diverse and evolving branching workflows which ultimately expose data so that specialists inside an organization can interact with it in as productive a way as possible. A business analyst who lives in Excel all day long, for example, has very different ways of working with data than a data scientist who is hacking on Python scripts. Within those workflows, there are three main categories of components:
Sources of truth: canonical stores of data which act as a single source of truth for a specific set of information. Commonly stored as objects, in a database or as part of a data warehouse.
Streaming data: fresh data arriving as a stream of events, which is processed in some way (stored in logs, aggregated, and so on).
“Task” clusters: a cluster running a software stack which is tuned and optimized for a specific task. These can be ephemeral or long lived, and multiple clusters may be orchestrated to answer a specific question.
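To make the three categories concrete, here is a minimal, purely illustrative Python sketch (all names and data are hypothetical): a canonical store stands in for the source of truth, a list of events stands in for a stream, and a small join function plays the role of a "task" cluster job.

```python
from collections import defaultdict

# Source of truth: a canonical store of data, here just a dict keyed by record id.
source_of_truth = {"user-1": {"plan": "pro"}, "user-2": {"plan": "free"}}

# Streaming data: fresh events arriving over time.
events = [
    {"user": "user-1", "action": "login"},
    {"user": "user-2", "action": "login"},
    {"user": "user-1", "action": "query"},
]

def aggregate(stream):
    """Process the stream in some way: here, count events per user."""
    counts = defaultdict(int)
    for event in stream:
        counts[event["user"]] += 1
    return dict(counts)

def task_cluster_job(store, counts):
    """A 'task cluster' job: join stream aggregates against the source of truth."""
    return {user: {**store[user], "events": n} for user, n in counts.items()}

report = task_cluster_job(source_of_truth, aggregate(events))
```

In a real workflow each piece would be a separate system (an object store, a Kinesis stream, a compute cluster), but the shape of the branching workflow is the same.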
The cloud, of course, provides an environment which is well suited to each of these categories. Object storage is plentiful and cheap; databases are easily provisioned and have low management overheads, even at scale; services such as Kinesis collect and process streaming data; and a platform such as EC2 (or its cousins, EMR and the EC2 Container Service) provides a perfect way to automatically provision and scale clusters with the right mix of resources for a specific task.
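The ephemeral "task" cluster idea can be sketched as a simple spec-building function. Note this is a hypothetical illustration, not a real EMR or EC2 API: every field name here is made up to show the shape of the request an automation layer might assemble.

```python
def ephemeral_cluster_spec(task, instance_type, instance_count):
    """Describe a short-lived cluster tuned for one specific task.

    Illustrative only: the field names are invented, not a real cloud API.
    """
    return {
        "name": f"{task}-cluster",
        "instance_type": instance_type,
        "instance_count": instance_count,
        # Ephemeral: tear the cluster down once the task completes,
        # so you only pay for the resources while the question is being answered.
        "terminate_on_completion": True,
    }

spec = ephemeral_cluster_spec("nightly-aggregation", "m3.xlarge", 10)
```

The point is the mindset shift: instead of scoping the question to a fixed cluster, you scope a disposable cluster to the question.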
The result is that working with data rarely requires us to force a square peg into a round hole; instead we can create and scale clusters which mix ElasticSearch with Hadoop, or Spark with Splunk.
Tomorrow and today
A focus on productivity is a great sign of maturity for a group of technologies and techniques which are still relatively young, and leaves plenty of scope to experiment with tomorrow’s approaches while attempting to make today’s even easier to work with. It’s an exciting time to work in this world, and to bring about the sort of impact analytics is capable of, free of the scary monsters and super creeps of the ‘Vs’.