High-velocity data (AKA fast data or streaming data) seems to
be all the rage these days. With the increased adoption of Big Data tools,
people have recognized the value contained in this data, and they are looking to
get that value in real time instead of through a time-shifted batch process that
often introduces a six-hour (or more) delay in time-to-value.
High-velocity data has all of the earmarks of a big
technological wave. The technology leaders are building stream processors.
Venture firms are investing money in stream-processing companies. And existing
tech companies are jumping on the bandwagon, associating their products with
this hot trend to make them buzzword-compliant.
Some have asked whether high-velocity data will complement or
replace Big Data, which addresses pooled data, or data at rest. History tells
us that there are different use cases and each will find its market. However,
ceteris paribus, near-real-time insights are far more valuable than delayed
ones. For example, if a user is browsing a commerce website, it is much
more valuable to process the click-stream data and make recommendations in
real time than to email recommendations to that user six hours
later. The same could be said for call centers, online games, and sensor data:
pretty much any insight is more valuable the sooner you can get it and act
upon it.
The early stream processors (Twitter Storm,
Yahoo S4, Google MillWheel, Microsoft StreamInsight, LinkedIn Samza, etc.), their
kissing cousins the complex event processors (Software AG Apama,
TIBCO BusinessEvents, Sybase ESP, etc.), and messaging backbones like
LinkedIn's Kafka are now facing competition from Amazon “The Commoditizer”.
Amazon's offering is Kinesis. Not only does Amazon offer Kinesis as a service
(no capital investment, no laborious set-up or management), it also streams the
entire data set to S3, providing a moving 24-hour window of archived data.
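To see just how low the barrier is, here is a minimal sketch of pushing click-stream events into a Kinesis stream from Python with boto3; the stream name, record layout, and region are made up for illustration:

```python
# Minimal Kinesis producer sketch. Assumes AWS credentials are configured
# and that a stream named "clickstream" already exists (hypothetical name).
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user": "Y", "action": "X", "ts": 1400000000}
kinesis.put_record(
    StreamName="clickstream",         # hypothetical stream, created beforehand
    Data=json.dumps(event).encode(),  # Kinesis records are opaque byte blobs
    PartitionKey=event["user"],       # controls which shard receives the record
)
```

No cluster to stand up, no brokers to manage: a few lines of client code and the service handles the rest.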
Archiving the data in a file system is helpful, but not
enough. Sure, you can sift through that data and “re-process” it, but what you
really want is traditional DBMS capabilities: the ability to interact
with the data by querying it in an ad hoc manner, and to run those queries
across the most complete dataset possible. It is one thing for a stream
processor to maintain simple aggregates like counts; more complex ad hoc queries, like “How
many times did user Y do action X in the last 24 hours?”, are far more valuable.
Obviously, applying DBMS capabilities to streaming data is a huge benefit, but
at what cost?
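To make that concrete, here is a toy sketch of exactly that query against an in-memory SQLite table; the events schema and timestamps are hypothetical, and a real deployment would run the same ad hoc SQL against a vastly larger, continuously updated dataset:

```python
# Toy example: "How many times did user Y do action X in the last 24 hours?"
# Schema and data are hypothetical; SQLite stands in for a real DBMS.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, action TEXT, ts INTEGER)")
db.execute("CREATE INDEX idx_user_action_ts ON events (user, action, ts)")
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("Y", "X", 1000), ("Y", "X", 5000), ("Y", "X", 88000), ("Z", "X", 6000)],
)

now, day = 90_000, 86_400  # epoch-style seconds, illustrative values
(count,) = db.execute(
    "SELECT COUNT(*) FROM events WHERE user = ? AND action = ? AND ts >= ?",
    ("Y", "X", now - day),
).fetchone()
print(count)  # -> 2 (only the events inside the 24-hour window)
```

A stream archive like S3 gives you the raw bytes; answering this kind of question cheaply is what the index and the query engine are for.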
Consider a stream processor handling one million data
elements per second, each 100 bytes in size. Attempting to index and insert
data at this velocity runs a traditional database into serious trouble. The
immediate answer is that one million inserts per second demands an in-memory
DBMS. But now consider that the volume described above adds up to 8.64 TB of data
per day! If you were to store that in DRAM, then (according to Wikipedia's
pricing) it would cost you roughly $126,000 for the DRAM alone. That same data
would cost only about $276 on disk, a 456-times cost advantage. This explains
why Amazon Kinesis simply streams to a disk-based file system (S3) instead of
using a DBMS.
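The arithmetic is easy to check; in the sketch below, the per-gigabyte prices are inferred from the dollar figures above, not quoted from any current price list:

```python
# Back-of-the-envelope check of the volume and cost figures above.
events_per_sec = 1_000_000
bytes_per_event = 100

bytes_per_day = events_per_sec * bytes_per_event * 86_400
gb_per_day = bytes_per_day / 1e9
print(f"Daily volume: {bytes_per_day / 1e12:.2f} TB")  # -> 8.64 TB

# Assumed prices, implied by the article's dollar figures (not market quotes).
dram_usd_per_gb = 14.6
disk_usd_per_gb = 0.032

print(f"DRAM cost: ${gb_per_day * dram_usd_per_gb:,.0f}")  # ~ $126,000
print(f"Disk cost: ${gb_per_day * disk_usd_per_gb:,.0f}")  # ~ $276
print(f"Ratio: {dram_usd_per_gb / disk_usd_per_gb:.0f}x")  # ~ 456x
```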
Some might argue that the trend is their friend, because
DRAM is getting cheaper. Well, DRAM prices had dropped about 33% per year
until 2012, when they flat-lined and then actually increased. More
importantly, data volumes have increased 78% per year and are projected to
continue doing so (see IDC and Gartner). With compounding, even if we assume a 33% annual decrease in
DRAM prices, the growth in data makes it, on a relative basis, roughly 1.7 times more
expensive to store the data in DRAM in 3 years and nearly 6 times more expensive in 10
years. So no, the trend is not your friend if you are using DRAM for streaming-data
storage.
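The compounding behind those numbers is simple to verify:

```python
# Relative cost of keeping the full stream in DRAM, under the article's
# assumptions: data volume grows ~78%/yr, DRAM price falls ~33%/yr.
data_growth = 1.78   # volume multiplies by 1.78 each year
dram_price = 0.67    # price per GB multiplies by 0.67 each year

def relative_cost(years: int) -> float:
    # Total cost = volume * price, so the yearly factor is their product.
    return (data_growth * dram_price) ** years

print(f"Year-over-year factor: {data_growth * dram_price:.2f}")  # ~1.19
print(f"After 3 years:  {relative_cost(3):.1f}x")   # ~1.7x more expensive
print(f"After 10 years: {relative_cost(10):.1f}x")  # ~5.8x more expensive
```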
Comment: You are bang on the point. After working with a couple of customers, I realized the same thing: nobody wants to process data they cannot analyze. For this whole stack I used Spark for streaming and Shark for querying, for two reasons: both run on a common stack, and Shark supports a hybrid in-memory and on-disk model, allowing you to cache the latest data and aggregates in memory. Overall this helped bridge the gap between small RAM and big data.
Comment: There are a number of approaches to solving this challenge. For businesses to adopt stream (fast) data will require a SQL interface, because business users have the expertise, the pool of employable people, the tools, etc. built around SQL. This was proven with Hadoop as well. Hadoop is doing some things to provide near-real-time access, but its underlying large-grain file structure is not conducive to streaming small bits of data and accessing it in a granular fashion.
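A minimal PySpark sketch of the hybrid memory-and-disk model the comment describes (the RDD contents and counts are illustrative, not actual click-stream data):

```python
# Hybrid caching sketch: MEMORY_AND_DISK keeps hot partitions in RAM and
# spills the rest to disk, so the working set isn't capped by cluster memory.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "hybrid-cache-demo")

# Pretend this RDD holds the most recent click-stream events (made-up data).
events = sc.parallelize([("Y", "X"), ("Y", "X"), ("Z", "X")])
events.persist(StorageLevel.MEMORY_AND_DISK)

counts = events.map(lambda e: (e, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [(('Y', 'X'), 2), (('Z', 'X'), 1)]
```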