Wednesday, January 8, 2014

Stream Processors and DBMS Persistence

High-Velocity Data—AKA Fast Data or Streaming Data—seems to be all the rage these days. With the increased adoption of Big Data tools, people have recognized the value contained in this data, and they are looking to get that value in real time instead of from a time-shifted batch process that often introduces a delay of six hours or more in time-to-value.

High-velocity data has all of the earmarks of a big technological wave. The technology leaders are building stream processors. Venture firms are investing money in stream processing companies. And existing tech companies are jumping on the bandwagon, associating their products with this hot trend to make them buzzword-compliant.

Some have asked whether high-velocity data will complement or replace Big Data. Big Data addresses pooled data, or data at rest. History tells us that there are different use cases and each will find its market. However, ceteris paribus, near real-time insights are far more valuable than delayed insights. For example, if a user is browsing a commerce website, it is much more valuable to process the click-stream data and make real-time recommendations than to send recommendations to that user in email six hours later. The same could be said for call centers, online games, and sensor data: pretty much any insight is more valuable the sooner you can get it and act upon it.
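To make the real-time alternative concrete, here is a minimal sketch of processing a click-stream in a sliding time window, so recommendations can be computed while the user is still on the site. The class name and event fields are hypothetical illustrations, not any particular stream processor's API.

```python
import time
from collections import defaultdict, deque

class ClickStreamWindow:
    """Keep recent click events in a sliding time window so simple
    recommendations can be made in real time (hypothetical sketch)."""

    def __init__(self, window_seconds=24 * 3600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, user, item), oldest first
        # per-user view counts over the current window
        self.views = defaultdict(lambda: defaultdict(int))

    def record(self, user, item, ts=None):
        ts = time.time() if ts is None else ts
        self.events.append((ts, user, item))
        self.views[user][item] += 1
        self._evict(ts)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            _, user, item = self.events.popleft()
            self.views[user][item] -= 1
            if self.views[user][item] == 0:
                del self.views[user][item]

    def top_items(self, user, n=3):
        # Items this user viewed most within the current window.
        return sorted(self.views[user], key=self.views[user].get, reverse=True)[:n]

# Usage: feed events as they arrive, query while the session is live.
w = ClickStreamWindow(window_seconds=24 * 3600)
w.record("user_y", "shoes", ts=0)
w.record("user_y", "shoes", ts=10)
print(w.top_items("user_y"))  # ['shoes']
```

A real deployment would shard this state across workers; the point is only that the answer is available seconds after the clicks, not hours.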

The early stream processors—including Twitter Storm, Yahoo S4, Google MillWheel, Microsoft StreamInsight, LinkedIn Samza, etc.—and their kissing cousins the Complex Event Processors—including Software AG Apama, LinkedIn Kafka, Tibco BusinessEvents, Sybase ESP, etc.—are now facing competition from Amazon “The Commoditizer”. Amazon’s offering is Kinesis. Not only does Amazon offer Kinesis as a service (no capital investment, no laborious set-up or management), it also streams the entire data set to S3, providing a moving 24-hour window of archived data.

Archiving the data in a file system is helpful, but not enough. Sure, you can sift through that data and “re-process” it, but what you really want is traditional DBMS capabilities. You want the ability to interact with the data by querying it in an ad hoc manner. You want to run those queries across the most complete dataset possible. It is one thing for a stream processor to run simple aggregates like counts, but ad hoc queries like “How many times did user Y do action X in the last 24 hours?” are far more valuable. Obviously, applying DBMS capabilities to streaming data is a huge benefit, but at what cost?
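That kind of ad hoc query is trivial once the stream lands in a SQL store. A minimal illustration, using an in-memory SQLite table as a stand-in for a DBMS persisting the stream (the schema and event rows are hypothetical):

```python
import sqlite3

# In-memory database stands in for a DBMS persisting the stream.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    ts      INTEGER,   -- Unix timestamp of the event
    user_id TEXT,
    action  TEXT
)""")

# A few "streamed" events (hypothetical data).
rows = [(1000, "Y", "X"), (2000, "Y", "X"), (3000, "Z", "X"), (90000, "Y", "X")]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# "How many times did user Y do action X in the last 24 hours?"
now = 90000
count, = conn.execute(
    "SELECT COUNT(*) FROM events WHERE user_id = ? AND action = ? AND ts >= ?",
    ("Y", "X", now - 24 * 3600),
).fetchone()
print(count)  # 1 -- only the event at ts=90000 falls inside the window
```

The query itself is unremarkable; the hard part, as the next paragraph shows, is sustaining the insert rate.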

Consider a stream processor handling one million data elements per second, each 100 bytes in size. Attempting to index and insert data at this velocity in a traditional database runs into serious trouble. The immediate answer is that one million inserts a second demands an in-memory DBMS. But now consider that the volume described above adds up to roughly 8.6TB of data per day (100MB per second × 86,400 seconds). If you were to store that in DRAM—according to Wikipedia—it would cost you $126,000 for the DRAM alone. That same data would cost only $276 on disk, a 456-TIMES cost advantage. This explains why Amazon Kinesis simply streams to a disk-based file system (S3) instead of using a DBMS.
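The back-of-envelope arithmetic behind those figures can be checked directly (the dollar figures are the post's 2014 numbers):

```python
# Daily volume of the stream described above.
bytes_per_sec = 1_000_000 * 100           # one million 100-byte elements/sec
per_day_tb = bytes_per_sec * 86_400 / 1e12
print(f"{per_day_tb:.2f} TB/day")         # 8.64 TB/day

# DRAM vs. disk cost for one day's data (post's 2014 figures).
dram_cost, disk_cost = 126_000, 276
print(f"{dram_cost // disk_cost}x")       # 456x
```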

Some might argue that the trend is their friend because DRAM is getting cheaper. Well, DRAM prices dropped about 33% per year until 2012, when they flat-lined and actually started increasing.
More importantly, data volumes have increased 78% per year and are projected to continue doing so (see IDC and Gartner). Even if we assume a 33% annual decrease in DRAM prices, compounding it against 78% data growth yields a net increase of about 19% per year (1.78 × 0.67 ≈ 1.19), making it, on a relative basis, roughly 1.7-TIMES more expensive to store streaming data in DRAM after 3 years and nearly 6-TIMES more expensive after 10 years. So no, the trend is not your friend when using DRAM for streaming data storage.
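The compounding works out as follows, assuming 78% annual data growth against an (optimistic) 33% annual DRAM price decline:

```python
# Net relative cost of keeping the stream in DRAM as data outgrows price declines.
data_growth = 1.78   # data volume grows 78% per year (IDC/Gartner figure)
dram_price  = 0.67   # DRAM $/GB falls 33% per year (optimistic assumption)

net = data_growth * dram_price      # relative cost change per year
print(f"per year: {net:.2f}x")      # 1.19x
print(f"3 years:  {net**3:.1f}x")   # 1.7x
print(f"10 years: {net**10:.1f}x")  # 5.8x
```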

ScaleDB has developed a Streaming Table technology that enables DBMS persistence of streaming data using standard disk-based storage. It delivers the performance of DRAM—one million inserts per second—but with the 456-TIMES cost advantage of disk-based storage. And because ScaleDB extends MySQL, the streamed data is immediately available to any BI application that supports MySQL. We believe that the 456-TIMES cost advantage of disk-based media is a game changer in bringing DBMS capabilities to bear on streaming data.


  1. You are bang on point; after working with a couple of customers I realised the same thing: nobody wants to process data they cannot analyze. For this whole stack I used Spark for streaming and Shark for querying, for two reasons: both run on a common stack, and Shark supports a hybrid in-memory and disk model that allows you to cache the latest data and aggregates in memory. Overall this helped bridge the gap between small RAM and big data.

  2. There are a number of approaches to solving this challenge. Broad business adoption of Stream (Fast) Data will require a SQL interface, because business users have the expertise, the pool of employable people, the tools, etc. built around SQL. This was proven with Hadoop as well. Hadoop is doing some things to provide near real-time access, but its underlying large-grain file structure is not conducive to streaming small bits of data and accessing them in a granular fashion.