Wednesday, January 8, 2014

Stream Processors and DBMS Persistence

High-Velocity Data—AKA Fast Data or Streaming Data—seems to be all the rage these days. With the increased adoption of Big Data tools, people have recognized the value contained in this data and they are looking to get that value in real-time instead of a time-shifted batch process that can often introduce a 6-hour (or more) delay in time-to-value.

High-velocity data has all of the earmarks of a big technological wave. The technology leaders are building stream processors. Venture firms are investing money in stream processing companies. And existing tech companies are jumping on the bandwagon, associating their products with this hot trend to make them buzzword compliant.

Some have asked whether high-velocity data will complement or replace Big Data. Big Data addresses pooled data, or data at rest. History tells us that there are different use cases and each will find its market. However, ceteris paribus, near real-time insights are far more valuable than delayed insights. For example, if a user is browsing a commerce website, it is much more valuable to process the click-stream data and make real-time recommendations than to send recommendations to that user in email six hours later. The same could be said for call centers, online games, and sensor data: pretty much all insight is more valuable the sooner you can get it and act upon it.

The early streaming processors (including Twitter Storm, Yahoo S4, Google MillWheel, Microsoft StreamInsight, LinkedIn Samza, etc.) and their kissing cousins the Complex Event Processors (including Software AG Apama, LinkedIn Kafka, Tibco BusinessEvents, Sybase ESP, etc.) are now facing competition from Amazon “The Commoditizer”. Amazon’s offering is Kinesis. Not only does Amazon offer Kinesis as a service (no capital investment, no laborious set-up or management), it also streams the entire data set to S3, providing a moving 24-hour window of archived data.

Archiving the data in a file system is helpful, but not enough. Sure, you can sift through that data and “re-process” it, but what you really want is traditional DBMS capabilities. You want the ability to interact with the data by querying it in an ad hoc manner. You want to run those queries across the most complete dataset possible. It is one thing for a stream processor to run processes like counts, but more complex ad hoc processes like “How many times did user Y do action X in the last 24 hours” are far more valuable. Obviously, applying DBMS capabilities to streaming data is a huge benefit, but at what cost?
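To make the ad hoc question above concrete, here is a minimal sketch of that query. It uses Python's built-in sqlite3 module as a stand-in for a real DBMS; the events table, its columns, and the sample rows are all hypothetical and exist only for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: one row per streamed event (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, ts TEXT)")

# A fixed "current time" keeps the example repeatable.
now = datetime(2014, 1, 8, 12, 0, 0)
sample_rows = [
    ("Y", "X", now - timedelta(hours=1)),
    ("Y", "X", now - timedelta(hours=23)),
    ("Y", "X", now - timedelta(hours=30)),  # outside the 24-hour window
    ("Y", "Z", now - timedelta(hours=2)),   # different action
    ("W", "X", now - timedelta(hours=3)),   # different user
]
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(user, action, ts.isoformat()) for user, action, ts in sample_rows],
)

def count_actions(conn, user_id, action, since):
    """How many times did user_id do action at or after `since`?"""
    cur = conn.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE user_id = ? AND action = ? AND ts >= ?",
        (user_id, action, since.isoformat()),
    )
    return cur.fetchone()[0]

# ISO-8601 timestamps in one format sort correctly as plain strings.
recent_count = count_actions(conn, "Y", "X", now - timedelta(hours=24))
print(recent_count)
```

The point is that the question is a one-line query once the stream lands in a table, whereas answering it from archived files means re-reading and re-processing the archive.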

Consider a stream processor handling one million data elements per second (each 100 bytes in size). Attempting to index and insert data of this velocity in a traditional database runs into serious trouble. The immediate answer is that one million inserts a second demands an in-memory DBMS. But now consider that the volume described above adds up to 8.4TB of data per day! If you were to store that in DRAM—according to Wikipedia—it would cost you $126,000 for the DRAM alone. That same data would only cost $276 in disk, a 456-TIMES cost advantage. This explains why Amazon Kinesis is simply streaming to a disk-based file system (S3) instead of using a DBMS.
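The arithmetic behind these figures can be checked in a few lines. This is only a sketch: the dollar amounts are taken from the paragraph above as given, and the per-GB prices are merely what those amounts imply for an 8.4TB volume, not independently sourced prices. Straight decimal arithmetic gives about 8.64 TB per day, in the same range as the 8.4TB quoted above.

```python
# Stream parameters from the text.
EVENTS_PER_SECOND = 1_000_000
BYTES_PER_EVENT = 100
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = EVENTS_PER_SECOND * BYTES_PER_EVENT * SECONDS_PER_DAY
tb_per_day = bytes_per_day / 1e12  # decimal terabytes

# Cost figures quoted in the text (taken as given, not re-derived).
DRAM_COST_USD = 126_000
DISK_COST_USD = 276
cost_ratio = DRAM_COST_USD / DISK_COST_USD

# Per-GB prices implied by those figures for the quoted 8.4TB volume.
QUOTED_VOLUME_GB = 8_400
dram_usd_per_gb = DRAM_COST_USD / QUOTED_VOLUME_GB
disk_usd_per_gb = DISK_COST_USD / QUOTED_VOLUME_GB

print(f"{tb_per_day:.2f} TB/day, DRAM/disk cost ratio {cost_ratio:.1f}")
print(f"implied DRAM ${dram_usd_per_gb:.2f}/GB, disk ${disk_usd_per_gb:.4f}/GB")
```

Because both costs scale linearly with volume, the DRAM-to-disk ratio stays the same whichever daily volume figure is used.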

Some might argue that the trend is their friend, because DRAM is getting cheaper. Well, DRAM prices had dropped about 33% per year until 2012, when they started flat-lining and actually increasing. More importantly, data volumes have increased 78% per year and are projected to continue doing so (see IDC and Gartner). With compounding, even if we assume a 33% annual decrease in DRAM prices, the growth in data makes it, on a relative basis, 3-TIMES more expensive to store data in DRAM in 3 years and 18-TIMES more expensive in 10 years. So no, the trend is not your friend when using DRAM for streaming data storage.

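One way to sanity-check the compounding argument is the sketch below. It assumes the most direct reading of the two rates: each year the data volume is multiplied by 1.78 and the DRAM price per byte by 0.67, so the bill for holding everything in DRAM changes by the product of the two. The function name and defaults are illustrative only.

```python
def relative_dram_cost(years, data_growth=0.78, price_decline=0.33):
    """Factor by which the total DRAM bill changes after `years`.

    Assumes data volume grows by `data_growth` per year while the price
    per byte falls by `price_decline` per year, both compounding annually.
    """
    per_year = (1 + data_growth) * (1 - price_decline)
    return per_year ** years

three_year_factor = relative_dram_cost(3)
ten_year_factor = relative_dram_cost(10)

print(f"after 3 years:  {three_year_factor:.2f}x")
print(f"after 10 years: {ten_year_factor:.2f}x")
```

Under this reading the bill grows roughly 1.7-fold in 3 years and 5.8-fold in 10. That is lower than the multiples quoted above, which may rest on a calculation the post does not show, but the direction is the same: data growth outruns the price decline, so the total cost keeps rising.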
ScaleDB has developed a Streaming Table technology that enables DBMS persistence of streaming data using standard disk-based storage. It delivers the performance of DRAM (one million inserts per second) but with the 456-TIMES cost advantage of using disk-based storage. Since ScaleDB extends MySQL, this means that the streamed data is now available to any BI applications that support MySQL. We believe that the 456-TIMES cost advantage enabled by disk-based media is a game changer in bringing DBMS capabilities to bear on streaming data.

Friday, July 12, 2013

Why You Should Embrace Database Virtualization

This article addresses the benefits provided by database virtualization. Before we proceed, however, it is important to explain that database virtualization does NOT mean simply running a DBMS inside a virtual machine.

Database Virtualization, More Than Running a DBMS in a Virtual Machine
While running a DBMS in a VM can provide advantages (and disadvantages), it is NOT database virtualization. Typical databases fuse together the data (or I/O) with the processing (CPU utilization) to operate as a single unit. Simply running that single unit in a VM does not provide the benefits detailed below. That is not database virtualization; it is merely server virtualization.

An Example of the Database Virtualization Problem
Say you have a database handling banking and I have $10MM in the bank (I wish). Now let’s assume that the bank is busy, so it bursts that database across 3 VM nodes in typical cloud style. Now each of those 3 nodes gets a command to wire out the full $10MM. Each node sees its balance at $10MM, so each one wires out the full amount, for a total wire transfer of $30MM…see the problem? In order to dynamically burst your database across nodes, you need a distributed locking mechanism so that all nodes see the same data and can lock other nodes from altering the same data independently. This sounds easy, but making it perform well is a massive undertaking. Only two vendors have solved this problem: Oracle (with RAC) and ScaleDB (for MySQL or MariaDB).
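The wire-transfer problem can be sketched in a few lines. This is a toy model, not how Oracle RAC or ScaleDB implement locking: three threads stand in for the three nodes, and an ordinary in-process threading.Lock stands in for a distributed lock manager. All names here are hypothetical.

```python
import threading

FULL_BALANCE = 10_000_000

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()  # stand-in for a distributed lock

def wire_from_stale_reads(account, node_count, amount):
    """Each node reads the balance before any node writes, then all act on it."""
    balances_seen = [account.balance for _ in range(node_count)]
    wired = 0
    for balance_seen in balances_seen:
        if balance_seen >= amount:  # every node believes the funds are there
            account.balance -= amount
            wired += amount
    return wired

def wire_with_lock(account, amount, wired_amounts):
    """Check and debit happen as one step while the lock is held."""
    with account.lock:
        if account.balance >= amount:
            account.balance -= amount
            wired_amounts.append(amount)

# Without coordination: three nodes, three full wires.
unsafe_total = wire_from_stale_reads(Account(FULL_BALANCE), 3, FULL_BALANCE)

# With the lock: only the first node to acquire it finds sufficient funds.
safe_account = Account(FULL_BALANCE)
wired_amounts = []
nodes = [
    threading.Thread(target=wire_with_lock,
                     args=(safe_account, FULL_BALANCE, wired_amounts))
    for _ in range(3)
]
for node in nodes:
    node.start()
for node in nodes:
    node.join()
safe_total = sum(wired_amounts)

print(unsafe_total, safe_total)
```

The hard part the paragraph refers to is not the lock itself but making it fast when the lock holder and the data live on different machines.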

Defining Database Virtualization
  • It should enable the application to talk to a single virtual instance of the database, when in fact there are N actual nodes acting over the data.
  • It should separate the data processing (CPU) from the data (I/O) so that each can scale on demand and independently from the other.
  • It should enable the actual processing of the data to be distributed to the various nodes on the storage tier (function shipping) to achieve maximum performance. Note: in practice, this is similar to MapReduce.
  • It should provide tiered caching, for performance, but also ensure cache coherence across the entire cluster.
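As a rough illustration of the function-shipping point in the list above, the sketch below sends a small counting function to each storage node and combines the partial results, rather than pulling every row back to one place. The node layout, row format, and function names are hypothetical, and a thread pool stands in for real storage nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical storage tier: each inner list is the partition held by one node.
STORAGE_NODES = [
    [("alice", 120), ("bob", 40)],
    [("carol", 75), ("dave", 300)],
    [("erin", 10), ("frank", 90)],
]

def count_large_orders(partition, threshold):
    """Runs "at the storage node": scans local rows, returns one small number."""
    return sum(1 for _, amount in partition if amount >= threshold)

def shipped_count(nodes, threshold):
    """Ship the function to every node in parallel, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(
            lambda partition: count_large_orders(partition, threshold), nodes
        )
        return sum(partials)

total_large = shipped_count(STORAGE_NODES, 75)
print(total_large)
```

Only one integer per node crosses the network here, which is the same shape as a map step followed by a reduce step.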

Benefits of Database Virtualization

Higher Server Utilization: When the data is fused to the CPU, as a single unit, that one node is responsible for handling all usage spikes for its collection of data. This forces you to split the data thinly across many servers (silos), forcing you to run each server at a low utilization rate. Database Virtualization decouples the data from the processing so that the spike in usage can be shared across many nodes on the fly. This enables you to run a virtualized database at a very high utilization rate.

Reduced Infrastructure Costs: Database virtualization enables you to use fewer servers, less power, fewer OS, tool and application licenses, and fewer network switches and less storage, among other things.

Reduced Manpower Costs: Database virtualization simplifies the DBA’s job: it uses only one schema and no sharding, and it simplifies backup processes, enabling the DBA to handle more databases. It reduces the application developer’s job because it eliminates code related to sharding, e.g. database routing, rebuilding relationships between shards (e.g. joins), and more. It also simplifies the network admin’s job because he manages fewer servers and they are identical.

Reduced Complexity: You only have a single database image, so elastically scaling up/down is simple and fast.

Increased Flexibility: Database virtualization brings the same flexibility to the database that server virtualization brings to the application tier. Resources are allocated and reallocated on the fly. If your usage profile changes, e.g. payroll one day, benefits the next, a virtual database uses the same shared infrastructure for any workload, while a traditional database does not.

Quality of Service: Since database images can move on the fly, without downtime, a noisy neighbor or noisy network is solved by simply moving the database to another node in your pool.

Availability: Unlike a traditional database, virtualized database nodes see all of the data, so they inherently provide failover for one another, addressing unplanned downtime. With regard to planned downtime, simply move the process to another server and take down the one that needs service, again without interruption.

Improved Performance: Because the pooled cache across the storage tier uses a Least Recently Used (LRU) algorithm, it can free up huge amounts of pooled cache for the then-current workload, enabling near in-memory performance. Also, as mentioned above, the distribution of processing to the storage tier enables high-performance parallel processing.
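For readers unfamiliar with LRU eviction, here is a minimal single-process sketch of the idea. It is not ScaleDB's pooled cache, which spans a storage tier and must stay coherent across nodes; the class and the page names are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Keeps at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest use first, newest use last

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used

cache = LRUCache(capacity=2)
cache.put("payroll_page", "p1")
cache.put("benefits_page", "b1")
cache.get("payroll_page")        # payroll is now the most recently used
cache.put("orders_page", "o1")   # cache is full, so benefits_page is evicted
print(list(cache.entries))
```

As the workload shifts, pages belonging to yesterday's workload age out on their own, which is how the cache ends up devoted to whatever is running now.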

True database virtualization delivers a huge set of advantages that in many ways mirror the benefits server virtualization provides to applications. For this reason, we expect database virtualization to be the next big thing, following in the footsteps of server, storage and network virtualization.


Saturday, July 6, 2013

Don't Fall for the Fake Loan Fraud

We have been approached by a Sheikh (claimed to be Sheikh A. R. Khalid Bin Mahfouz) and an investment banker out of London (who claimed his name is Harry Holt). But they change names faster than you change your underwear. Both were very excited to invest in the company. The Sheikh wanted an equity investment, but we had to set up a bank account in Asia somewhere first, which would have required a minimum deposit. The “investment banker” needed his upfront money for the attorney to draft the agreement. I think he even had a real attorney who said she did need money before she would draft any agreement.

What to look for in a fake loan scam:

  1. Minimal due diligence, eager to invest large sums of money
  2. They use a generic email address (Yahoo, Gmail, etc.) not tied to a company
  3. They have little to no Internet footprint (LinkedIn, search, etc.)
  4. They have minimal if any documents/brochures. The one from “Harry” was amusing. I copied sections of its text and found them word for word on various investment and VC websites. It was a plagiarized patchwork quilt of other websites.
  5. The term sheet from “Harry” looked pretty good because it was word for word from a book on the Internet (Note: Search for this line, on Google: “Bullet repayment at the Redemption Price, plus any and all accrued but unpaid interest at the Maturity Date, subject to the mandatory prepayment provisions”)
  6. You can ask for references (which they will ignore or maybe give fake ones) but don’t waste your time.

If you want to check deeper, look into the email headers. If you are using Gmail, just click the drop-down on the right and select “Show original” to see the headers. If you are contacted for one of these fake loan/investment scams, copy the headers and send them to the email address posted at the Interfraud site so they can have the sender’s email shut down.

Then you can copy their IP address and search for it. It will probably show up in Project Honeypot.
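If you would rather not pick through the headers by eye, a short script can pull out the IP addresses for you. This is a sketch using Python's standard email module; the sample message, host names, and addresses are made up (the IPs come from ranges reserved for documentation), and the bracketed-IPv4 pattern is an assumption about how the relays format their Received lines.

```python
import re
from email import message_from_string

# Made-up message for illustration; paste the "Show original" text here instead.
RAW_MESSAGE = """\
Received: from mail.example.net (mail.example.net [203.0.113.7])
\tby mx.example.org with ESMTP id abc123;
\tSat, 6 Jul 2013 10:00:00 -0700
Received: from unknown (HELO sender-pc) ([198.51.100.23])
\tby mail.example.net with SMTP; Sat, 6 Jul 2013 09:59:58 -0700
From: "Harry" <harry@example.com>
Subject: Investment opportunity

Body text.
"""

# Relays commonly record the connecting address as [a.b.c.d].
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def received_ips(raw_message):
    """Return bracketed IPv4 addresses from Received headers, newest hop first."""
    message = message_from_string(raw_message)
    ips = []
    for header in message.get_all("Received", []):
        ips.extend(IPV4_IN_BRACKETS.findall(header))
    return ips

ips = received_ips(RAW_MESSAGE)
# Each relay adds its header at the top, so the last one is the earliest hop.
origin_ip = ips[-1] if ips else None
print(origin_ip)
```

Keep in mind that a sender can forge the lower Received lines, so treat the earliest hop as a lead to search on rather than as proof.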

If you still aren’t sure because you want to believe that someone fell in love with you or your company and cannot wait to invest, inform them that: “My board of directors will not disburse any upfront money for any purpose, or set up any bank account. Any fees MUST be paid out of proceeds from your investment and payment will occur no sooner than 30 days after the wire clears our account.” That should cause them to decide to pass on your opportunity because you are “unreasonable”.

It is truly despicable that this kind of heartless scum plays on the hopes and dreams of entrepreneurs to extract their money. But I guess they wouldn’t play this game if there weren’t suckers out there falling for it regularly enough. Don’t be a sucker.

Wednesday, June 5, 2013

Problems with Open Source: Part 2

In my prior post on the problems with open source, I wrote that one issue that impacts open source revenues is the macro economy, and how a declining or difficult macro economy can result in reduction of revenues to open source companies. The following article talks about how financially troubled Spain is saving a "fortune" by moving to open source. The Spanish government's savings are coming at the expense of proprietary server software companies--most likely Microsoft--but I would be willing to bet that none of this "savings" is flowing to the open source vendors. That is what happens in a difficult macro economy.

Thursday, May 30, 2013

Problems with Open Source

Monty Widenius wrote about the problems with the open source model, or more specifically the problems he is experiencing with his open source project MariaDB. In a nutshell, it lacks two things: (1) developers committing code; (2) users paying. He then focuses primarily on #2, lack of paying customers.

I believe that Monty’s concerns are the result of a number of factors: 

  1. Maturity (coolness factor): When a product is new and cool, developers want to work on it and customers get in the spirit and want to pay for it to continue to evolve. But once it becomes mature…eh not so much.
  2. Maturity (downstream revenues): When a product is new and cutting-edge, “experts” make a ton of money. Look at Hadoop experts now. But as it becomes mainstream, the experts are making far less and feel less charitable toward their respective open source project.
  3. Maturity (market adoption): When you are one of the few early adopters of an open source project you may be more charitable toward the company in an effort to see it survive. Once it gains universal appeal, you figure that the rest of the people will pay so you don’t need to…in other words, “they are a success now, no need to continue funding them.”
  4. Macro Economy: If the macro economy is tight, as it is now, and companies are looking for where to cut, it is easier to cut funding to an “optional donation” than to cut one more individual. This is similar to the “downstream revenues” issue above but at the company level.

Open source projects follow a cycle, just like most everything in life. Commercial products achieve peak revenues with maturity and broad adoption. I believe that open source projects are the inverse: with maturity comes a decline in revenues. Ironically, it could well be that success is dangerous to a company's health.

Monty has some interesting ideas on separating "free" from open source, at least to some degree.

Monday, May 6, 2013

Large Database

Just a heads-up that we have added a Large Database page on ScaleDB that talks about many of the issues facing people trying to implement a large database, such as design, backup/restore, to index or not to index, and much more. Enjoy.

Thursday, May 2, 2013

Thoughts on Xeround and Free!

Everybody loves free. It is the best marketing term one could use. Once you say “FREE” the people come running. Free makes you very popular. Whether you are a politician offering something for free, or a company providing free stuff, you gain instant popularity.

Xeround is shutting down their MySQL Database as a Service (DBaaS) because their free instances, while popular, simply did not convert into sufficient paid instances to support the company. While I am sad to see them fail, because I appreciate the hard work required to deliver database technology, this announcement was not unexpected.

My company was at Percona Live, the MySQL conference, and I had some additional conversations along these same lines. One previously closed source company announced that they were open sourcing their code, and it was a very popular announcement. A keynote speaker mentioned it and the crowd clapped excitedly. Was it because they couldn’t wait to edit the code? Probably not. Was it because now the code would evolve faster? Probably not, since it is very low-level and niche oriented, and there will be few committers. No, I think it was the excitement of “free”. The company was excited about a 49X increase in web traffic, but had no idea what the impact would be on actual revenues.

I spoke with another company, also a low-level and niche product, and they have been open source from the start. I asked about their revenues; they are essentially non-existent. Bottom line is that the plan was for them to make money on services…well Percona, Pythian, SkySQL and others have the customer relationships and they scoop up all of the consulting and support revenue while this company makes bupkis. I feel for them.

I had a friend tell me that ScaleDB should open source our code to get more customers. Yes, open source gets you a lot of free users…not customers. It is a hard path to sell to your first 10…25…50…etc. customers, but the revenue from those customers fuels additional development and makes you a fountain of technology. Open source and free are great for getting big quickly and getting acquired, but it seems that if the acquisition doesn’t happen, then you can quickly run out of money using this model (see Xeround).

I realize that this is an unpopular position. I realize that everybody loves free. I realize that open source has additional advantages (no lock-in, rapid development, etc.), but in my opinion, open source works in only two scenarios: (1) where the absolute volume is huge, creating a funnel for conversion (e.g. Linux); (2) where you need to unseat an entrenched competitor and you have other sources of revenue (e.g. OpenStack).

I look forward to your comments. We also look forward to working with Xeround customers who are looking for another solution.