Friday, February 26, 2010

Will the NoSQL Movement Unseat the Database Behemoths?

With the introduction of each new platform comes the opportunity for new thinking, new applications and new winners. DEC and Oracle were beneficiaries of the move to the minicomputer. Microsoft was the main beneficiary of the move to the PC. Sun rode the workstation to fame. Today’s exciting new platform is the cloud, and one of the upstart contenders is NoSQL.

One might argue that the cloud is merely the hosting of well-established platforms such as the PC. Larry Ellison has made this very claim. However, the cloud is very different.

How is the cloud different? Sometimes when you combine things, the combination is very different from its components. For example, salt (NaCl) is very different from its poisonous individual components. Cloud computing enjoys a similar combinatory effect. Sure, it is merely a mixture of PC platforms, virtualization, lots of Linux and low-cost scalable disk arrays. But the combination is more about dynamic on-demand elasticity, elimination of capital expense, instant access to compute resources (versus slow hardware requisitioning), reduced IT headcount hassles, etc. In other words, cloud computing is no longer about the components; it is about changing how we think about and use computing resources. It is a new paradigm for the consumption of computing resources.

With this new paradigm comes a new mentality. Cloud developers expect all aspects of the cloud to scale dynamically. This is where the shared-nothing SQL database comes up short. It is also where the NoSQL option excels.

We in the SQL world could easily dismiss NoSQL, saying NoSQL = NoEnterprise. How can you build a real application on something that doesn’t offer transactions, data consistency, SQL, etc.? Real database people turn up their noses at those little key-value pair NoSQL toys. Not so fast.

SimpleDB just fired a shot across the bow of the database big boys with forced consistency. Sure, you pay a price for this, and it should only be invoked when it is truly needed, but the point is you CAN do it. The history of technology is littered with the bodies of high-end products that were cannibalized from below, as lighter-weight platforms won the price/volume game. Cloud will definitely win the price/volume game; you simply cannot beat the economics. The question is who will win the cloud database war.

NoSQL databases (e.g. Cassandra, SimpleDB, BigTable, CouchDB and MongoDB) will continue to nibble away at the rationale for sticking with big SQL databases. As the leading web database, MySQL became the de facto cloud database, since web and Web 2.0 applications were the early adopters of the cloud. But MySQL cannot rest on its laurels. NoSQL solutions are nipping at MySQL’s heels and their dynamic elasticity is quite appealing.

Now enterprise customers are beginning to move to the cloud. At the same time, NoSQL solutions are adding capabilities once reserved for relational databases. This raises a LOT of questions:

1. Will NoSQL undermine its scalability as it adds more enterprise capabilities? (Will these extensions bolt on smoothly, or will they result in an awkward and ultimately unscalable Frankenstein?)

2. Will the big SQL database vendors continue to dismiss NoSQL solutions as toys, or will they see them for the threat they are becoming? (Should we expect the commercial database vendors to start buying NoSQL solutions?)

3. Will MySQL be the first to succumb to the NoSQL onslaught? (Did Oracle just buy yesterday’s cloud database leader?)

4. Will a third-party candidate like ScaleDB, with its shared-disk architecture, win with a “best of both worlds” approach that scales dynamically and provides enterprise SQL capabilities?

5. Will SQL and NoSQL co-exist as different tools for different problems, or will they evolve into direct competitors across most major segments?

My Thoughts:
At the moment, SQL databases and NoSQL are different tools for different problems. I think this will remain the case, but I believe that NoSQL will spread its reach by adding capabilities that begin to eat into traditional relational database segments. I suspect that the large commercial database companies, after ignoring NoSQL for too long, will resort to buying some NoSQL products and integrating them into their product portfolios. Companies focused solely on worldwide scalability, like Google, will remain wedded to NoSQL, because any technology that doesn’t scale to 10,000 servers is a non-starter. Enterprises will take a “right tool for the job” approach, employing all of the above.

NoSQL and map-reduce technologies will excel in non-transactional roles like data warehouses, business intelligence (DW/BI). In the OLTP space, SQL databases will remain far more prominent. However, the pain of dynamically scaling shared-nothing databases (and sharding is a pain) will create a need for dynamically elastic shared-disk databases like ScaleDB. The sweet spot for shared-disk probably peaks at about 80-100 database servers. This level of scaling should be sufficient for all but the largest companies. Beyond that, NoSQL (with few or no scale-limiting constraints like forced consistency) will be the only option.
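
To make the sharding pain concrete, here is a minimal sketch of naive hash-modulo sharding. It is an illustration only, not how MySQL, ScaleDB or any particular product places data; the function names and the `user:` key format are hypothetical. It counts how many keys land on a different shard when a shared-nothing cluster grows from 4 to 5 servers.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Place a key on a shard using simple hash-modulo sharding (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Return the fraction of keys whose shard changes after resizing the cluster."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical workload: 10,000 user records keyed by id.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys change shards when growing from 4 to 5 servers")
```

With uniform hashing, a key stays put only when its hash gives the same remainder for both 4 and 5, which happens about one time in five, so roughly 80% of the rows have to be physically moved. Smarter placement schemes reduce that number, but some data always has to move in a shared-nothing design, whereas in a shared-disk design every node already sees all of the data.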

I would love to hear your thoughts in the comments section below…


  1. "NoSQL and map-reduce technologies will excel in non-transactional roles like data warehouses, business intelligence (DW/BI)."

    Well maybe...but I can only see this fly in the ETL part of the process...reporting and ROLAP tools don't change overnight to suddenly run their queries on a map/reduce based back-end. For now you're still going to create datamarts like you used to: baking star or snowflake schemas based on relational databases.

  2. What are you guys doing to secure a deal with Rackspace (rhetorical)? RackspaceCloud manages master-slave pairs of MySQL nodes that have pretty hard scale-up caps. Don't get greedy with the licensing and have a sustainable business with a world-class customer that actually needs you. Those WordPress installs are not moving off of MySQL!

    There is an opportunity to get in the door with their scale-out efforts in the Castle datacenter that will be closing soon. Get moving!

  3. This comment has been removed by a blog administrator.

  4. Roland, Good point, the installed base of tools (not to mention developers) creates a tremendous inertia against moving to NoSQL. I didn't mean to imply that they would dominate the SQL alternatives in DW/BI. In fact, companies like Greenplum and Aster Data are integrating map-reduce capabilities under the covers.

    Tlperkins, Thank you for your insight. We do not discuss partnerships that may or may not be in the works. That said, you are correct, the fit is quite good.

  5. How does cache locality affect ScaleDB? Suppose we have N compute instances: would I want to direct a subset of the workload to a subset of the compute instances, or is network locality good enough that I can distribute across all N?

  6. ScaleDB enables you to use any node to satisfy any query and all data is available to each node. This attribute is ideal for fail-over, because no nodes represent a single point of failure and by losing nodes there is no expected loss in data accessibility. That said, there are certain advantages to routing selected db requests to certain nodes. This sort of locality can improve cache hit performance. For example, if an application has knowledge about the data it accesses and uses particular node(s) to only access that data, then the data will continue to reside in the cache of the node used and will help performance (as opposed to a fresh node having to read the data). In the case of reads, this can be particularly useful because a node can fully cache table data for great performance. In the case of data being updated, the node may need to get the latest version and apply locking (unless the node being used was the last to lock the data). Again, performance increases could be gained by applications having SQL statement affinity to particular nodes.

    Application-driven locality is a tuning option in the shared-disk database. It is more flexible and easier to implement than partitioning a shared-nothing database. At the same time, you have the option to fail over to any node, or to redirect to any node if you are overloading a specific locality. In short, locality is not necessary, but it can be used as a very flexible performance tuning tool.

  7. What is surprising here is that there is no mention of Windows Azure Storage, which also allows application developers to store their data in the cloud, so the application can access its data from anywhere at any time, store any amount of data for any length of time, and be confident that the data is durable and will not be lost. Windows Azure Storage provides a rich set of data abstractions, including Windows Azure Table, which is the structured storage provided by the Windows Azure platform. It supports massively scalable tables in the cloud, which can contain billions of entities and terabytes of data. Windows Azure Tables support LINQ, ADO.NET Data Services and RESTful queries, so you could leverage the NoSQL model.

    Saying NoSQL = NoEnterprise arguably would be a moot point today, especially given the fact that large-scale distributed production systems, such as Facebook and Flickr, have been built on the movement to NoSQL. (Facebook has built Cassandra, a peer-to-peer data store with a BigTable-like data model built on a Dynamo-like infrastructure. Flickr uses a MySQL sharding approach that provides a hybrid SQL/NoSQL approach to scalability by partitioning the data across many servers; the data must therefore be prepartitioned, just like in SimpleDB.)

  8. The thinking behind NoSQL = NoEnterprise has nothing to do with raw scaling, a la Facebook. Instead it is a reference to the shortcomings in NoSQL with regard to consistency, transactions, use of structured tabular data that can be accessed by other applications (e.g. reporting), etc.
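
The application-driven locality described in comment 6 could be sketched roughly as follows. This is a hypothetical application-side router, not a ScaleDB API: the class name, node names and the idea of a "locality key" (for example, a table name) are all assumptions made for illustration. It sends requests with the same key to the same node so that node's cache stays warm, and falls back to any other healthy node, which is only safe because in a shared-disk cluster every node can serve every query.

```python
import hashlib


class AffinityRouter:
    """Hypothetical router: prefer one node per locality key, fail over to any other."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)

    def mark_down(self, node):
        """Record that a node is unavailable."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Record that a known node is available again."""
        if node in self.nodes:
            self.healthy.add(node)

    def route(self, locality_key: str) -> str:
        """Return the preferred node for the key, or the next healthy node after it."""
        if not self.healthy:
            raise RuntimeError("no healthy database nodes available")
        digest = hashlib.md5(locality_key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        # Walk the node list from the preferred position until a healthy node is found.
        for offset in range(len(self.nodes)):
            candidate = self.nodes[(start + offset) % len(self.nodes)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy database nodes available")


# Assumed three-node cluster; node names are placeholders.
router = AffinityRouter(["db1", "db2", "db3"])
preferred = router.route("orders")   # all "orders" traffic goes to one node
router.mark_down(preferred)          # simulate losing that node
fallback = router.route("orders")    # traffic moves to another node, no data is lost
print(preferred, "->", fallback)
```

Because locality here is only a preference, removing it or changing the key scheme never affects correctness, only cache hit rates, which is the sense in which it works as a tuning option rather than a partitioning requirement.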