Friday, November 27, 2009

Virtual Databases: The Face of the New Cloud Database

Shared-disk databases can be virtualized—making them cloud-friendly—while shared-nothing databases are tied to a specific computer and a specific data set or data partition.

The underlying principle of the shared-nothing RDBMS is that a single master server owns its specific set of data. That data is not shared, hence the name shared-nothing. Because the data cannot be shared, the computing on that data cannot be virtualized either. Instead, the shared-nothing RDBMS ties the data and the computing to a specific computer. This association with a physical machine is then reinforced at the application level. Applications that use a shared-nothing database partitioned across more than one server rely on routing code. Routing code simply directs each database request to the server that owns the requested data. In other words, the application must know which server owns which piece of data. This further reinforces the mismatch between shared-nothing databases and virtualization.
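
To make the idea of routing code concrete, here is a minimal sketch in Python. The server names, the hash-based partitioning scheme, and the function names are all assumptions made for illustration; a real application might use range or lookup-table partitioning instead.

```python
import zlib
from typing import Tuple

# Hypothetical shard map: each server owns exactly one partition of the data.
SHARD_SERVERS = ["db-server-0", "db-server-1", "db-server-2"]


def server_for_key(customer_id: str) -> str:
    """Return the one server that owns the partition holding customer_id."""
    # crc32 gives the same value in every process, so every application
    # instance agrees on which server owns which key.
    partition = zlib.crc32(customer_id.encode("utf-8")) % len(SHARD_SERVERS)
    return SHARD_SERVERS[partition]


def route_request(customer_id: str, sql: str) -> Tuple[str, str]:
    """Pair a request with the only server that is able to answer it."""
    return server_for_key(customer_id), sql
```

Note that the server list is baked into the application. Adding or removing a server changes the modulus, which changes the owner of most keys, and that is one way the tie to specific machines shows up in practice.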

This is not to say that it is impossible to virtualize a shared-nothing database. As any software architect will tell you, “You can do anything in software…” The second part of that statement is “…but it may not perform or scale well, and it may make maintenance very painful.” That second part is exactly what you will find with any effort to virtualize a shared-nothing database. Attempts to insert layers of indirection add complexity that makes maintenance a nightmare. Finding bugs, tuning performance, and recovering from failure are all severely compounded when you introduce layers of indirection in a shared-nothing database.

Performance, and hence scalability, is also undermined in this model. In order to support dynamic virtualization, you must mediate the requests from the application before they hit the database. This requires a piece of middleware that sniffs each database request and routes it to the appropriate server. What happens when a database request spans multiple servers? Suffice it to say it isn’t pretty, and it doesn’t perform well. This sort of request results in a lot of data shipping and cross-server joins. The bottom line is that partitioning your database to achieve performance, scalability, and maintainability is a black art; all attempts to automate this process have failed.
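
The cost of a request that spans servers can be illustrated with a toy model. In this sketch the two "servers" are plain in-memory dictionaries, and the table names, rows, and the `cross_server_totals` function are hypothetical. It only counts the rows that would have to be shipped to the middleware before the join can run; it does not model network latency or a real query planner.

```python
from collections import defaultdict

# Hypothetical stand-ins for two shared-nothing servers. Customers and
# their orders do not always live on the same server.
SERVERS = {
    "server-a": {
        "customers": [{"id": 1, "name": "Ann"}],
        "orders": [{"customer_id": 2, "total": 30}],
    },
    "server-b": {
        "customers": [{"id": 2, "name": "Bob"}],
        "orders": [
            {"customer_id": 1, "total": 10},
            {"customer_id": 2, "total": 5},
        ],
    },
}


def cross_server_totals(servers):
    """Join orders to customers in the middleware and total them per name."""
    rows_shipped = 0
    names = {}
    totals = defaultdict(int)
    for data in servers.values():
        # Every candidate row leaves its home server to reach the join.
        for customer in data["customers"]:
            names[customer["id"]] = customer["name"]
            rows_shipped += 1
        for order in data["orders"]:
            totals[order["customer_id"]] += order["total"]
            rows_shipped += 1
    result = {names[cid]: total for cid, total in totals.items()}
    return result, rows_shipped
```

With this sample data, all five rows cross the network to produce a two-row answer. The ratio gets worse as tables grow, which is the data shipping problem described above.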

Compare this to the shared-disk DBMS. Shared-disk separates the compute from the storage. The data is stored in one big trough, while you can have any number of compute instances feeding on the entirety of that data. Because each node has access to all of the data, you don't need any middleware to route the database requests to specific servers. Furthermore, each of the compute nodes is identical, making them virtualization-friendly. If one node fails, the others recover the transactions, while the application continues uninterrupted. You can also add nodes on the fly, again without interrupting the application. For these reasons, the shared-disk RDBMS is ideal for virtualization, while the shared-nothing RDBMS is anathema to virtualization.
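
The contrast can be sketched as a toy model in Python. The class names (`SharedStorage`, `ComputeNode`, `Cluster`) are hypothetical, and the sketch deliberately leaves out what a real shared-disk DBMS must do to make this safe, such as lock management, cache coherence, and transaction recovery. It only shows that no routing by data ownership is needed when every node sees all of the data.

```python
class SharedStorage:
    """The single pool of data that every compute node can reach."""

    def __init__(self):
        self.rows = {}


class ComputeNode:
    """An interchangeable compute instance; it owns no data of its own."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.alive = True

    def read(self, key):
        return self.storage.rows.get(key)

    def write(self, key, value):
        self.storage.rows[key] = value


class Cluster:
    """Hands any request to any live node; there is no shard map."""

    def __init__(self, storage):
        self.storage = storage
        self.nodes = []

    def add_node(self, name):
        # A new node is useful immediately because it sees all of the data.
        node = ComputeNode(name, self.storage)
        self.nodes.append(node)
        return node

    def pick_node(self):
        for node in self.nodes:
            if node.alive:
                return node
        raise RuntimeError("no live compute nodes")
```

A short usage example, under the same assumptions:

```python
cluster = Cluster(SharedStorage())
node_a = cluster.add_node("node-a")
node_b = cluster.add_node("node-b")

cluster.pick_node().write("order-1", "shipped")
node_a.alive = False                     # simulate a node failure
survivor = cluster.pick_node()           # node-b takes over
value = survivor.read("order-1")         # the data is still reachable
node_c = cluster.add_node("node-c")      # add capacity on the fly
```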

This is an excerpt from a white paper I'm writing that addresses virtualized cloud databases.
