Database Virtualization: More Than Running a DBMS in a Virtual Machine
While running a DBMS in a VM can provide advantages (and disadvantages), it is NOT database virtualization. A typical database fuses the data (I/O) with the processing (CPU utilization) so that they operate as a single unit. Simply running that single unit in a VM does not provide the benefits detailed below; that is not database virtualization, it is merely server virtualization.
An Example of the Database Virtualization Problem
Say you have a database handling banking, and I have $10MM in the bank (I wish). Now let's assume the bank is busy, so it bursts that database across three VM nodes in typical cloud style. Each of those three nodes then receives a command to wire out the full $10MM. Each node sees its balance as $10MM, so each one wires out the full amount, for a total wire transfer of $30MM… see the problem? To dynamically burst your database across nodes, you need a distributed locking mechanism so that all nodes see the same data and can prevent other nodes from altering that data independently. This sounds easy, but making it perform well is a massive undertaking. Only two products have solved this problem: Oracle RAC and ScaleDB (for MySQL or MariaDB).
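To make the double-spend concrete, here is a minimal, hypothetical Python sketch. The account, amounts, and lock are illustrative only; a real virtualized database would use a cluster-wide lock manager, not a single-process threading.Lock. With the lock in place, only the first "node" that checks the balance gets to wire the money out; the others see the drained balance and refuse.

```python
import threading

# Simplified stand-in for a distributed lock manager; in a real
# virtualized database the lock spans nodes, not threads.
balance_lock = threading.Lock()
account = {"balance": 10_000_000}
approved = []

def wire_out(amount):
    # Each "node" must acquire the cluster-wide lock before it
    # reads and updates the shared balance.
    with balance_lock:
        if account["balance"] >= amount:
            account["balance"] -= amount
            approved.append(amount)
        # Otherwise the wire is rejected: the balance is already gone.

threads = [threading.Thread(target=wire_out, args=(10_000_000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("wires approved:", approved)               # only one $10MM wire goes out
print("remaining balance:", account["balance"])  # 0, not -$20MM
```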
Defining Database Virtualization
- It should enable the application to talk to a single virtual instance of the database, when in fact there are N actual nodes operating on the data.
- It should separate the data processing (CPU) from the data (I/O) so that each can scale on demand, independently of the other.
- For maximum performance, it should enable the actual processing of the data to be distributed to the various nodes on the storage tier (function shipping). Note: in practice, this is similar to MapReduce (see the sketch after this list).
- It should provide tiered caching for performance, but also ensure cache coherence across the entire cluster.
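As a rough illustration of function shipping, here is a hypothetical Python sketch (the in-memory "storage nodes" and row layout are invented for the example): the filter and partial aggregate run where the data lives, and the compute tier only combines the small partial results, mirroring the map/reduce split.

```python
# Minimal function-shipping sketch (hypothetical in-memory "storage nodes"):
# instead of pulling every row to one compute node, each storage node runs
# the filter and partial aggregate locally, and the compute tier only
# combines the small partial results.

storage_nodes = [
    [{"region": "east", "amount": 120}, {"region": "west", "amount": 80}],
    [{"region": "east", "amount": 200}, {"region": "west", "amount": 50}],
    [{"region": "east", "amount": 75}],
]

def local_sum(rows, region):
    # Runs on the storage node: filter + partial aggregate (the "map" step).
    return sum(r["amount"] for r in rows if r["region"] == region)

def shipped_total(region):
    # Runs on the compute tier: combine the partials (the "reduce" step).
    return sum(local_sum(rows, region) for rows in storage_nodes)

print(shipped_total("east"))  # 395, computed without moving raw rows
```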
Benefits of Database Virtualization
Higher Server Utilization: When the data is fused to the CPU as a single unit, that one node is responsible for handling all usage spikes for its collection of data. This forces you to split the data thinly across many servers (silos) and to run each server at a low utilization rate. Database virtualization decouples the data from the processing so that a spike in usage can be shared across many nodes on the fly. This enables you to run a virtualized database at a very high utilization rate.
Reduced Infrastructure Costs: Database virtualization enables you to use fewer servers and network switches, less power and storage, and fewer OS, tool, and application licenses, among other things.
Reduced Manpower Costs: Database virtualization simplifies the DBA's job: because it uses a single schema and no sharding, it also simplifies backup processes, enabling the DBA to handle more databases. It reduces the application developer's workload by eliminating sharding-related code, e.g. database routing and rebuilding relationships between shards (such as joins), as shown in the sketch below. It also simplifies the network admin's job, since there are fewer servers to manage and they are identical.
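The sharding boilerplate that goes away looks roughly like this hypothetical hash-based router (the connection strings and key scheme are invented); with a virtualized database, the application simply connects to one logical endpoint instead.

```python
# Hypothetical shard-routing code that database virtualization removes:
# the application must pick the right shard, and any query that spans
# customers has to be stitched together by hand.

SHARD_DSNS = [
    "mysql://db-shard-0/bank",
    "mysql://db-shard-1/bank",
    "mysql://db-shard-2/bank",
]

def shard_for(customer_id: int) -> str:
    # Application-level routing: hash the key to choose a shard.
    return SHARD_DSNS[hash(customer_id) % len(SHARD_DSNS)]

# With a virtualized database the application just uses one endpoint,
# e.g. "mysql://db-virtual/bank", and cross-customer joins stay in SQL.
print(shard_for(42))
```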
Reduced Complexity: You only have a single database image, so elastically scaling up/down is simple and fast.
Increased Flexibility: Database virtualization brings the same flexibility to the database tier that server virtualization brings to the application tier. Resources are allocated and reallocated on the fly. If your usage profile changes, e.g. payroll one day and benefits the next, a virtual database uses the same shared infrastructure for any workload, while a traditional database does not.
Quality of Service: Since database images can move on the fly, without downtime, a noisy neighbor or noisy network is solved by simply moving the database to another node in your pool.
Availability: Unlike the nodes of a traditional database, virtualized database nodes all see all of the data, so they inherently provide failover for one another, addressing unplanned downtime. As for planned downtime, simply move the process to another server and take down the one that needs service, again without interruption.
Improved Performance: Because the pooled cache across the storage tier uses a Least Recently Used (LRU) algorithm, it can free huge amounts of that pooled cache for the current workload, enabling near in-memory performance. Also, as mentioned above, distributing processing to the storage tier enables high-performance parallel processing.
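For readers unfamiliar with LRU eviction, here is a minimal, self-contained Python sketch (capacity and page names are invented): the page touched least recently is evicted first, so cache space naturally shifts to whatever the current workload is reading.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU sketch: the least recently touched page is evicted first,
    so cache space naturally shifts to the pages the current workload reads."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data, oldest first

    def get(self, page_id):
        if page_id not in self.pages:
            return None
        self.pages.move_to_end(page_id)  # mark as most recently used
        return self.pages[page_id]

    def put(self, page_id, data):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page

cache = LRUCache(capacity=2)
cache.put("payroll-1", "...")
cache.put("payroll-2", "...")
cache.get("payroll-1")          # touch payroll-1 so it stays hot
cache.put("benefits-1", "...")  # evicts payroll-2, the LRU page
print(list(cache.pages))        # ['payroll-1', 'benefits-1']
```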
True database virtualization delivers a huge set of
advantages that in many ways mirror the benefits server virtualization provides
to applications. For this reason, we expect database virtualization to be the
next big thing, following in the footsteps of server, storage and network
virtualization.