
Monday, July 25, 2011

ScaleDB: Shared-Disk / Shared-Nothing Hybrid

The primary database architectures—shared-disk and shared-nothing—each have their advantages. Shared-disk has functional advantages such as high availability, elasticity, ease of set-up and maintenance, and the elimination of partitioning/sharding and master-slave replication. The shared-nothing advantages are better performance and lower costs. What if you could offer a database that is a hybrid of the two, one that offers the advantages of both? This sounds too good to be true, but it is in fact what ScaleDB has done.

The underlying architecture is shared-disk, but in many situations it can operate like shared-nothing. You see, the problems with shared-disk arise from the messaging necessary to (a) ship data among nodes and storage; and (b) synchronize the nodes in the cluster. The trick is to move the messaging outside of the transaction so it doesn't impact performance. The way to achieve that is to exploit locality. Let me explain.

When using a shared-disk database, if your application or load balancer just randomly sprays the database requests to any node in the cluster, all of the nodes end up sharing all of the data. This involves a lot of data shipping between nodes and messaging to keep track of which node has what data and what it has done to it. This is at the core of the challenge for companies like ours that build shared-disk databases…it ain't easy. There are many things you can do to optimize performance in such a scenario, such as local caching and a shared cache (we use CAS, Oracle uses Cache Fusion). However, the bottom line is that even with these optimizations, random distribution of database requests results in suboptimal database performance in some scenarios.

Once you have solved the worst-case scenario of random database requests, you can start optimizing for the intelligent routing of database requests. By this I mean that either the application or the load balancer sends specific database requests to specific nodes in the cluster. Intelligent database request routing results in something we in the shared-disk database world call locality. The database nodes are able to operate on local data while updating the rest of the cluster only asynchronously. In this scenario, the database nodes, which are still using a shared-disk architecture, operate much more independently, like shared-nothing. As a result, data shipping and messaging are almost completely eliminated, resulting in performance comparable to shared-nothing, while still maintaining the advantages of shared-disk.

The trick is for the database to recognize on the fly when the separate nodes can and cannot operate in this independent fashion. This is complicated by the fact that the database must recognize and adapt to locality, which can evolve as database usage changes, nodes are added or removed, etc. This is one aspect of the secret sauce that is built into ScaleDB.

Note: Now that we've built a shared-disk database that can recognize locality and respond by acting (and performing) like a shared-nothing database, how do we achieve locality in the first place? There are many ways. It can be built into the application, as sketched below, or you can rely on a SQL-aware routing/caching solution like those available from NetScaler and ScalArc that handle this for you.
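To make the application-side option concrete, here is a minimal sketch of locality-aware routing in Python; the node addresses and the choice of user ID as the routing key are illustrative assumptions, not part of any product. Hashing the key to a stable node means requests for the same data keep landing on the same node, which is exactly what builds locality:

```python
import hashlib

class LocalityRouter:
    """Route requests keyed by some entity (e.g., a user ID) to a stable
    node, so each node repeatedly works the same subset of the shared data."""

    def __init__(self, nodes):
        self.nodes = nodes

    def node_for(self, key):
        # A stable hash maps the same key to the same node for as long as
        # membership is unchanged; that repetition is what creates locality.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

router = LocalityRouter(["node1:3306", "node2:3306", "node3:3306"])
print(router.node_for(42))  # every request for user 42 hits the same node
```

Note that, unlike sharding, a change in this mapping is only a cache warm-up event, not a data migration, because every node can still reach all of the data.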

Monday, March 14, 2011

The CAP Theorem Event Horizon

The CAP Theorem has become a convenient excuse for throwing data consistency under the bus. It is automatically assumed that every distributed system falls prey to CAP and therefore must sacrifice one of the three objectives, with consistency being the consistent fall guy. This automatic assumption is simply false. I am not debating the validity of the CAP Theorem, but instead positing that the CAP limitations—what I call the CAP event horizon—do not set in as soon as you move to a second master database node. Certain approaches can, in fact, extend the CAP event horizon.

Physics tells us that different properties apply at different scales. For example, quantum physics displays properties that do not apply at larger scales. We see similar nuances in scaling databases. For example, if you are running a master-slave database, using synchronous replication with a single slave is no problem. Add nine more slaves and it slows the system. Add another ninety slaves and you have a real problem with synchronous replication. In other words, consistency at small scale is no problem, but at large scale it becomes impossible, because your latency goes parabolic.
If you break the database into master (read/write) and slave (read-only) functions, you can operate a handful of slaves using synchronous replication without crossing the CAP event horizon, at least among the slaves. However, the master does present a SPOF (single point of failure), undermining availability.
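To see why the latency goes parabolic, consider a toy simulation (illustrative numbers and distribution, not measurements): under synchronous replication a commit completes only when the slowest of N replicas acknowledges, so commit latency is the maximum of N samples and climbs steadily as replicas are added:

```python
import random

def sync_commit_latency(num_replicas, trials=10000):
    """Synchronous replication: a commit waits for the slowest replica,
    so latency is the max of num_replicas acknowledgment times."""
    total = 0.0
    for _ in range(trials):
        acks = [random.lognormvariate(0, 0.5) for _ in range(num_replicas)]
        total += max(acks)
    return total / trials  # average commit latency, arbitrary units

for n in (1, 10, 100):
    print(n, round(sync_commit_latency(n), 2))
# The max of 100 samples is far larger than the max of 1: consistency
# that is cheap at small scale becomes prohibitive at large scale.
```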

Using a shared-nothing architecture, as soon as you introduce more than a single master, you hit the CAP event horizon. However, shared-disk / shared-cache systems like Oracle RAC and ScaleDB extend the CAP event horizon. They don't invalidate the CAP Theorem; they merely push the event horizon out, by addressing the CAP issues while maintaining low latency.

Shared-disk databases enable multiple nodes to share the same data. Internal processes ensure that the data remains consistent across all nodes, while providing availability and partition tolerance. These processes entail a certain overhead from inter-nodal messaging. There are techniques that can dramatically reduce this inter-nodal messaging, yielding a system that provides the advantages of shared-disk with a performance profile rivaling shared-nothing, but that will have to wait for a later post.

While shared-nothing databases cross the CAP event horizon as soon as you add a second master, shared-disk databases extend this event horizon well into a handful of database nodes. Optimizations can further extend it to address dozens of database nodes (all masters). As you move to web-scale applications, you will certainly cross the CAP event horizon, but most OLTP-type applications can operate quite effectively on ten or fewer database servers, and in that case there is no need to throw consistency under the bus solely in the name of the CAP Theorem.


Friday, September 17, 2010

ScaleDB Cache Accelerator Server (CAS): A Game Changer for Clustered Databases

ScaleDB and Oracle RAC are both clustered databases that use a shared-disk architecture. As I have mentioned previously, they both actually share data via a shared cache, so it might be more appropriate to call them shared-cache databases.

Whether it is called shared-disk or shared-cache, these databases must orchestrate the sharing of a single set of data amongst multiple nodes. This introduces two challenges: the physical sharing of the data and the logical sharing of the data.

Physical Sharing:
Raw storage is meant to work on a 1:1 basis with a single server. In order to share that data amongst multiple servers, you need either a Network File System (NFS), which shares whole files, or a Cluster File System (CFS), which shares data blocks.

Logical Sharing:
This is specific to databases. A database may request a single block of data from the storage and then coordinate multiple sequential changes to that block, with only the final result being written back to the storage. The database can also distinguish between reads and writes of the data, to facilitate parallelizing these actions.

Databases must control the logical sharing of data, in order to ensure that the database doesn’t become corrupted or inconsistent, and to ensure that it provides good performance. Because logical sharing is very specific to the database, it is something that clustered databases must handle themselves. This function is addressed by a lock manager.
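As a rough illustration of what a lock manager does (a simplified, single-process sketch, not ScaleDB's or Oracle's actual implementation), the code below lets many nodes hold shared read locks on a block while a write lock must be exclusive:

```python
class BlockLockManager:
    """Toy lock manager: many readers OR one writer per data block."""

    def __init__(self):
        self.readers = {}  # block_id -> set of nodes holding read locks
        self.writers = {}  # block_id -> node holding the write lock

    def acquire_read(self, block_id, node):
        if self.writers.get(block_id) not in (None, node):
            return False  # another node is writing this block
        self.readers.setdefault(block_id, set()).add(node)
        return True

    def acquire_write(self, block_id, node):
        if self.writers.get(block_id) not in (None, node):
            return False  # another writer already holds this block
        if self.readers.get(block_id, set()) - {node}:
            return False  # other nodes are still reading this block
        self.writers[block_id] = node
        return True

    def release(self, block_id, node):
        self.readers.get(block_id, set()).discard(node)
        if self.writers.get(block_id) == node:
            del self.writers[block_id]
```

In a real cluster, each of these acquire and release calls becomes an inter-nodal message, which is one source of the overhead a clustered database must manage.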

Physical sharing of data requires less integration with the database logic. As such, you can use a general-purpose NFS or CFS to provide the physical file sharing capabilities. This is what Oracle RAC does: it relies upon Oracle Cluster File System 2 (OCFS2) to provide generic physical file sharing. OCFS2 in turn relies upon a SAN or NAS that supports multi-attach, since all of the database nodes must share the same physical files. The NAS or SAN then handles data duplication for high availability, along with other services such as backup.

ScaleDB takes a different approach. ScaleDB not only handles the logical data sharing—with its lock manager—but also handles the physical data sharing with its Cache Accelerator Server (CAS). CAS connects directly to the storage and handles the sharing of that data among the database nodes. Because CAS is purpose-built for the ScaleDB database, it does not need services such as membership management, which create complexity and overhead in a general-purpose CFS. Furthermore, ScaleDB is able to tune the CAS, in conjunction with the lock manager, to extract superior performance.

CAS also offers additional benefits. It provides a scalable shared cache that enables the database nodes to share data via the cache, which is much faster than sharing via the disk. Furthermore, since it eliminates the need for an NFS or CFS, it enables you to work with any storage: local storage inside the CAS servers, cloud storage, or a SAN or NAS. Many in the MySQL community balk at the high cost of SAN storage, with its Fibre Channel adapters and switches. CAS supports low-cost local storage, while providing a seamless path to high-end storage as needed. Furthermore, CAS servers are deployed in pairs, so the data is mirrored; because the data is mirrored, you have redundant storage even when using the local storage inside the servers running CAS. Because it can operate on commodity hardware and because it works with any storage, CAS is ideal for cloud computing.
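The mirrored-pair behavior can be pictured with a small sketch; the class and method names below are hypothetical, meant only to illustrate the idea that every block write goes to both members of a pair, so either member can serve the data after a failure:

```python
class MirroredCachePair:
    """Hypothetical client for a mirrored pair of cache/storage servers:
    writes go to both members; reads fail over from one to the other."""

    def __init__(self, primary, mirror):
        self.primary = primary  # object wrapping one server of the pair
        self.mirror = mirror    # its partner

    def write_block(self, block_id, data):
        # Mirroring every write means the data survives the loss of one
        # server, even when the pair uses only local disks.
        self.primary.put(block_id, data)
        self.mirror.put(block_id, data)

    def read_block(self, block_id):
        try:
            return self.primary.get(block_id)
        except ConnectionError:
            return self.mirror.get(block_id)  # fail over to the mirror
```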

In summary, clustered databases like Oracle RAC and ScaleDB must implement their own lock managers to manage the logical sharing of data amongst the database nodes. Providing a purpose-built solution for the physical sharing of the data, while not required, does provide some significant advantages over using a general purpose NFS or CFS.

Tuesday, July 20, 2010

Database Architectures & Performance

For decades the debate between shared-disk and shared-nothing databases has raged. The shared-disk camp points to a laundry list of functional benefits: improved data consistency, high availability, scalability, and the elimination of partitioning/replication/promotion. The shared-nothing camp shoots back with superior performance and reduced costs. Both sides have a point.

First, let's look at the performance issue. RAM (average access time of 200 nanoseconds) is considerably faster than disk (average access time of 12,000,000 nanoseconds). Let me put this 200:12,000,000 ratio (a factor of 60,000) into perspective. A task that takes a single minute in RAM would take 60,000 minutes, roughly 41 days, on disk. So why do I bring this up?

Shared-Nothing: Since the shared-nothing database has sole ownership of its data—it doesn't share the data with other nodes—it can operate in the machine's local RAM, only infrequently flushing the data to disk. This makes shared-nothing databases very fast.

Shared-Disk: A shared-disk database cannot rely on the machine's local RAM, because every write by one node must be instantly available to the other nodes, to ensure that they don't use stale data and corrupt the database. So instead of relying on local RAM, all write transactions must be written to disk. This is where the one-minute-to-41-days ratio above comes into play and kills the performance of shared-disk databases.

Let’s look at some of the ways databases can utilize RAM instead of disk to improve performance:

Read Cache: Databases typically use RAM as a fast read cache. Upon reading data from the disk, this data is stored in the read cache so that subsequent use of that data is satisfied from RAM instead of the disk. For example, upon reading a person's name from disk, that name is stored in the cache for fast access. The database wouldn't need to read that name from disk again until that person's name is changed (rare), or that RAM space is reused for a piece of data that is used more frequently. Read cache can significantly improve database performance.

BOTH shared-disk and shared-nothing databases can exploit read cache. The shared-disk database just needs a system to either invalidate or update the data in read cache when one of the nodes has made a change. This is pretty standard in shared-disk databases.
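A minimal sketch of such a read cache (a hypothetical API; real databases do this at the page or buffer-pool level) shows both halves of the mechanism: the read-through path and the invalidation hook that a write triggers:

```python
class ReadCache:
    """Read-through cache: serve repeated reads from RAM, and drop an
    entry when a change makes the cached copy stale."""

    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk  # slow fallback loader
        self.cache = {}

    def get(self, key):
        if key not in self.cache:             # miss: ~12,000,000 ns path
            self.cache[key] = self.read_from_disk(key)
        return self.cache[key]                # hit: ~200 ns path

    def invalidate(self, key):
        # In a shared-disk cluster, a peer node calls this (directly or
        # via the lock manager) after it changes the data behind key.
        self.cache.pop(key, None)
```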

Background Writing: Writing data to the disk is by far the most time-consuming part of a write transaction. During the transaction, that portion of the data is locked, meaning it is unavailable for other functions. So, if you can move the writing of the data outside of the transaction—write the data in the background—you get faster transactions, which means less lock contention, which means faster throughput.

SHARED-NOTHING can exploit this performance enhancement, since each server owns the data in its RAM. However, shared-disk databases cannot do this, because they need to share that updated data with the other database nodes in the cluster. Since the local node's cache is not shared in a shared-disk database, the only option is to use the shared disk to share that data across the nodes.
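Here is a simplified, in-process sketch of background writing (a real engine does this with a redo log and dirty-page flushing, and would not acknowledge a commit without durability): the transaction updates RAM and returns immediately, while a background thread performs the slow disk write:

```python
import queue
import threading

class WriteBehindBuffer:
    """Apply writes in RAM immediately; flush them to disk from a
    background thread, outside the transaction's critical path."""

    def __init__(self, write_to_disk):
        self.write_to_disk = write_to_disk
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, value):
        self.pending.put((key, value))  # fast path: RAM only

    def _flusher(self):
        while True:
            key, value = self.pending.get()  # slow disk write happens
            self.write_to_disk(key, value)   # off the transaction's clock
```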

Transactional Cache: The next step in utilizing RAM instead of disk is to use it in a transactional manner. This means that the database can make multiple changes to data in RAM prior to writing the final results to disk. For example, if you have 100 widgets, you can store that inventory count in RAM, and then decrement it with each sale. If you sell 23 widgets, then instead of writing each transaction to disk, you update it in RAM. When you flush this data to disk, it results in a single disk write, writing the inventory number 77, instead of writing each of the 23 transactions individually to disk.

SHARED-NOTHING can perform transactions on data while it is in RAM. Once again, shared-disk databases cannot do this, because multiple nodes might be updating the inventory. Since they cannot look into each other's local RAM, they must once again write each transaction to disk.
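In code, the widget example reduces to a coalescing counter (a sketch; real engines coalesce changes per page in the buffer pool): 23 decrements in RAM collapse into a single disk write of the final value, 77:

```python
class CoalescingCounter:
    """Batch many in-RAM updates into one disk write of the final value."""

    def __init__(self, initial, write_to_disk):
        self.value = initial
        self.dirty = False
        self.write_to_disk = write_to_disk

    def decrement(self, n=1):
        self.value -= n   # pure RAM update, no disk I/O
        self.dirty = True

    def flush(self):
        if self.dirty:
            self.write_to_disk(self.value)  # one write of 77, not 23 writes
            self.dirty = False

inventory = CoalescingCounter(100, write_to_disk=print)
for _ in range(23):
    inventory.decrement()
inventory.flush()  # prints 77: a single disk write covering 23 sales
```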

As you can see, shared-nothing databases have an inherent performance advantage. The next blog post will address how modern shared-disk databases address these performance challenges.

Friday, November 27, 2009

Virtual Databases: The Face of the New Cloud Database

Shared-disk databases can be virtualized—making them cloud-friendly—while shared-nothing databases are tied to a specific computer and a specific data set or data partition.

The underlying principle of the shared-nothing RDBMS is that a single master server owns its specific set of data. That data is not shared, hence the name shared-nothing. Because there is no ability to share the data, there is also no ability to virtualize the computing of that data. Instead, the shared-nothing RDBMS ties the data and the computing to a specific computer. This association with a physical machine is then reinforced at the application level. Applications leveraging a shared-nothing database that is partitioned across more than one server use routing code, which simply directs the various database requests to the servers that own the data being requested. In other words, the application must know which server owns which piece of data. This further reinforces the mismatch between shared-nothing databases and virtualization.

This is not to say that it is impossible to virtualize a shared-nothing database. As any software architect will tell you, “You can do anything in software…” The second part of that statement is “…but it may not perform or scale well, and it may make maintenance very painful.” The latter part of that statement is exactly what you will find with any effort to virtualize a shared-nothing database. Attempts to insert layers of indirection will result in added complexity that makes maintenance a nightmare. Finding bugs, tuning performance, recovering from failure, all of these issues are severely compounded when you introduce layers of indirection in a shared-nothing database.

Performance, and hence scalability, are also undermined in this model. In order to support dynamic virtualization, you must mediate the requests from the application before they hit the database. This requires a piece of middleware that sniffs each database request and routes it to the appropriate server. What happens when a database request spans multiple servers? Suffice it to say it isn't pretty, and it doesn't perform well; this sort of request results in a lot of data shipping and joins, as the sketch below illustrates. The bottom line is that partitioning your database for performance, scalability and maintainability is a black art, and all attempts to automate this process have failed.
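A sketch makes the cross-server case concrete (the shard interface is hypothetical): when a request spans servers, the middleware must fan out to every shard, ship all the matching rows back to one coordinator, and finish the join or aggregation there:

```python
def scatter_gather(shards, predicate):
    """Cross-shard query: ask every shard, ship the matching rows to the
    coordinator, and combine them there. Latency is set by the slowest
    shard, and the network carries the whole intermediate result."""
    rows = []
    for shard in shards:                     # in practice, in parallel
        rows.extend(shard.query(predicate))  # data shipping across nodes
    return rows  # the coordinator must still sort/join/aggregate locally
```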

Compare this to the shared-disk DBMS. Shared-disk separates the compute from the storage. The data is stored in one big trough, while you can have any number of compute instances feeding on the entirety of that data. Because each node has access to all of the data, you don't need any middleware to route the database requests to specific servers. Furthermore, each of the compute nodes is identical, making them virtualization-friendly. If one node fails, the others recover the transactions, while the application continues uninterrupted. You can also add nodes on the fly, again without interrupting the application. For these reasons, the shared-disk RDBMS is ideal for virtualization, while the shared-nothing RDBMS is anathema to virtualization.

This is an excerpt from a white paper I'm writing that addresses virtualized cloud databases.