
Monday, July 25, 2011

ScaleDB: Shared-Disk / Shared-Nothing Hybrid

The primary database architectures, shared-disk and shared-nothing, each have their advantages. Shared-disk offers functional advantages such as high availability, elasticity, ease of set-up and maintenance, and the elimination of partitioning/sharding and of master-slave replication. Shared-nothing's advantages are better performance and lower cost. What if you could offer a database that is a hybrid of the two, one that offers the advantages of both? This sounds too good to be true, but it is in fact what ScaleDB has done.

The underlying architecture is shared-disk, but in many situations it can operate like shared-nothing. You see, the problems with shared-disk arise from the messaging necessary to (a) ship data among nodes and storage; and (b) synchronize the nodes in the cluster. The trick is to move the messaging outside of the transaction so it doesn't impact performance. The way to achieve that is to exploit locality. Let me explain.

When using a shared-disk database, if your application or load balancer just randomly sprays database requests across the nodes in the cluster, all of the nodes end up sharing all of the data. This involves a lot of data shipping between nodes, plus messaging to keep track of which node has what data and what it has done to it. This is at the core of the challenge for companies like ours that build shared-disk databases…it ain't easy. There are many things you can do to optimize performance in such a scenario, like local caching and a shared cache (we use CAS, Oracle uses Cache Fusion). However, the bottom line is that even with these optimizations, random distribution of database requests results in suboptimal database performance for some scenarios.

Once you have solved the worst case scenario of random database requests, you can start optimizing for the intelligent routing of database requests. By this I mean that either the application or the load balancer sends specific database requests to specific nodes in the cluster. Intelligent database request routing results in something we in the shared-database world call locality. The database nodes are able to operate on local data while only updating the rest of the cluster asynchronously. In this scenario, the database nodes, which are still using a shared-disk architecture, operate much more independently, like shared-nothing. As a result, data shipping and messaging are almost completely eliminated, resulting in performance comparable to shared-nothing, while still maintaining the advantages of shared-disk.
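To make intelligent routing concrete, here is a minimal sketch in Python, assuming the application keys its requests by something like a customer ID. The node names and the route() helper are hypothetical illustrations, not part of ScaleDB or any particular load balancer:

```python
import hashlib

NODES = ["db-node-1", "db-node-2", "db-node-3"]  # cluster members

def route(customer_id: str) -> str:
    """Pin each customer to one node so its rows stay in that node's cache."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every request for the same customer lands on the same node, so that node
# works against a warm local cache and rarely ships data to its peers.
assert route("customer-42") == route("customer-42")
```

A stable mapping like this is all that locality requires of the application; the hard part, as described next, is what the database itself must do when the mapping shifts.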

The trick is for the database to recognize, on the fly, when the separate nodes can and cannot operate in this independent fashion. This is complicated by the fact that the database must recognize and adapt to locality, which can evolve as database usage changes, nodes are added or removed, etc. This is one aspect of the secret sauce that is built into ScaleDB.

Note: now that we've built a shared-disk database that can recognize locality and respond by acting (and performing) like a shared-nothing database, how do we achieve locality? There are many ways. It can be built into the application, or you can rely on SQL-aware routing/caching solutions, like those available from NetScaler and ScalArc, that handle this for you.

Friday, September 17, 2010

ScaleDB Cache Accelerator Server (CAS): A Game Changer for Clustered Databases

ScaleDB and Oracle RAC are both clustered databases that use a shared-disk architecture. As I have mentioned previously, they both actually share data via a shared cache, so it might be more appropriate to call them shared-cache databases.

Whether it is called shared-disk or shared-cache, these databases must orchestrate the sharing of a single set of data amongst multiple nodes. This introduces two challenges: the physical sharing of the data and the logical sharing of the data.

Physical Sharing:
Raw storage is meant to work on a 1:1 basis with a single server. In order to share that data amongst multiple servers, you need either a Network File System (NFS), which shares whole files, or a Cluster File System (CFS), which shares data blocks.

Logical Sharing:
This is specific to databases. A database may request a single block of data from the storage and then it may coordinate multiple sequential changes to that block, with only the final results being written back to the storage. The database can also discriminate between reading the data and writing the data, to facilitate parallelizing these actions.

Databases must control the logical sharing of data, in order to ensure that the database doesn’t become corrupted or inconsistent, and to ensure that it provides good performance. Because logical sharing is very specific to the database, it is something that clustered databases must handle themselves. This function is addressed by a lock manager.
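To illustrate the rule a lock manager enforces, here is a minimal single-process sketch of shared (read) versus exclusive (write) locking on a block. Real cluster lock managers, ScaleDB's included, are distributed and far more involved; this only shows the core invariant that many readers may share a block while a writer needs it alone:

```python
import threading

class BlockLock:
    """Shared/exclusive lock on one data block (a toy, in-process model)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self) -> None:      # readers may proceed together
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self) -> None:
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self) -> None:   # a writer waits out everyone
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```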

Physical sharing of data requires less integration with the database logic. As such, you can use a general-purpose NFS or CFS to provide the physical file sharing capabilities. This is what Oracle RAC does: it relies upon Oracle Cluster File System 2 (OCFS2) to provide generic physical file sharing. OCFS2 in turn relies upon a SAN or NAS that supports multi-attach, since all of the database nodes must share the same physical files. The NAS or SAN then handles data duplication for high availability, along with other services like back-up and more.

ScaleDB takes a different approach. ScaleDB not only handles the logical data sharing—with its lock manager—but it also handles the physical data sharing with its Cache Accelerator Server (CAS). CAS connects directly to the storage and handles the sharing of that data among the database nodes. Because CAS is purpose-built for the ScaleDB database it does not need services such as membership management, which create complexity and overhead in a general purpose CFS. Furthermore, ScaleDB is able to tune the CAS, in conjunction with the lock manager, to extract superior performance.

CAS also offers additional benefits. It provides a scalable shared cache that enables the database nodes to share data via the cache, which is much faster than sharing via the disk. Furthermore, since it eliminates the need for an NFS or CFS, it enables you to work with any storage: local storage inside the CAS servers, cloud storage, or a SAN or NAS. Many in the MySQL community balk at the high cost of SAN storage, with its Fibre Channel adapters and switches. CAS supports low-cost local storage, while providing a seamless path to high-end storage as needed. Furthermore, CAS servers are deployed in pairs, so the data is mirrored; this gives you redundant storage even when using local storage inside the servers running CAS. Because it can operate on commodity hardware and works with any storage, CAS is ideal for cloud computing.

In summary, clustered databases like Oracle RAC and ScaleDB must implement their own lock managers to manage the logical sharing of data amongst the database nodes. Providing a purpose-built solution for the physical sharing of the data, while not required, does provide some significant advantages over using a general purpose NFS or CFS.

Monday, August 2, 2010

Database Architectures & Performance II

As described in the prior post, the shared-disk performance dilemma is simple:

1. If each node stores/processes data in memory, versus disk, it is much faster.
2. Each node must expose the most recent data to the other nodes, so those other nodes are not using old data.

In other words, #1 above says flush data to disk VERY INFREQUENTLY for better performance, while #2 says flush everything to disk IMMEDIATELY for data consistency.

Oracle recognized this dilemma when they built Oracle Parallel Server (OPS), the precursor to Oracle Real Application Cluster (RAC). In order to address the problem, Oracle developed Cache Fusion.

Cache Fusion is a peer-based shared cache. Each node works with a certain set of data in its local cache until another node needs that data. When one node needs data from another node, it requests it directly from that node's cache, bypassing the disk completely. In order to minimize this data swapping between the local caches, RAC applications are optimized for data locality. Data locality means routing certain data requests to certain nodes, thereby enjoying a higher cache hit ratio and reducing data swapping between caches. Static data locality, built into the application, severely complicates the process of adding/removing nodes in the cluster.

ScaleDB encountered the same conflict between performance and consistency (or RAM vs. disk). However, ScaleDB’s shared cache was designed with the cloud in mind. The cloud imposes certain additional design criteria:

1. The number of nodes will increase/decrease in an elastic fashion.
2. A large percentage of MySQL users will require low-cost PC-based storage.

Clearly, some type of shared-cache is imperative. Memcached demonstrates the efficiency of utilizing a separate cache tier above the database, so why not do something very similar beneath the database (between the database nodes and the storage)? The local cache on each database node is the most efficient use of cache, since it avoids a network hop. The shared cache tier, in ScaleDB’s case, the Cache Accelerator Server (CAS), then serves as a fast cache for data swapping.
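Here is a minimal sketch of that tiered read path: local cache first, then the shared cache tier, then storage. The dictionaries standing in for CAS and the disk are hypothetical placeholders, not ScaleDB's actual interfaces:

```python
STORAGE = {"block-7": b"row data..."}    # stand-in for the shared disk
CAS_CACHE: dict[str, bytes] = {}         # stand-in for the CAS tier
local_cache: dict[str, bytes] = {}       # this node's private cache

def read_block(block_id: str) -> bytes:
    if block_id in local_cache:          # fastest: no network hop
        return local_cache[block_id]
    data = CAS_CACHE.get(block_id)       # next: one network hop to the CAS
    if data is None:
        data = STORAGE[block_id]         # slowest: fall through to storage
        CAS_CACHE[block_id] = data       # populate the shared tier
    local_cache[block_id] = data
    return data
```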

The Cluster Manager coordinates the interactions between nodes. This includes both database locking and data swapping via the CAS. Nodes maintain the data in their local cache until that data is required by another node. The Cluster Manager then coordinates that data swapping via the CAS. This approach is more dynamic, because it doesn't rely on prior knowledge of the location of that data. This enables ScaleDB to support dynamic elasticity of database nodes, which is critical for cloud computing.

The following diagram describes how ScaleDB’s Cache Accelerator Server (CAS) is implemented.

While the diagram above shows a variety of physical servers, these can be virtual servers. An entire cluster, including the lock manager, database nodes, and CAS, could be implemented on just two physical servers.

Both ScaleDB and Oracle rely upon a shared cache to improve performance, while maintaining data consistency. Both approaches have their relative pros and cons. ScaleDB’s tier-based approach to shared cache is optimized for cloud environments, where dynamic elasticity is important. ScaleDB’s approach also enables some very interesting advantages in the storage tier, which will be enumerated in subsequent posts.

Tuesday, July 20, 2010

Database Architectures & Performance

For decades the debate between shared-disk and shared-nothing databases has raged. The shared-disk camp points to the laundry list of functional benefits such as improved data consistency, high-availability, scalability and elimination of partitioning/replication/promotion. The shared-nothing camp shoots back with superior performance and reduced costs. Both sides have a point.

First, let's look at the performance issue. RAM (average access time of 200 nanoseconds) is considerably faster than disk (average access time of 12,000,000 nanoseconds). Let me put this 200:12,000,000 ratio into perspective. A task that takes a single minute in RAM would take 41 days on disk. So why do I bring this up?

Shared-Nothing: Since the shared-nothing database has sole ownership of its data—it doesn’t share the data with other nodes—it can operate in the machine’s local RAM, only writing infrequently to disk (flushing the data to disk). This makes shared-nothing databases very fast.

Shared-Disk: A shared-disk database cannot rely on the machine's local RAM, because every write by one node must be instantly available to the other nodes, to ensure that they don't use stale data and corrupt the database. So instead of relying on local RAM, all write transactions must be written to disk. This is where the 1-minute-to-41-days ratio above comes into play and kills the performance of shared-disk databases.
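As a quick sanity check on that ratio, the arithmetic works out as follows (access times as quoted above; real hardware varies):

```python
ram_ns, disk_ns = 200, 12_000_000        # average access times in nanoseconds
ratio = disk_ns / ram_ns                 # disk is 60,000x slower
days_on_disk = 1 * ratio / 60 / 24       # 1 minute in RAM, converted to days
print(ratio, round(days_on_disk, 1))     # 60000.0 41.7
```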

Let’s look at some of the ways databases can utilize RAM instead of disk to improve performance:

Read Cache: Databases typically use the RAM as a fast read cache. Upon reading data from the disk, this data is stored in the read cache so that subsequent use of that data is satisfied from RAM instead of the disk. For example, upon reading a person’s name from disk, that name is stored in the cache for fast access. The database wouldn’t need to read that name from disk again until that person’s name is changed (rare), or that RAM space is reused for a piece of data that is used more frequently. Read cache can significantly improve database performance.

BOTH shared-disk and shared-nothing databases can exploit read cache. The shared-disk database just needs a system to either invalidate or update the data in read cache when one of the nodes has made a change. This is pretty standard in shared-disk databases.
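A minimal sketch of such read-cache invalidation, assuming some broadcast channel between the nodes (the on_remote_write() hook is a hypothetical stand-in for that channel):

```python
read_cache: dict[str, str] = {}

def on_remote_write(key: str) -> None:
    """Called when another node announces that it changed `key`."""
    read_cache.pop(key, None)            # drop the stale local copy

def get(key: str, disk: dict) -> str:
    if key not in read_cache:
        read_cache[key] = disk[key]      # cache miss: read through to disk
    return read_cache[key]
```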

Background Writing: Writing data to the disk is by far the most time consuming process in a write transaction. During the transaction, that portion of the data is locked, meaning it is unavailable for other functions. So, if you can move the writing of the data outside of the transaction—write the data in the background—you get faster transactions, which means less locking contention, which means faster throughput.

SHARED-NOTHING can exploit this performance enhancement, since each server owns the data in its RAM. However, shared-disk databases cannot do this, because they need to share that updated data with the other database nodes in the cluster. Since the local node's cache is not shared in a shared-disk database, the only option is to use the shared disk to share that data across the nodes.
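Here is a minimal sketch of background writing (often called write-behind), safe only when the node solely owns its data, as in the shared-nothing case; the in-memory disk dictionary is a hypothetical stand-in for real storage:

```python
import threading
import time

disk: dict[str, str] = {}     # stand-in for the disk
dirty: dict[str, str] = {}    # pages changed in RAM but not yet flushed
lock = threading.Lock()

def write(key: str, value: str) -> None:
    """The 'transaction': it touches RAM only, so it commits fast."""
    with lock:
        dirty[key] = value

def flusher(interval: float = 1.0) -> None:
    """Background thread: the slow disk write happens off the hot path."""
    while True:
        time.sleep(interval)
        with lock:
            batch = dict(dirty)
            dirty.clear()
        disk.update(batch)    # one batched write, outside any transaction

threading.Thread(target=flusher, daemon=True).start()
```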

Transactional Cache: The next step in utilizing RAM instead of disk is to use it in a transactional manner. This means that the database can make multiple changes to data in RAM prior to writing the final results to disk. For example, if you have 100 widgets, you can store that inventory count in RAM, and then decrement it with each sale. If you sell 23 widgets, then instead of writing each transaction to disk, you update it in RAM. When you flush this data to disk, it results in a single disk write, writing the inventory number 77, instead of writing each of the 23 transactions individually to disk.

SHARED-NOTHING can perform transactions on data while it is in RAM. Once again, shared-disk databases cannot do this, because you might have multiple nodes updating the inventory. Since they cannot look into each other's local RAM, they must once again write each transaction to disk.
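The widget example as a minimal sketch, showing 23 in-RAM decrements collapsing into a single disk write:

```python
inventory = {"widgets": 100}             # the working copy lives in RAM
disk: dict[str, int] = {}

for _ in range(23):                      # each sale decrements RAM only
    inventory["widgets"] -= 1

disk["widgets"] = inventory["widgets"]   # a single flush writes 77
print(disk["widgets"])                   # 77
```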

As you can see, shared-nothing databases have an inherent performance advantage. The next blog post will address how modern shared-disk databases address these performance challenges.