Database shards
===============

Traditional RDBMS are designed and optimized to function effectively on tables up to a certain size (the exact size varies based on the specific database software, underlying hardware, query types to be supported, and system configuration). Many current applications need to operate on scales far beyond what a traditional RDBMS handles effectively, so alternatives were needed. One of the most successful techniques for dramatically scaling a database is the use of "shards": a form of horizontal partitioning. This is the underlying concept for the database systems used by Google, Amazon, Facebook, Skype, Wikipedia, flickr, etc.

General idea
------------

- given a database composed of a number of tables, of which a small subset are the 'troublesome' ones with respect to scaling, we replicate the entire database many times (running each new copy on its own hardware) but divide the rows in the 'problem' tables across the copies

- each copy, or shard, thus has a small subset of the problem table entries, plus all the necessary content from the other supporting tables

- a query run against any individual shard runs much faster than it would against the whole original version, but might not come back with all the rows needed to answer the query, since some of those rows might actually be in other shards

- thus we add another layer to the query process: we pass the query on to all the shards we deem necessary, each shard runs the query independently on its own data, and once we get all the answers back we compose the final answer to the original query (a sketch of such a query layer appears after the 'Dividing the data' section below)

Effects
-------

- queries run much faster, since they are being parallelized across the shards

- much more hardware is needed, since we're replicating substantial portions of the database across each of the shards

Dividing the data across the shards
-----------------------------------

We need to decide which rows from the various tables belong in each shard, and each shard needs enough information to be internally self-sufficient. Thus a shard might not need *all* the rows from the supporting tables, but might need a substantial portion of them. (This is often referred to as 'shared nothing', meaning no shard needs to query another shard to come up with a response.) If there are multiple 'problem' tables in the database, then we need a correct subset of each to handle any/all queries directed against that database. A significant aspect of this is identifying precisely the query types we will support - systems using sharding often support a narrower range of queries than a traditional RDBMS might, but support that query range on a much larger data set.

If we pick one table as the focal point for dividing the data, then for each shard we choose a subset of rows from that table, and based on the data in those rows we can identify which rows from the other tables that shard needs. The rows chosen might be based on some application-specific knowledge (e.g. divided by product type, by region, by account type, by ...?). One difficulty with this is that the query volume against a specific shard might experience tremendous spikes at various times (e.g. during business hours for that region, at times of peak user interest in a particular product, etc). This makes load balancing across the shards much more difficult. Another difficulty is that it may be hard to accurately predict growth in the shards divided this way, or to subdivide the shards when more scaling is needed.
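The sketch below (Python) roughly illustrates the two pieces described above: the extra query layer from 'General idea' and key-based division from this section. Everything in it is assumed for illustration only - the four shards are plain in-memory lists standing in for real database instances on their own hardware, `region` is an arbitrary choice of dividing column, and the helpers (`shards_for_query`, `query_shard`, `fan_out_query`) are hypothetical names, not part of any particular system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical setup: rows from the 'problem' table (orders) divided across
# four shards by region. Each "shard" is just an in-memory list of rows
# standing in for a real database instance.
SHARDS = {
    0: [{"order_id": 1, "region": "east",  "total": 40}],
    1: [{"order_id": 2, "region": "west",  "total": 75}],
    2: [{"order_id": 3, "region": "north", "total": 10}],
    3: [{"order_id": 4, "region": "south", "total": 55}],
}

REGION_TO_SHARD = {"east": 0, "west": 1, "north": 2, "south": 3}


def shards_for_query(region=None):
    """Pick the shards we deem necessary: one shard if the query names a
    region, otherwise all of them."""
    if region is not None:
        return [REGION_TO_SHARD[region]]
    return list(SHARDS)


def query_shard(shard_id, min_total):
    """Stand-in for running the query against one shard's own data."""
    return [row for row in SHARDS[shard_id] if row["total"] >= min_total]


def fan_out_query(min_total, region=None):
    """The extra query layer: send the query to the chosen shards in
    parallel, then compose the partial answers into one final result."""
    targets = shards_for_query(region)
    results = []
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda s: query_shard(s, min_total), targets):
            results.extend(partial)
    return results


if __name__ == "__main__":
    print(fan_out_query(min_total=50))                 # asks every shard
    print(fan_out_query(min_total=0, region="west"))   # asks only shard 1
```

In a real deployment the region-to-shard mapping would live in configuration or a lookup service, `query_shard` would issue SQL against that shard's own database, and the composition step might also need to re-sort or re-aggregate the partial answers.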
An alternative is to randomly assign rows to shards, e.g. each shard gets a random 1% of all rows from the source table. To do this we need to ensure there are enough shards to give a 'good enough' chance that every row winds up in at least one shard. This means we use *far* more shards (hence much more hardware) than the other approach. With this approach, however, our query handlers can also pick a random subset of shards to ask for any given query, as long as the subset is large enough to give a 'good enough' chance of containing all the rows we'll need to formulate an answer: if each row is stored in each shard independently with probability p and we ask k shards, the chance that a particular needed row is missed is (1 - p)^k, which we can drive as low as we like by raising k. This adds substantial fault tolerance to our system, as failures in individual shards are (with high probability) masked by the redundant data stored in other shards. It also does a good job of load balancing, since each shard is equally likely to be asked for any given query, no matter where the actual demand spikes occur. It is also more easily scalable, since we just alter the number of shards overall, the number of shards asked for any given query, and the probability of any row being stored in a shard. (A small sketch of this approach follows.)
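Below is a minimal sketch of this random approach, again in Python. The shard count, storage probability, query fan-out, and row count are made-up parameters chosen only to make the arithmetic visible, not recommendations from any real system.

```python
import random

# Assumed parameters, for illustration only.
NUM_SHARDS = 2000        # far more shards than in the key-based scheme
STORE_PROB = 0.01        # each shard keeps a random ~1% of all rows
SHARDS_PER_QUERY = 1000  # each query asks a random half of the shards
NUM_ROWS = 1000


def build_shards():
    """Randomly assign rows to shards: each shard independently keeps each
    row with probability STORE_PROB, so most rows land in many shards."""
    shards = [set() for _ in range(NUM_SHARDS)]
    for row_id in range(NUM_ROWS):
        for shard in shards:
            if random.random() < STORE_PROB:
                shard.add(row_id)
    return shards


def query_random_subset(shards, wanted_rows):
    """Ask a random subset of shards and merge their answers. With these
    parameters the chance of missing any one needed row is
    (1 - 0.01) ** 1000, i.e. roughly 4e-5."""
    chosen = random.sample(shards, SHARDS_PER_QUERY)
    found = set()
    for shard in chosen:
        found |= shard & wanted_rows
    return found


if __name__ == "__main__":
    shards = build_shards()
    wanted = set(range(50))        # pretend the query needs these 50 rows
    answer = query_random_subset(shards, wanted)
    print(f"asked for {len(wanted)} rows, got {len(answer)} back")
```

The knobs mentioned above are all visible here: raising STORE_PROB or SHARDS_PER_QUERY drives the miss probability down, at the cost of more storage or more per-query work, while raising NUM_SHARDS spreads the same data and load over more machines.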