Database shards
===============

Traditional RDBMS are designed and optimized to function effectively on tables up to a certain size (the exact size varies based on the specific database software, underlying hardware, query types to be supported, and system configuration). Many current applications need to operate on scales far beyond what a traditional RDBMS handles effectively, so alternatives were needed. One of the most successful techniques for dramatically scaling a database is the use of "shards": a form of horizontal partitioning. This is the underlying concept for the database systems used by Google, Amazon, Facebook, Skype, Wikipedia, flickr, etc.

General idea
------------

- given a database composed of a number of tables, of which a small subset are the 'troublesome' ones with respect to scaling, we replicate the entire database many times (running each new copy on its own hardware) but divide the rows in the 'problem' tables across the copies

- each copy, or shard, thus has a small subset of the problem table entries, plus all the necessary content from the other supporting tables

- a query run against any individual shard runs much faster than it would against the whole original version, but might not come back with all the rows needed to answer the query, since some of those rows might actually be in other shards

- thus we add another layer to the query process: we pass the query on to all the shards we deem necessary, each shard runs the query independently on its own data, and once we get all the answers back we compose the final answer to the original query (a sketch of such a query layer appears after the 'Dividing the data' section below)

Effects
-------

- queries run much faster, since they are being parallelized across the shards

- much more hardware is needed, since we're replicating substantial portions of the database across each of the shards

Dividing the data across the shards
-----------------------------------

We need to decide which rows from the various tables belong in each shard, and each shard needs enough information to be internally self-sufficient. Thus a shard might not need *all* the rows from the supporting tables, but might need a substantial portion of them. (This is often referred to as 'shared nothing', meaning no shard needs to query another shard to come up with a response.) If there are multiple 'problem' tables in the database, then we need a correct subset of each to handle any/all queries directed against that database. A significant aspect of this is identifying precisely the query types we will support - systems using sharding often support a narrower range of queries than a traditional RDBMS might, but support that query range on a much larger data set.

If we pick one table as the focal point for dividing the data, then for each shard we choose a subset of rows from that table, and based on the data in those rows we can identify which rows from the other tables that shard needs. The rows chosen might be based on some application-specific knowledge (e.g. divided by product type, by region, by account type, by ...?). One difficulty with this is that the query volume against a specific shard might experience tremendous spikes at various times (e.g. during business hours for that region, at times of peak user interest in a particular product, etc). This makes load balancing across the shards much more difficult. Another difficulty is that it may be hard to accurately predict growth in the shards divided this way, or to subdivide the shards when more scaling is needed.
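The sketch below (Python) roughly illustrates the two pieces described above: the extra query layer from 'General idea' and key-based division from this section. Everything in it is assumed for illustration only - the four shards are plain in-memory lists standing in for real database instances on their own hardware, `region` is an arbitrary choice of dividing column, and the helpers (`shards_for_query`, `query_shard`, `fan_out_query`) are hypothetical names, not part of any particular system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical setup: rows from the 'problem' table (orders) divided across
# four shards by region. Each "shard" is just an in-memory list of rows
# standing in for a real database instance.
SHARDS = {
    0: [{"order_id": 1, "region": "east",  "total": 40}],
    1: [{"order_id": 2, "region": "west",  "total": 75}],
    2: [{"order_id": 3, "region": "north", "total": 10}],
    3: [{"order_id": 4, "region": "south", "total": 55}],
}

REGION_TO_SHARD = {"east": 0, "west": 1, "north": 2, "south": 3}


def shards_for_query(region=None):
    """Pick the shards we deem necessary: one shard if the query names a
    region, otherwise all of them."""
    if region is not None:
        return [REGION_TO_SHARD[region]]
    return list(SHARDS)


def query_shard(shard_id, min_total):
    """Stand-in for running the query against one shard's own data."""
    return [row for row in SHARDS[shard_id] if row["total"] >= min_total]


def fan_out_query(min_total, region=None):
    """The extra query layer: send the query to the chosen shards in
    parallel, then compose the partial answers into one final result."""
    targets = shards_for_query(region)
    results = []
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda s: query_shard(s, min_total), targets):
            results.extend(partial)
    return results


if __name__ == "__main__":
    print(fan_out_query(min_total=50))                 # asks every shard
    print(fan_out_query(min_total=0, region="west"))   # asks only shard 1
```

In a real deployment the region-to-shard mapping would live in configuration or a lookup service, `query_shard` would issue SQL against that shard's own database, and the composition step might also need to re-sort or re-aggregate the partial answers.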
An alternative is to randomly assign rows to shards, e.g. each shard gets a random 1% of all rows from the source table. To do this we need to ensure there are enough shards to give a 'good enough' chance that every row winds up in at least one shard. This means we use *far* more shards (hence much more hardware) than the other approach. With this approach, however, our query handlers can also pick a random subset of shards to ask for any given query, as long as the subset is large enough to give a 'good enough' chance of containing all the rows we'll need to formulate an answer: if each row is stored in each shard independently with probability p and we ask k shards, the chance that a particular needed row is missed is (1 - p)^k, which we can drive as low as we like by raising k. This adds substantial fault tolerance to our system, as failures in individual shards are (with high probability) masked by the redundant data stored in other shards. It also does a good job of load balancing, since each shard is equally likely to be asked for any given query, no matter where the actual demand spikes occur. It is also more easily scalable, since we just alter the number of shards overall, the number of shards asked for any given query, and the probability of any row being stored in a shard. (A small sketch of this approach follows.)
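Below is a minimal sketch of this random approach, again in Python. The shard count, storage probability, query fan-out, and row count are made-up parameters chosen only to make the arithmetic visible, not recommendations from any real system.

```python
import random

# Assumed parameters, for illustration only.
NUM_SHARDS = 2000        # far more shards than in the key-based scheme
STORE_PROB = 0.01        # each shard keeps a random ~1% of all rows
SHARDS_PER_QUERY = 1000  # each query asks a random half of the shards
NUM_ROWS = 1000


def build_shards():
    """Randomly assign rows to shards: each shard independently keeps each
    row with probability STORE_PROB, so most rows land in many shards."""
    shards = [set() for _ in range(NUM_SHARDS)]
    for row_id in range(NUM_ROWS):
        for shard in shards:
            if random.random() < STORE_PROB:
                shard.add(row_id)
    return shards


def query_random_subset(shards, wanted_rows):
    """Ask a random subset of shards and merge their answers. With these
    parameters the chance of missing any one needed row is
    (1 - 0.01) ** 1000, i.e. roughly 4e-5."""
    chosen = random.sample(shards, SHARDS_PER_QUERY)
    found = set()
    for shard in chosen:
        found |= shard & wanted_rows
    return found


if __name__ == "__main__":
    shards = build_shards()
    wanted = set(range(50))        # pretend the query needs these 50 rows
    answer = query_random_subset(shards, wanted)
    print(f"asked for {len(wanted)} rows, got {len(answer)} back")
```

The knobs mentioned above are all visible here: raising STORE_PROB or SHARDS_PER_QUERY drives the miss probability down, at the cost of more storage or more per-query work, while raising NUM_SHARDS spreads the same data and load over more machines.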