Data and integrity
==================
- the vast majority of systems with reasonable data needs begin with an RDBMS: e.g. MySQL, SQL Server, Oracle, etc.
- (relatively) cheap, stable, flexible, and scalable (to a point)
- provides the ability to develop a clean, modular data storage format:
  - divide the overall system into a collection of databases
  - divide databases into tables of related information, each table defining the data fields or attributes associated with a specific logical entity, with (foreign) keys between the tables representing logical relationships between the entities
  - implement tables as a 2D scheme of columns (the data fields) and rows (the data records or entries)
  - provide keys for accessing individual entries/rows/records and foreign keys for representing the logical relationships between the tables/logical entities
  - ability to define data types, sizes/ranges, and default values for individual fields
- provides data integrity protection in the form of atomicity, transaction and error logs, backup and recovery mechanisms, etc.
- provides security mechanisms in terms of usernames, roles, passwords, encryption, and selective access to various levels of data (specific databases, tables, fields) and access types (e.g. read/update/delete/create/add), as well as abilities such as granting/denying permissions to others and the ability to monitor db access stats, logs, and performance information

Stage 1: Initial design
=======================
- look for a clean, modular, maintainable design
- identify a logical decomposition of the data system into entities (tables) and relationships (keys)
- break the tables up further to normalize the data, reducing data redundancy (hence saving space) and also vastly improving maintainability (akin to avoiding duplicate code); a small schema sketch appears after the Stage 2 list below
- identify what data processing should take place within the db, and what will be handled 'outside'
- identify and build core query types in SQL, or provide a sanitization mechanism allowing users to compose their own
- identify which fields should be indexed to allow faster queries (either for queries that are run a lot, or for queries that take a lot of processing power when they do run)

Stage 2: Tweaking
=================
- typically begins with monitoring the system:
  - watch overall utilization of your cpu, memory, bandwidth, and disk space, looking for trends that indicate when you'll need to upgrade any of them
  - observe what queries/commands actually get run on your system, how often each is run, how long they take to respond, how much data they produce, how much cpu time they consume, etc.
  - watch live performance (alerts for when cpu or bandwidth is maxed out for specific time periods)
  - watch the transaction and error logs, use other performance tools to catch the query types/data ranges that are timing out, and watch for failed/dropped connections, failed queries, attempted access violations, etc.
- simple options to deal with observed issues include (a couple are sketched below):
  - adding more indexes to speed up specific query types
  - doing some denormalization of the data (joining some tables and replicating data between them) to speed up specific query types
  - clustering data within specific tables to cut down on page faults
  - batch processing of updates/inserts/deletes, which cuts down on rebuilding of indexes
  - upgrading hardware (cpu, memory/caches, drives/caches, network)
  - adding or redesigning SQL code for specific query types
  - improving pre/post processing of data (i.e. having the apps that send the queries or process the results do more of the work)
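A minimal, purely hypothetical sketch of the Stage 1 ideas above: two normalized tables (customers and orders) linked by a foreign key, with types, sizes, and default values declared per field. The table and column names are invented for illustration, and exact DDL syntax varies slightly between MySQL, SQL Server, Oracle, etc.

    -- hypothetical schema, for illustration only
    -- each table models one logical entity; the foreign key captures the
    -- relationship "an order belongs to a customer"
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        region      VARCHAR(50)  DEFAULT 'unknown'   -- default value for a field
    );

    -- normalization: customer details live only in customers,
    -- not repeated in every order row
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL,
        order_date  DATE NOT NULL,
        total       DECIMAL(10, 2) DEFAULT 0.00,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );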
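And a sketch of two of the Stage 2 tweaks, reusing the same hypothetical tables: adding a composite index for a query type that monitoring shows is run often, and batching inserts into a single transaction so the per-commit overhead is shared across the batch rather than paid per row. The EXPLAIN keyword and multi-row VALUES syntax are not identical across vendors, so treat this as illustrative rather than definitive.

    -- suppose monitoring shows frequent, slow lookups of a customer's recent
    -- orders; a composite index on the filtered columns can speed them up
    CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

    -- most RDBMSs can show whether the planner actually uses the index
    -- (keyword and output format vary by vendor)
    EXPLAIN
    SELECT order_id, total
    FROM orders
    WHERE customer_id = 42 AND order_date >= '2024-01-01';

    -- batch writes: one transaction per batch instead of one per row, so the
    -- commit/logging cost is shared across the whole batch
    BEGIN;
    INSERT INTO orders (order_id, customer_id, order_date, total)
    VALUES (1001, 42, '2024-01-05',  19.99),
           (1002, 43, '2024-01-05',   7.50),
           (1003, 42, '2024-01-06', 120.00);
    COMMIT;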
Stage 3: Scaling
================
- when tweaking is clearly not going to be enough and data needs are outstripping the system you have in place, what are some options that don't totally abandon your design?
- mirroring your db / load balancing:
  - have one 'master', where all updates take place
  - have multiple mirrors, each identical to the master
  - each mirror handles queries that do not involve changes to the data
  - each mirror is responsible for watching the master to see when its data changes, then replicating the change in its own data ASAP
  - have a load balancer catching incoming queries and rerouting them to either the master or a mirror
- regionalizing your system:
  - somewhat similar to the mirroring system above, but perhaps no longer a single master: have multiple masters, one per region, and route users to their local master
  - if global data is needed then this requires additional code to synchronize/link the masters together
  - benefits in terms of connection times (distance) as well as load reduction/distribution
  - 'region' does not have to mean geographic; you can come up with other schemes to categorize users and assign them an appropriate master
- grossly denormalizing: redesigning both tables and databases to have entirely different systems customized for specific types of query (a rough sketch appears at the end of these notes)
  - maintain the notion of a master db, representing the 'true' internal state of the data, but now the mirrors are not identical to one another
  - mirrors each maintain (a subset of) the data in their own specialized format, and when a change takes place in the master the mirrors need to make any corresponding changes in their own tables/data/indexes, etc.
  - much more complex logic, but it allows you to build different mirrors that specialize in specific query types, letting you create much more specific SQL, indexing, etc. focused entirely on each query type
  - applications or routing systems now need to recognize the query type and route it to an appropriate mirror
- combinations of the three: you can happily mix and match, as long as you're willing to cope with the design complexity

Stage 4: Redesign
=================
- what do (did) companies like Amazon do when they realized they just couldn't scale a traditional RDBMS to do what they needed?
- much of the rest of the course will look at how the data handling drives the rest of the system design
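A rough sketch of the 'grossly denormalizing' option from Stage 3, reusing the hypothetical customers/orders tables from the earlier examples: a mirror keeps a pre-joined order_summary table, indexed purely for one query type (recent orders per region, with customer names), and refreshes it whenever the master's data changes. This is only one possible shape such a mirror could take.

    -- hypothetical denormalized mirror table, specialized so its one query
    -- type can be answered without any joins
    CREATE TABLE order_summary (
        order_id      INT PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL,   -- copied (denormalized) from customers
        region        VARCHAR(50),
        order_date    DATE NOT NULL,
        total         DECIMAL(10, 2)
    );

    -- one way to (re)build it from the normalized master tables; in practice
    -- the mirror would apply incremental changes as it observes updates on
    -- the master
    INSERT INTO order_summary (order_id, customer_name, region, order_date, total)
    SELECT o.order_id, c.name, c.region, o.order_date, o.total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;

    -- index chosen entirely for the one query type this mirror serves
    CREATE INDEX idx_summary_region_date ON order_summary (region, order_date);

Updates still go only to the master's normalized tables; the added complexity is the logic that keeps order_summary in sync when they change.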