Amazon
======

entirely linux, running Oracle
implemented in C++, perl, java, mason (?)
- C++ to handle requests, perl/mason to build pages
hundreds of services used to build each web page
world's 3 largest linux databases, over 50 terabytes capacity between them in 2005
- 28 HP servers, 4 CPUs each
also started out as a traditional rdbms, forced to go distributed to scale,
using a shared-nothing approach
small teams focused around specific services
- enthusiasm, creativity, confidence
- innovation comes from the bottom, creativity from everywhere
- developers need freedom and good tools
get big fast, the existing big guys are on your tail!
build around your own api, and make it available FREE to other developers
to encourage rapid market penetration and growth
for high scalability you must change your mindset:
"approach chaos in a probabilistic sense that things will work well"
build and rely on your own infrastructure
use your staging environment!
be ready to roll back in the event of failure

Database categories
-------------------
- general queries (15 TB '05)
- historical queries (18 TB '05)
- ETL: extract, transform, and load (5 TB '05)

Outsourcing their capacity and technology
-----------------------------------------
Sell your product lines through amazon
- on a warehouse basis: you ship goods to amazon and they handle everything
- on a marketplace basis: amazon handles the sales but you ship from your own site

Host your own amazonish ecom site
- amazon provides a site kit you customize for your site and products
- you and amazon split the revenues

Use the amazon cloud for computing capacity
- create a machine image, specifying applications, libraries, data, etc
- select server instances and os; you have root on each
- add new instances 'within minutes'
- identify security choices (firewalls, ip ranges, etc)
- pick regions
- they provide backups/monitoring

Use the amazon simple db
- no schema, non-relational db
- highly scalable, string data only
- data model is object-like: domains, items, attributes, values
- multiple geographically regionalized sites
- concerns:
  - relaxed consistency constraints: a read immediately after a write might
    not give updated results
  - if you don't control your database, how do you outperform your competition?
  - attributes limited to 1024 bytes, strings only, which means integers must
    be treated as strings (e.g. left-pad with 0s to a common width so they
    sort correctly; a sketch follows this section)
  - amazon has all your data: security? privacy?
  - what if you want to switch later - what happens with/to your data?
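A minimal sketch of the zero-padding workaround mentioned above, in Python
(chosen only for brevity; amazon's own stack is C++/perl/java). The 10-digit
width and the function names are illustrative assumptions, not part of any
SimpleDB client library: because attribute values compare as strings,
left-padding integers makes lexicographic order agree with numeric order.

    # Encode integers as fixed-width strings so that string comparisons
    # (as SimpleDB performs them) agree with numeric ordering.
    WIDTH = 10  # arbitrary illustrative choice; pick a width large enough for your data

    def encode_int(n: int) -> str:
        """Left-pad a non-negative integer with zeros to a fixed width."""
        if n < 0:
            raise ValueError("negative values need an offset or sign encoding")
        return str(n).zfill(WIDTH)

    def decode_int(s: str) -> int:
        return int(s)

    # Without padding, string order is wrong: "9" > "10".
    # With padding, "0000000009" < "0000000010", matching numeric order.
    values = [9, 10, 2, 100]
    encoded = sorted(encode_int(v) for v in values)
    assert [decode_int(s) for s in encoded] == sorted(values)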
Use the amazon s3 simple storage system
- for storing and retrieving any number of items, each up to 5 TB in size
- locations can be regionalized, bittorrent or http retrieval
- good for media distribution, data warehousing, etc

Driven by the amazon dynamo system
----------------------------------
relatively lightweight business logic, with most data accesses primary key
based, enabling a different db form (not an RDBMS): a highly available
key-value store
a single simple interface to dynamo
- amazon e-sites consist of hundreds of interacting services, each of which
  uses that same interface to dynamo
highly decentralized, loosely coupled architecture; changes are allowed to
percolate to replicas in the background
focuses on always writeable - you write data anytime, and do your consistency
checks during read operations, treating the data as unavailable until you can
get the consistency desired
- e.g. you should always be able to change your shopping cart contents even
  if part of the network is experiencing failures, attacks, congestion, etc,
  so long as you can reach part of it
searching and shopping carts should continue to function even in the presence
of tornadoes etc, so data needs to exist in multiple dispersed data stores
many services in the sales system need only primary key access to tables,
not full-blown RDBMS query capability
each service has latency and throughput constraints that it must meet,
sacrificing efficiency/throughput if needed (service level agreements, SLAs)
- client and server agree on a request rate distribution and the expected
  service latency for that distribution, e.g. response times for a service
  must be under 300ms for 99.9% of requests when load is 500 requests per second
- often over 150 services are involved in handling a client request
data is partitioned and replicated using consistent hashing
consistency is maintained by a quorum-like technique with a decentralized
replica synchronization protocol, supported by object versioning
fault tolerance by a gossip technique, with little manual administration of
the system
internally the system is assumed to be secure and accessed only through the
API, therefore minimal authentication and authorization operations internally
conflict resolution is handled at the application level, allowing it to decide
how to handle disagreements - assign a total ordering to updates and handle
them in that order
incremental scaling - allowing you to add one more node at a time 'indefinitely'
favour peer-to-peer over central organization, with locally stored routing
information and routing algorithms designed to reduce the number of hops

Problems and the techniques that handle them
- partitioning: handled by consistent hashing, allowing incremental
  scalability (a toy sketch follows at the end of these notes)
- write availability: handled by vector clocks with reconciliation during
  reads, separating update rates from data set size (also sketched at the end)
- temporary failures: handled by sloppy quorums and hinted handoff, giving
  high availability and durability
- failure detection: by gossip, avoiding a centralized system
- divergent replicas: synchronized in the background

The data
- get(key), put(key, value)
- the output range of a hash function is treated as a circular ring, used to
  distribute data across storage hosts
- each item is uniquely identified by a key
- state is stored as blobs (binary objects)
- no operations span multiple items
- relaxes consistency constraints to achieve greater scalability while
  maintaining availability

The architecture
- client requests
- page rendering components
- request routing
- aggregator services
- request routing
- services
- dynamo instances, amazon s3, other datastores
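A toy sketch of the consistent-hashing ring described above, in Python. The
node names, the choice of MD5 as the hash, and the replica count of 3 are
illustrative assumptions; the real dynamo additionally uses virtual nodes,
which this sketch omits.

    import hashlib
    from bisect import bisect_right

    def ring_position(name: str) -> int:
        # Map a node name or key onto the hash ring (treated as a circle).
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((ring_position(n), n) for n in nodes)

        def preference_list(self, key: str):
            """First `replicas` nodes clockwise from the key's ring position."""
            positions = [p for p, _ in self.ring]
            start = bisect_right(positions, ring_position(key)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1]
                    for i in range(min(self.replicas, len(self.ring)))]

    ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
    print(ring.preference_list("cart:12345"))  # e.g. ['nodeC', 'nodeD', 'nodeA']

Adding one more node only shifts the keys between that node and its ring
neighbours, which is what makes the one-node-at-a-time incremental scaling
above possible.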
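A minimal sketch of vector-clock reconciliation during reads, as the notes
describe it (not dynamo's actual code). Each write is tagged with per-node
counters; at read time a version dominated by another is dropped, and
concurrent versions are all returned for the application to merge (e.g. by
taking the union of cart contents). All names and values are illustrative.

    # A clock `a` descends from `b` if every counter in b is <= the matching one in a.
    def descends(a: dict, b: dict) -> bool:
        return all(a.get(node, 0) >= count for node, count in b.items())

    def reconcile(versions):
        """Drop versions dominated by another; concurrent versions all survive
        and must be merged by the application."""
        survivors = []
        for clock, value in versions:
            if not any(descends(other, clock) and other != clock
                       for other, _ in versions):
                survivors.append((clock, value))
        return survivors

    v1 = ({"nodeA": 1}, {"cart": ["book"]})
    v2 = ({"nodeA": 1, "nodeB": 1}, {"cart": ["book", "dvd"]})  # extends v1
    v3 = ({"nodeA": 2}, {"cart": ["book", "cd"]})               # concurrent with v2
    print(reconcile([v1, v2, v3]))  # v1 dropped; v2 and v3 both returned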