Amazon
======

entirely linux, running Oracle
implemented in C++, perl, java, mason (?)
- C++ to handle requests, perl/mason to build pages
hundreds of services used to build each web page
world's 3 largest linux databases, over 50 terabytes capacity between them in 2005
- 28 HP servers, 4 CPUs each
also started out as a traditional rdbms, forced to go distributed to scale,
using a shared-nothing approach
small teams focused around specific services
- enthusiasm, creativity, confidence
- innovation comes from the bottom, creativity from everywhere
- developers need freedom and good tools
get big fast, the existing big guys are on your tail!
build around your own api, and make it available FREE to other developers
to encourage rapid market penetration and growth
for high scalability you must change your mindset:
"approach chaos in a probabilistic sense that things will work well"
build and rely on your own infrastructure
use your staging environment!
be ready to roll back in the event of failure

Database categories
-------------------
- general queries (15 TB '05)
- historical queries (18 TB '05)
- ETL: extract, transform, and load (5 TB '05)

Outsourcing their capacity and technology
-----------------------------------------
Sell your product lines through amazon
- on a warehouse basis: you ship goods to amazon and they handle everything
- on a marketplace basis: amazon handles the sales but you ship from your own site

Host your own amazonish ecom site
- amazon provides a site kit you customize for your site and products
- you and amazon split the revenues

Use the amazon cloud for computing capacity
- create a machine image, specifying applications, libraries, data, etc
- select server instances and os; you have root on each
- add new instances 'within minutes'
- identify security choices (firewalls, ip ranges, etc)
- pick regions
- they provide backups/monitoring

Use the amazon simple db
- no schema, non-relational db
- highly scalable, string data only
- data model is object-like: domains, items, attributes, values
- multiple geographically regionalized sites
- concerns:
  - relaxed consistency constraints: a read immediately after a write might
    not give updated results
  - if you don't control your database, how do you outperform your competition?
  - attributes limited to 1024 bytes, strings only, which means integers must
    be treated as strings (e.g. left-pad with 0s to a common width so they
    sort correctly; a sketch follows this section)
  - amazon has all your data: security? privacy?
  - what if you want to switch later - what happens with/to your data?
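A minimal sketch of the zero-padding workaround mentioned above, in Python
(chosen only for brevity; amazon's own stack is C++/perl/java). The 10-digit
width and the function names are illustrative assumptions, not part of any
SimpleDB client library: because attribute values compare as strings,
left-padding integers makes lexicographic order agree with numeric order.

    # Encode integers as fixed-width strings so that string comparisons
    # (as SimpleDB performs them) agree with numeric ordering.
    WIDTH = 10  # arbitrary illustrative choice; pick a width large enough for your data

    def encode_int(n: int) -> str:
        """Left-pad a non-negative integer with zeros to a fixed width."""
        if n < 0:
            raise ValueError("negative values need an offset or sign encoding")
        return str(n).zfill(WIDTH)

    def decode_int(s: str) -> int:
        return int(s)

    # Without padding, string order is wrong: "9" > "10".
    # With padding, "0000000009" < "0000000010", matching numeric order.
    values = [9, 10, 2, 100]
    encoded = sorted(encode_int(v) for v in values)
    assert [decode_int(s) for s in encoded] == sorted(values)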
Use the amazon s3 simple storage system
- for storing and retrieving any number of items, each up to 5 TB in size
- locations can be regionalized, bittorrent or http retrieval
- good for media distribution, data warehousing, etc

Driven by the amazon dynamo system
----------------------------------
relatively lightweight business logic, with most data accesses primary key
based, enabling a different db form (not an RDBMS): a highly available
key-value store
a single simple interface to dynamo
- amazon e-sites consist of hundreds of interacting services, each of which
  uses that same interface to dynamo
highly decentralized, loosely coupled architecture; changes are allowed to
percolate to replicas in the background
focuses on always writeable - you write data anytime, and do your consistency
checks during read operations, treating the data as unavailable until you can
get the consistency desired
- e.g. you should always be able to change your shopping cart contents even
  if part of the network is experiencing failures, attacks, congestion, etc,
  so long as you can reach part of it
searching and shopping carts should continue to function even in the presence
of tornadoes etc, so data needs to exist in multiple dispersed data stores
many services in the sales system need only primary key access to tables,
not full-blown RDBMS query capability
each service has latency and throughput constraints that it must meet,
sacrificing efficiency/throughput if needed (service level agreements, SLAs)
- client and server agree on a request rate distribution and the expected
  service latency for that distribution, e.g. response times for a service
  must be under 300ms for 99.9% of requests when load is 500 requests per second
- often over 150 services are involved in handling a client request
data is partitioned and replicated using consistent hashing
consistency is maintained by a quorum-like technique with a decentralized
replica synchronization protocol, supported by object versioning
fault tolerance by a gossip technique, with little manual administration of
the system
internally the system is assumed to be secure and accessed only through the
API, therefore minimal authentication and authorization operations internally
conflict resolution is handled at the application level, allowing it to decide
how to handle disagreements - assign a total ordering to updates and handle
them in that order
incremental scaling - allowing you to add one more node at a time 'indefinitely'
favour peer-to-peer over central organization, with locally stored routing
information and routing algorithms designed to reduce the number of hops

Problems and the techniques that handle them
- partitioning: handled by consistent hashing, allowing incremental
  scalability (a toy sketch follows at the end of these notes)
- write availability: handled by vector clocks with reconciliation during
  reads, separating update rates from data set size (also sketched at the end)
- temporary failures: handled by sloppy quorums and hinted handoff, giving
  high availability and durability
- failure detection: by gossip, avoiding a centralized system
- divergent replicas: synchronized in the background

The data
- get(key), put(key, value)
- the output range of a hash function is treated as a circular ring, used to
  distribute data across storage hosts
- each item is uniquely identified by a key
- state is stored as blobs (binary objects)
- no operations span multiple items
- relaxes consistency constraints to achieve greater scalability while
  maintaining availability

The architecture
- client requests
- page rendering components
- request routing
- aggregator services
- request routing
- services
- dynamo instances, amazon s3, other datastores
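A toy sketch of the consistent-hashing ring described above, in Python. The
node names, the choice of MD5 as the hash, and the replica count of 3 are
illustrative assumptions; the real dynamo additionally uses virtual nodes,
which this sketch omits.

    import hashlib
    from bisect import bisect_right

    def ring_position(name: str) -> int:
        # Map a node name or key onto the hash ring (treated as a circle).
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((ring_position(n), n) for n in nodes)

        def preference_list(self, key: str):
            """First `replicas` nodes clockwise from the key's ring position."""
            positions = [p for p, _ in self.ring]
            start = bisect_right(positions, ring_position(key)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1]
                    for i in range(min(self.replicas, len(self.ring)))]

    ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
    print(ring.preference_list("cart:12345"))  # e.g. ['nodeC', 'nodeD', 'nodeA']

Adding one more node only shifts the keys between that node and its ring
neighbours, which is what makes the one-node-at-a-time incremental scaling
above possible.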
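A minimal sketch of vector-clock reconciliation during reads, as the notes
describe it (not dynamo's actual code). Each write is tagged with per-node
counters; at read time a version dominated by another is dropped, and
concurrent versions are all returned for the application to merge (e.g. by
taking the union of cart contents). All names and values are illustrative.

    # A clock `a` descends from `b` if every counter in b is <= the matching one in a.
    def descends(a: dict, b: dict) -> bool:
        return all(a.get(node, 0) >= count for node, count in b.items())

    def reconcile(versions):
        """Drop versions dominated by another; concurrent versions all survive
        and must be merged by the application."""
        survivors = []
        for clock, value in versions:
            if not any(descends(other, clock) and other != clock
                       for other, _ in versions):
                survivors.append((clock, value))
        return survivors

    v1 = ({"nodeA": 1}, {"cart": ["book"]})
    v2 = ({"nodeA": 1, "nodeB": 1}, {"cart": ["book", "dvd"]})  # extends v1
    v3 = ({"nodeA": 2}, {"cart": ["book", "cd"]})               # concurrent with v2
    print(reconcile([v1, v2, v3]))  # v1 dropped; v2 and v3 both returned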