Data and integrity
==================
- the vast majority of systems with reasonable data needs begin with an RDBMS: e.g. MySQL, SQL Server, Oracle, etc.
- (relatively) cheap, stable, flexible, and scalable (to a point)
- provides the ability to develop a clean, modular data storage format:
  - divide the overall system into a collection of databases
  - divide databases into tables of related information, each table defining the data fields or attributes associated with a specific logical entity, with (foreign) keys between the tables representing logical relationships between the entities
  - implement tables as a 2D scheme of columns (the data fields) and rows (the data records or entries)
  - provide keys for accessing individual entries/rows/records and foreign keys for representing the logical relationships between the tables/logical entities
  - ability to define data types, sizes/ranges, and default values for individual fields
- provides data integrity protection in the form of atomicity, transaction and error logs, backup and recovery mechanisms, etc.
- provides security mechanisms in terms of usernames, roles, passwords, encryption, and selective access to various levels of data (specific databases, tables, fields) and access types (e.g. read/update/delete/create/add), as well as abilities such as granting/denying permissions to others and the ability to monitor db access stats, logs, and performance information

Stage 1: Initial design
=======================
- look for a clean, modular, maintainable design
- identify a logical decomposition of the data system into entities (tables) and relationships (keys)
- break the tables up further to normalize the data, reducing data redundancy (hence saving space) and also vastly improving maintainability (akin to avoiding duplicate code); a small schema sketch appears after the Stage 2 list below
- identify what data processing should take place within the db, and what will be handled 'outside'
- identify and build core query types in SQL, or provide a sanitization mechanism allowing users to compose their own
- identify which fields should be indexed to allow faster queries (either for queries that are run a lot, or for queries that take a lot of processing power when they do run)

Stage 2: Tweaking
=================
- typically begins with monitoring the system:
  - watch overall utilization of your cpu, memory, bandwidth, and disk space, looking for trends that indicate when you'll need to upgrade any of them
  - observe what queries/commands actually get run on your system, how often each is run, how long they take to respond, how much data they produce, how much cpu time they consume, etc.
  - watch live performance (alerts for when cpu or bandwidth is maxed out for specific time periods)
  - watch the transaction and error logs, use other performance tools to catch the query types/data ranges that are timing out, and watch for failed/dropped connections, failed queries, attempted access violations, etc.
- simple options to deal with observed issues include (a couple are sketched below):
  - adding more indexes to speed up specific query types
  - doing some denormalization of the data (joining some tables and replicating data between them) to speed up specific query types
  - clustering data within specific tables to cut down on page faults
  - batch processing of updates/inserts/deletes, which cuts down on rebuilding of indexes
  - upgrading hardware (cpu, memory/caches, drives/caches, network)
  - adding or redesigning SQL code for specific query types
  - improving pre/post processing of data (i.e. having the apps that send the queries or process the results do more of the work)
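A minimal, purely hypothetical sketch of the Stage 1 ideas above: two normalized tables (customers and orders) linked by a foreign key, with types, sizes, and default values declared per field. The table and column names are invented for illustration, and exact DDL syntax varies slightly between MySQL, SQL Server, Oracle, etc.

    -- hypothetical schema, for illustration only
    -- each table models one logical entity; the foreign key captures the
    -- relationship "an order belongs to a customer"
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        region      VARCHAR(50)  DEFAULT 'unknown'   -- default value for a field
    );

    -- normalization: customer details live only in customers,
    -- not repeated in every order row
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL,
        order_date  DATE NOT NULL,
        total       DECIMAL(10, 2) DEFAULT 0.00,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );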
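And a sketch of two of the Stage 2 tweaks, reusing the same hypothetical tables: adding a composite index for a query type that monitoring shows is run often, and batching inserts into a single transaction so the per-commit overhead is shared across the batch rather than paid per row. The EXPLAIN keyword and multi-row VALUES syntax are not identical across vendors, so treat this as illustrative rather than definitive.

    -- suppose monitoring shows frequent, slow lookups of a customer's recent
    -- orders; a composite index on the filtered columns can speed them up
    CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

    -- most RDBMSs can show whether the planner actually uses the index
    -- (keyword and output format vary by vendor)
    EXPLAIN
    SELECT order_id, total
    FROM orders
    WHERE customer_id = 42 AND order_date >= '2024-01-01';

    -- batch writes: one transaction per batch instead of one per row, so the
    -- commit/logging cost is shared across the whole batch
    BEGIN;
    INSERT INTO orders (order_id, customer_id, order_date, total)
    VALUES (1001, 42, '2024-01-05',  19.99),
           (1002, 43, '2024-01-05',   7.50),
           (1003, 42, '2024-01-06', 120.00);
    COMMIT;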
Stage 3: Scaling
================
- when tweaking is clearly not going to be enough and data needs are outstripping the system you have in place, what are some options that don't totally abandon your design?
- mirroring your db / load balancing:
  - have one 'master', where all updates take place
  - have multiple mirrors, each identical to the master
  - each mirror handles queries that do not involve changes to the data
  - each mirror is responsible for watching the master to see when its data changes, then replicating the change in its own data ASAP
  - have a load balancer catching incoming queries and rerouting them to either the master or a mirror
- regionalizing your system:
  - somewhat similar to the mirroring system above, but perhaps no longer a single master: have multiple masters, one per region, and route users to their local master
  - if global data is needed then this requires additional code to synchronize/link the masters together
  - benefits in terms of connection times (distance) as well as load reduction/distribution
  - 'region' does not have to mean geographic; you can come up with other schemes to categorize users and assign them an appropriate master
- grossly denormalizing: redesigning both tables and databases to have entirely different systems customized for specific types of query (a rough sketch appears at the end of these notes)
  - maintain the notion of a master db, representing the 'true' internal state of the data, but now the mirrors are not identical to one another
  - mirrors each maintain (a subset of) the data in their own specialized format, and when a change takes place in the master the mirrors need to make any corresponding changes in their own tables/data/indexes, etc.
  - much more complex logic, but it allows you to build different mirrors that specialize in specific query types, letting you create much more specific SQL, indexing, etc. focused entirely on each query type
  - applications or routing systems now need to recognize the query type and route it to an appropriate mirror
- combinations of the three: you can happily mix and match, as long as you're willing to cope with the design complexity

Stage 4: Redesign
=================
- what do (did) companies like Amazon do when they realized they just couldn't scale a traditional RDBMS to do what they needed?
- much of the rest of the course will look at how the data handling drives the rest of the system design
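A rough sketch of the 'grossly denormalizing' option from Stage 3, reusing the hypothetical customers/orders tables from the earlier examples: a mirror keeps a pre-joined order_summary table, indexed purely for one query type (recent orders per region, with customer names), and refreshes it whenever the master's data changes. This is only one possible shape such a mirror could take.

    -- hypothetical denormalized mirror table, specialized so its one query
    -- type can be answered without any joins
    CREATE TABLE order_summary (
        order_id      INT PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL,   -- copied (denormalized) from customers
        region        VARCHAR(50),
        order_date    DATE NOT NULL,
        total         DECIMAL(10, 2)
    );

    -- one way to (re)build it from the normalized master tables; in practice
    -- the mirror would apply incremental changes as it observes updates on
    -- the master
    INSERT INTO order_summary (order_id, customer_name, region, order_date, total)
    SELECT o.order_id, c.name, c.region, o.order_date, o.total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;

    -- index chosen entirely for the one query type this mirror serves
    CREATE INDEX idx_summary_region_date ON order_summary (region, order_date);

Updates still go only to the master's normalized tables; the added complexity is the logic that keeps order_summary in sync when they change.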