CSC 375 Notes: Data modeling

Data modeling

The process model, through the use of DFDs, showed the flow of data within a process, but it told us few specifics about the nature of the data itself.

Typically, we supplement the process model with a data model - something geared to very precisely describe all the data elements in a system and the relationships between them.

For modeling the data, we use an entity relation diagram, or ERD. As with the DFDs, there is generally a logical model, produced during the analysis phase, and a physical model produced during the design phase.

In creating an ERD, we first group all the information in the system into logical units. For example, if the system was for booking appointments at an auto repair shop, we might decide that the logical units are

appointment: covering the date and time of the appointment, and possibly the scheduled mechanic
customer: covering the client's contact information and billing history with the shop
vehicle: covering the vehicle make, model, service history, etc

In doing this, we are creating an entity name for the logical group of data, and a list of attributes associated with the entity.

We can then show relationships between different entities by drawing labelled lines between them.

Entity

An entity represents some thing in our system, and is shown using an (uppercase) identifier within a rectangle.

We usually subdivide the entities into three subgroups: the associative entities, independent entities, and dependent entities.

Independent entities can exist on their own: e.g. a client entity can exist independent of the other entities we create.

Associative entities typically exist to store information about many-to-many relationships between two other entities. For example, suppose we have an entity to represent the actual problem in a car, another entity to represent a symptom, and we wish to also represent the severity of a symptom. The severity can't really be made an attribute of symptoms, since it's different depending on which problem is causing the symptom. It also can't be made an attribute of the problem, since then all the problem symptoms would have the same severity ... The solution is to make the severity an entity on its own, one which is "attached" to the relationship between the problem and the symptom:

            +--------+
            |SEVERITY|
            +--------+
                |
    +-------+   |     +-------+
    |PROBLEM|---+-----|SYMPTOM|
    +-------+         +-------+

Dependent entities cannot exist without certain other entities. Many associative entities fal into this category: the idea of severity makes no sense if not related to a symptom.

Attribute

An attribute is a property of an entity. The key point with attributes is to only break them down to their "most useful level of detail" - i.e. the finest grain at which their values actually make a difference to system behaviour.

The attribues are usually listed within the entity block, e.g.

     +-----------+
     |APPOINTMENT|
     +-----------+
     | start time|
     | duration  |
     | location  |
     +-----------+

Relationship

A relationship shows the association between two entities, and is indicated using a line between two entities, with a label to indicate the nature of the relationship.

For example, suppose we have a car dealership, and we include one entity called VENDOR and another called CAR:

    +------+
    |VENDOR|
    +------+
       |
       | sells
       |
     +---+
     |CAR|
     +---+

Relationships have cardinality, which may be one-to-one, one-to-many, or many-to-many, indicating how many different types of entity may hold the relationship to other entities. For example, one mother may have many children, but each child has only one mother. The one-to-many relationship is indicated as shown below:

         +------+
         |mother|
         +------+
            |       
            |
           /|\
         +-----+
         |child|
         +-----+

Observe that a single line is used to indicate one, while a "fork" is used to indicate many.

Relationships also have modality, indicating the minimum number of relationships that must exist for the data to be valid. For example, in our auto-repair case we don't include people who have never had an appointment with our shop, i.e. to be a valid client you must have had at least one appointment -- therefore the modality is 1. Typically we are most interested in tracking whether the modality is 0 or 1 (i.e. does the relationship have to hold in at least one case for the data to be valid?). The modality is usually represented by annotating the relationship with a 0 or 1.

Data Design Questions

Some of the questions you may ask yourself when trying to create a data model include:

What types of people, places, things, materials, events, etc. does your business need to maintain information about?
For each type of thing you identified in question 1, what specific items of information does your business need to keep track of? (For instance, you may need to keep track of your employees' S.I.N. numbers, but you probably don't need to keep track of non-work-related facts about them.)
When your business needs to keep track of information on different items of the same type (e.g. different customers, different salespeople, etc), what unique characteristic or combination of characteristics allows you to identify them individually (e.g. an employee number, a customer id, a customer's name+address, etc)?
Where/when is the data created (from the point of view of your business)?
When/where/how is the data used?
How are the various types of item related to one another (from the perspective of your business)?
Are there special cases with any of the data items (either in the treatment, generation, or use of the data)?

(Note: I've expressed these questions in the context of a business system, but similar ideas hold for whatever kind of system you are trying to model.)

Building ERDs

The ERDs are typically established by going through the process model to identify all the relevant entities.

Once the entities have been established we can go through and attach appropriate attributes, and identify relationships between the entities.

This is very much an iterative process: refining our description of exactly what the data elements are, what properties they have, and how they are related.

Some common guidelines when building your ERDs include

entities should have multiple occurrences
avoid including attributes the system does not use
clearly label all entities and relationships (including cardinality and modality)
clearly state any assumptions

The Data Dictionary

Once again, the ERD is a tool to clarify the structure and relationships of data within the system, but the diagram by itself does not convey everything about the data.

We need to include supporting text (often using standard forms for each entity and/or relationship) describing all the relevant details.

It is common practice to supplement your ERDs with a full data dictionary, which is essentially a lookup table with precise information on each data item in the system: its name, where it is produced, where it is used, relationships to other data items, a clear and intuitive text description, and a full set of cross references to appropriate DFDs and ERDs.

Example ERD

The (very simple) ERD below considers the logical data necessary in a prescription billing system. It assumes there are five logical data entities we are interested in: doctors, patients, insurers, treatments, and prescriptions.

In the example shown, each of the relationships is one-to-many (indicated by the 1 or N on the connections representing relationships).

Note from Dave:

the fields below are pretty sparsely described - this might be the ERD after a "first pass". On subsequent passes the descriptions would become much more detailed.

In terms of supporting text for such an ERD, we might include something like:

Overview:

Our system, designed to maintain prescription billing information, models the interaction of patients with their doctors and insurers, and the associated treatments and prescriptions.

The diagram above provides an intuitive view of the model, with further details provided in the descriptions below. Additional information on the data involved can be found in the Data Dictionary (include page number or link).

Entities:

Our system includes five data entities, representing the doctors, their patients, the insurers of the patients, the patients' treatments, and the associated prescriptions. Each entity is described below:

Doctor

ID: ERDEnt001 Description: Doctors are provincially registered, practicing physicians, and by the time of treatment each patient must have one doctor as their named physician. Each doctor can have one or more patients.

Note/oversimplificiation #1: this doesn't handle cases where patients see multiple specialists and get prescriptions from each

Key attribute: Physician's Provincial Registered ID Other attributes: Physician name, address, phone, email, specializations Relationships: Doctor treats Patient Attribute Constraints: Physician's Provincial Registered ID: a 16-digit number Physician's name: a text string Physician's address: a street address, city, country, and postal/zip code Physician's phone number: a country code, area code, local number, and extension Physician's specializations: a text string

Insurer

ID: ERDEnt002 Description: The Insurer is any one of a number of insurance agencies providing patients with cost coverage for medical treatments. The system assumes each patient has one such covering agency, and agencies may provide coverage for many patients.

Note/oversimplificiation #2: this doesn't handle cases where patients have no coverage, or have coverage under multiple plans

Key attribute: Insurance Carrier ID Other attributes: Carrier name, address, phone number Relationships: Insurer insures Patient Attribute Constraints: Insurance Carrier ID: a 32 character alphanumeric string Carrier Name: a text string Carrier Address: a street address, city, country, and postal/zip code Carrier Phone Number: a country code, area code, local number, and extension

Patient

ID: ERDEnt003

Description: The patient is an individual obtaining treatment (involving a prescription) from a Doctor,
             where a portion of the cost may be covered by their Insurer.  Each patient is assumed to
             have one Insurer and one Doctor.

Key attribute: Patient Medical ID

Other attributes: Patient name, address, phone, date, insurance plan type

Relationships: Doctor treats Patient Insurer
               Insurer insures Patient
               Treatment is given to Patient

Attribute Constraints:
        Patient Medical ID: a 10-digit number
        Patient Name: a text string
        Patient Address: a street address, city, country, and postal/zip code
        Patient Phone Number: a country code, area code, local number, and extension
        Date: Month, day, and year (mm/dd/yyyy)
        Insurance Plan Type: a text string (name), plus a 3-digit carrier identifier, 
                             7-digit group identifier, and 6-digit client identifier

Treatment

ID: ERDEnt004

Description: A Treatment is one course of action chosen by a Doctor for a Patient in handling a 
             particular medical issue.  A Patient may be undergoing multiple treatments for different 
             medical conditions, and each treatment may be associated with one or more prescriptions.

Key attribute: Treatment Code ID

Other attributes: Patient name, product name, date, symptoms, description

Relationships: Treatment is given to Patient

Attribute Constraints:
        Left as an exercise for the reader

Prescription

ID: ERDEnt005

Description:  A Prescription is a single script prescribed by a Doctor for a Patient in
              the process of a specific Treatment.

Key attribute: Product ID Code

Other attributes: Product name, strength, dosage, manufacturer, amount

Relationships: Treatment includes Prescription

Attribute Constraints:
        Left as an exercise for the reader

Relationships:

Our system includes four relationships: a doctor treats a patient, a patient is insured by an insurer, a treatment is given to a patient, and a treatment includes a prescription. Each relationship is described below:

Treats

ID: ERDRel001

Description: Each Doctor treates one or more Patients, each Patient is treated by exactly
             one Doctor.

Entities: Doctor (treats) Patient

Cardinality: One Doctor to one or more Patients

Insures

ID: ERDRel002

Description: Each Insurer insures one or more Patients, each Patient is insured by exactly
             one Insurer.

Entities: Insurer (insures) Patient

Cardinality: One Insurer to one or more Patients

Is given to

ID: ERDRel003

Description: each Treatment is given to one Patient, a Patient may be given one or more
             Treatments.

Entities: Treatment (is given to) Patient

Cardinality: One or more Treatments to one Patient

Includes

ID: ERDRel004

Description: Each Treatment includes one or more Prescriptions, each Prescription is
             included in exactly one Treatment.

Entities: Treatment (includes) Prescription

Cardinality: One Treatment includes one or more Prescriptions

Example ERD

As a more complex example, suppose we were to consider an ERD for campus timetabling and room booking along the following lines:

The college has multiple rooms, faculty members, students, and approved courses.
Each approved course has a unique course name and number, a fixed number of lecture/seminar/lab hours, a credit value, a set of zero or more pre-requisite and co-requisite courses, and a set of room requirements (e.g. does it require specific types of rooms for the labs, lectures, or seminars).
Each semester the college has multiple course offerings, each of which has one or more sections.
Each section has a room, an instructor, and an enrollment limit.
Each room has a type (e.g. lecture room, general computer lab, etc) and a capacity (number of students it can accomodate).
A room can be used for multiple lecture sessions of a course offering at the same time if all are taught by the same instructor, otherwise a room cannot be used for different sections (of any courses) at the same time.
An instructor may teach multiple sections in multiple courses, but cannot teach in two different rooms at the same time.
A student can enroll in multiple course offerings, but only one section of each course and only if they have the pre/co-requisites, and students need to be warned if they have registered in sections with conflicting times.

The system we plan on developing will take care of timetabling all the courses and selecting all the rooms each year after a list of course offerings, sections, and instructors is determined.

For the ERD, we are simply concerned with how to effectively model all the information listed above, so the rest of our system can use it efficiently.