Skip to content

3. Data Modeling

Switch to Zen Mode

Data modeling describes how information from a real-world domain is structured so that it can be reliably stored and analyzed in a database. The goal is a clear, consistent structure that enables correct analyses and remains maintainable in the long term.

The individual steps of data modeling are:

  1. Understand the problem: Collect goals, questions, and required information (e.g., for a school library: books, loans, persons). There is a discipline called Requirements Engineering (requirements analysis) for this. The result is a requirements specification or functional specification.

  2. Identify entities: Things/objects with meaning (Book, Student, Teacher). Watch out for homonyms (e.g., “Class” as school class, programming class, or classroom).

  3. Define attributes: Properties of entities (Title, ISBN, Name, Class).

  4. Define relationships and cardinalities: How entities are related (one book — many loans; one loan — exactly one book, exactly one person).

  5. Choose keys: Primary keys must be unique, foreign keys link tables. What makes individual entities uniquely identifiable?

  6. Draw the ER model: Visualize entities, attributes, relationships.

  7. Normalize: Avoid redundancy and anomalies (1NF: atomic values; 2NF/3NF: separate dependencies).

  8. Specify data types and constraints: e.g., NOT NULL, UNIQUE, CHECK.

  9. Derive relational schema: Create table structure from the ER model.

  10. Implement and test: Execute SQL-DDL, insert sample data, check typical queries.

  11. Iteratively improve: Use feedback, adapt model to new requirements — and start over.

When creating a data model, not every single detail of reality is adopted 1:1. Instead, information is organized so that it is meaningful and manageable for a system. Objects from the real world are examined, their properties collected, and then structured. This step is called abstraction.

  • Object: a concrete, individual thing from the real world. Example: A specific car with the license plate XYZ-123, a specific student, etc.

  • Class: a set of similar objects with common characteristics.

    Example: The class “Car” includes all registered cars; the class “Student” includes all individual students.

A class therefore describes what properties its objects have (e.g., manufacturer, model, year of manufacture or first name, last name, date of birth), while an object has concrete values for these properties (e.g., “VW”, “Golf”, “2019” or “Kevin”, “Maier”, “2005-05-15”).

Similar objects with common properties are grouped into classes. Which things in your analysis are the same or very, very similar?

Classification: Michi and Mike are instances of the class 'Dog'.
Fig. 6.1: Michi and Mike are individual representatives or 'instances' of the class 'Dog'.

Example: From individual animals, the class “Mammal” is formed. From individual books, the class “Book” emerges.

A new class is composed of other classes. An object can therefore have parts that are themselves objects of other classes. Can you define “X consists of…”?

Aggregation: Several parts can be identified in a dog.
Fig. 6.2: Several parts can be identified in a dog. It 'consists of' various body parts such as nose, eyes, and ears.

Example: The class “Car” consists of, among other things, engine, body, and wheels. A “Computer” aggregates CPU, memory, and hard drive.

An is-a relationship is established between classes. A more general class summarizes properties that all associated more specific classes share. Example: “Animal” generalizes “Bird”, “Reptile”, and “Mammal”. The shared properties (e.g., “living being”, “breathing”) are in “Animal”. “Bird” adds special characteristics (e.g., “wings”).

Inheritance: Properties of the general class are inherited by the specialized classes, e.g., everything a “Mammal” can do, a “Dog” or “Cat” can also do.

Generalization: 'Mammal' is a generalization of 'Dog'.
Fig. 6.3: 'Mammal' is a generalization of 'Dog'.

Generalization can also span many levels: “Animal” to “Mammal” to “Dog” to “German Shepherd”. The objects at each level become increasingly “precise” (more specialized). While “Animal” is still fairly general and “Mammal” is more specific, “Dog” is more precise still, and “German Shepherd” is very specific.

Generalization: Inferring from individual classes to a general parent class, e.g., from “Dog” and “Cat” to “Mammal”. Specialization: Inferring from a general class to more detailed sub-classes, e.g., from “Mammal” to “Dog” and “Cat”.

Objects or classes can stand in relationship to each other. These relationships can occur between two or more classes and additionally have roles, direction, and cardinalities.

Example: “Student” — “attends” — “School” (many students attend one school; one school has many students). Example: “Customer places Order”, “Teacher teaches Course”.

Associations are usually verbs: “attends”, “orders”, “plays”, “buys”, “receives”, “names”, “flies”, …

Association: Dog 'Mike' barks at cat 'Ferdinand'.
Fig. 6.4: Dog 'Mike' barks at cat 'Ferdinand'.

In fact, generalization is also a special type of association: “is” — and aggregation too: “has”.

So that an object can be uniquely recognized, certain property values are defined as keys.

Identification: Dog 'Mike' is chipped with a unique ID.
Fig. 6.5: Dog 'Mike' is chipped with a unique ID.

Example: ISBN identifies a book uniquely; the abbreviation identifies a person within a school; for cars, the Vehicle Identification Number (VIN) or the license plate can serve this purpose.

  • Classification groups similar things into classes.
  • Aggregation describes “consists of” structures.
  • Generalization/Specialization organizes commonalities and differences via inheritance.
  • Association connects classes through relationships.
  • Identification ensures uniqueness through keys.

These concepts help map complex realities clearly, reusably, and with fewer errors in data models.

The Entity-Relationship Model (ER model, ERM) is a widely used graphical tool for database design. It is also used in other areas of computer science to represent parts of the real world. The ERM is independent of a concrete data model and is not bound to technical limitations of a specific implementation. It was developed in 1976 by Peter Chen. With the ER model, the conceptual designs of a database can be represented clearly; a frequently used notation style is named after its originator “Chen Notation”. However, there are also other notations, e.g., the “Crow’s Foot Notation”.

The basic building blocks of the ER model are:

  • Entities — the things or objects about which information is stored.
  • Relationships — the connections between these entities.
  • Attributes — characteristics that describe both entities and relationships in more detail.

An arbitrary number of individual objects can exist from an entity set or a relationship set.

For extended relationship types, there is the EERM (Extended ER Model). It supplements the classic ERM with, among other things:

  • Aggregation (Part-of) — represents related objects as a whole.
  • Generalization/Specialization (Is-a) — assigns shared properties to a superclass and introduces subclasses for special characteristics.
  • Additional relationship forms that enable more detailed, semantically richer modeling.

The (E)ER model thus supports a clear, understandable, and systematic planning of data structures.

The ER Model According to Peter Chen — Chen Notation

Section titled “The ER Model According to Peter Chen — Chen Notation”

Chen Notation is the first graphical representation of the ER model, which clearly visualizes entities, relationships, and attributes. It uses rectangles for entities, diamonds for relationships, and ovals for attributes. Cardinalities are represented by lines and labels.

Portrait of Peter Chen.
Fig. 6.6: Peter Chen.

In 1976, Peter Chen introduced the ER model in his article “The Entity-Relationship Model — Toward a Unified View of Data”. He describes how one can represent the real world through entities and their relationships with each other. This model forms the foundation for many modern database designs and implementations.

Entity

An entity is a uniquely distinguishable “thing” from the real world. This can be a person, an object, a company, a dog, a room, or a project. Entities differ from each other through their properties (attributes) and their values.

Object and instance are synonyms for entity. An entity is therefore a concrete object with individual attribute values.

Example: “Employee Schmidt”, “Project 1009”, “Department Research”, “Mike the Dog”, “Ferdinand the Cat”.

Entity Type

The entity type describes a class of similar entities — i.e., things that are described by the same attributes. When modeling, one normally works with entity types, not with individual entities.

Graphically, entity types are often represented as rectangles.

Symbol of an entity type according to Peter Chen.
Fig. 6.7: Symbol of an entity type according to Peter Chen. There is a thing from the real world: 'Dog'.

Examples of entity types: Employee, Department, Project, Dog, Cat.

Entity Set

An entity set is the collection of all entities of an entity type at a specific point in time. All elements in it have the same attributes, but different attribute values.

Entity sets can change over time (new entities are added, others are removed).

Examples: All Employees, All Departments, All Projects, All Dogs, All Cats.

Attributes (properties) describe characteristics of entities, entity types, relationships, or relationship types.

Each attribute has:

  • a name (e.g., Date of Birth, Price, Designation) and
  • a value (e.g., 12.03.2008, 19.99, “Mathematics”, “Mike”).

Attributes are represented in Chen Notation as ovals and connected to the entity type with a line.

Symbol of an entity type with attributes according to Peter Chen.
Fig. 6.8: Symbol of an entity type with attributes according to Peter Chen. A dog has a name and a date of birth. Further attributes could be breed or coat color.

The domain defines which values an attribute may take — the value range. This can be:

  • a fixed list of allowed values (e.g., January, February, …),
  • a range (e.g., numbers 0-999 or letters A-G, 1.5-2.5),
  • a set or data type specification (e.g., natural number, real number, date, Boolean).

This makes clear what kind of data is permissible for an attribute and what is not — important for correct and consistent data in the database. The domain describes the value range of an attribute.

  • A key consists of one or more attributes.
  • It should be as short as possible but as long as necessary (minimality principle).

Artificial Key (Surrogate Key)

  • If no existing attribute (or combination of attributes) uniquely identifies, an artificial attribute is added (e.g., a sequential number/ID).
  • This attribute receives a unique value for each entity and serves as the key.

Primary Key

  • The primary key identifies each entity uniquely; its value appears only once in the entity set.
  • The attribute that makes an entity unique is called the identifying attribute.
  • If one attribute is not sufficient, the primary key can be composite (multiple identifying attributes).
  • An entity type can have multiple possible keys (candidate/alternate keys), but exactly one is designated as the primary key.
  • In ER diagrams, the attributes of the primary key are underlined.
Symbol of an entity type with attributes and a primary key attribute according to Peter Chen.
Fig. 6.9: Symbol of an entity type with attributes and a primary key attribute according to Peter Chen. The dog's name alone is not sufficient to identify it uniquely. Something is needed that truly makes each dog unique: the chip ID. — Note: in Austria, dogs are registered and receive a unique number.

Examples (for orientation)

  • Person: Student number (unique) → good primary key.
  • Person: First name + Last name (not unique) → unsuitable; possibly add Date of birth, or better use a Person ID.
  • Product: Article number instead of Designation (because designations can change or appear more than once).

Some attributes can take multiple values (e.g., phone numbers of a person) or consist of multiple parts (e.g., street, house number, postal code, city for an address). In Chen Notation, such attributes are represented as double ovals.

Symbol of an entity with a multivalued attribute according to Peter Chen.
Fig. 6.10: A dog's pedigree is a multivalued attribute, because a dog can have multiple ancestors or it is sufficiently complex.
  • Relationships show interactions or dependencies between entities.
  • Relationships can also have attributes (e.g., role in project, hire date in a “works-on” relationship).

Relationship Set (Association)

  • A relationship set is a collection of similar relationships that links specific entity sets with each other (e.g., all “Employee-works-on-Project” pairs).

Relationship Type

  • The relationship type is — like the entity type — the abstraction: it describes what kind of connection exists between which entity types.
  • In ER diagrams, a relationship type is represented as a diamond.
  • The diamond is connected via edges to the participating entity types.
  • The name of the relationship type can be written in the diamond (e.g., “works_on”, “owns”, “belongs_to”).
Symbol of a relationship between entities in Chen Notation.
Fig. 6.11: Symbol of a relationship between entities in Chen Notation. Dogs have the habit of barking at cats. So they have a relationship with each other.
  • Relationships can have their own attributes, e.g., the activity of a person in a project or the allocation percentage.
Symbol of a relationship between entities with an attribute in Chen Notation.
Fig. 6.12: Symbol of a relationship between entities with an attribute in Chen Notation. Dogs have the habit of barking at cats. However, there are different types of barking, described here as the attribute 'Behavior': from joyful to aggressive.

The number of participating entity types (degree) can also vary. Usually there are 2 entities, but more are also possible: 3 or more.

  • A Product is made up of Components from various Suppliers (3 entities: Product, Component, Supplier: a ternary relationship).

Generalization and specialization are essentially special types of relationships between entity types. To represent these relationships in Chen Notation, a diamond with the name “is” is used. The diamond is connected with lines to the participating entity types.

As an alternative, a triangle has also become established, symbolizing the generalization/specialization relationship. The triangle is connected with lines to the participating entity types, with the tip of the triangle pointing towards the more general entity type.

Symbol of generalization/specialization in Chen Notation.
Fig. 6.13: Symbol of generalization/specialization in Chen Notation — Dog and Cat are both Animals (is-a). Every instance of Dog and every instance of Cat inherits all attributes from the class Animal.

In many cases, entities are independent: they can be uniquely identified within their entity set.

However, there are also entities that are only unique together with the key of a superordinate (dominant) entity type. These are called weak or dependent entities.

Examples:

  • Bank details of a customer with direct debit authorization: If the customer is deleted, the direct debit authorization must also be deleted (dependency).
  • Class of a school: A class only exists in connection with exactly this school; parts of its description depend on attributes of the school.

Representation in diagrams:

  • Weak entities are drawn as double rectangles.

In the literature, other markings can sometimes be found (e.g., double or thick border, double lines, arrows in different directions).

Symbol of a weak entity in Chen Notation.
Fig. 6.14: Symbol of a weak entity in Chen Notation: the instance cannot exist without the superordinate book.

The cardinality defines how many entities of one entity set can be linked to how many entities of another entity set (e.g., how many employees work on one project — and vice versa).

  • Exactly one assignment (1)

    An entity is connected to exactly one entity on the other side.

    Example: Each invoice belongs to exactly one customer(1:1), if this also holds in reverse.

  • One or more assignments (1..n or n)

    An entity is connected to one or more entities on the other side.

    Example: One project has one or more employees(1:n).

  • Many-to-many (m:n)

    Multiple entities on one side can be connected to multiple on the other side.

    Example: Employees work on multiple projects, and projects have multiple employees(m:n).

Symbol of a relationship between entities with n:m cardinalities in Chen Notation.
Fig. 6.15: Symbol of a relationship between entities with n:m cardinalities in Chen Notation. In principle, any number of dogs can bark at any number of cats. So: each dog can bark at many cats, and conversely, each cat can be barked at by many dogs.

Optionality (if needed):

  • 0..1: zero or one assignment (optional, at most one).
  • 0..n: zero, one, or many assignments (optional, any number).
  1. Identify entities: Mark possible entities in the text. Candidates are usually nouns.

    A small animal clinic manages pets belonging to their owners. Each owner can have multiple animals, but an animal belongs to exactly one owner. Each animal is either a dog or a cat. Each visit has a date, a reason for treatment, and an invoice amount. For dogs, breed-specific information and an optional chip number are recorded; for cats, coat color and neutering status are documented.

    If you are unsure whether something is an entity, ask yourself: “Is this a thing about which I want to store information?” If yes, it is probably an entity.

    The text contains plural forms (e.g., “animals”, “owners”). These often indicate entity sets. The singular form (e.g., “animal”, “owner”) is used as the entity type.

    There can also be homonyms and synonyms — choose one term and stick with it.

    Also notice whether there is a term that describes the overall system (e.g., “small animal clinic”). This could also be an entity, but it is usually the context, the title of the model, or the database as a whole.

    Summary:

    • Model: “Small Animal Clinic”
    • Entity types: “Owner”, “Animal”, “Dog”, “Cat”, “Visit”
  2. Define attributes: What properties (attributes) do the entities have? What information do you want to store about them?

    The following attributes result from the text:

    • Owner: Name, Address — these attributes do not appear in the fact base, but make sense to identify and contact the owner. Be generally careful with attributes not explicitly mentioned! They may seem sensible to you, but perhaps they are not in the given context or may take on a different meaning.
    • Animal: Name — this attribute also does not appear in the fact base, but makes sense to name the animal.
    • Dog (specialized from Animal): Breed, Chip Number
    • Cat (specialized from Animal): Coat Color, Neutering Status
    • Visit: Date, Reason, Price
  3. Define relationships and cardinalities: How are the entities related? What relationships exist between them?

    The following relationships result from the text:

    • Owner owns Animal (1:n): An owner can have multiple animals, but each animal belongs to exactly one owner.
    • Animal has Visit (1:n): An animal can have multiple visits, but each visit belongs to exactly one animal.
    • Animal is either Dog or Cat: Each animal is exactly one dog or one cat. This is a generalization/specialization.
  4. Choose keys: Which attributes make the entities uniquely identifiable?

    The following key attributes result from the text:

    • Owner: An ID as a sequential number. The name alone is not sufficient, as multiple owners can have the same name.
    • Animal: Also an ID as a sequential number. The name alone is not sufficient, as multiple animals can have the same name.
    • Dog: Chip Number (optional, but unique when present). This is in addition to the Animal ID.
    • Cat: No unique attributes mentioned in the text, so only the Animal ID.
    • Visit: Also an ID as a sequential number. The date alone is not sufficient, as an animal could have multiple visits on the same day.
  5. Draw the ER model: Draw the ER diagram in Chen Notation based on the entities, attributes, relationships, and cardinalities defined above.

    ER diagram of a small animal clinic in Chen Notation.
    Fig. 6.16: ER diagram of a small animal clinic in Chen Notation.

Information Engineering (IE) Notation, Martin Notation, or Crow’s Foot Notation

Section titled “Information Engineering (IE) Notation, Martin Notation, or Crow’s Foot Notation”

The Information Engineering (IE) Notation, also called Martin Notation or Crow’s Foot Notation, is an alternative graphical representation of the Entity-Relationship Model (ERM). It was developed by James Martin and is characterized by its clear and concise representation of entities, relationships, and cardinalities.

Chen Notation was the first graphical representation of the ER model and was therefore very influential and successful. However, over time it was found to be somewhat complex and cumbersome, especially in representing cardinalities and optional relationships.

The common disadvantages of Chen Notation:

  1. Too many symbols: Entities, relationships (diamonds), and attributes (ovals) quickly make large models confusing. There are many different symbols that take up a lot of space.
  2. Cardinality not immediately visible at the end: You often have to read the relationship (diamond), instead of seeing the cardinality directly at the end of the line.
  3. Poor scalability: With many entities/relationships, the diagram becomes very “wide” and difficult to present on one page.
Overloaded ER diagram in Chen Notation.
Fig. 6.17: An overloaded diagram in Chen Notation.
Source: [5]

As a result, a number of alternative notations emerged that wanted to address these problems. The Information Engineering (IE) Notation is one of the best known and has established itself in practice. It is today the standard in many companies and is frequently used in data modeling.

In the IE model too, entities are represented as rectangles. However, the rectangles in IE Notation are often kept simpler, without the additional lines for attributes as in Chen Notation. In IE Notation, attributes are listed within the rectangle of the entity, instead of representing them as separate ovals.

An entity in Crow's Foot Notation.
Fig. 6.18: An entity in Martin, IE, or Crow's Foot Notation. There is a dog with the attributes: ChipID, Name, Date of Birth.

This results in a more compact representation and makes reading the model easier.

Crow’s Foot Notation uses lines to represent relationships between entities. Cardinalities are displayed directly at the ends of the lines through special symbols.

The cardinalities in Crow's Foot Notation.
Fig. 6.19: The cardinalities in Crow's Foot Notation.

Through this representation, cardinalities are immediately visible at the end of the line, making understanding the relationships easier. The shape of the “many” cardinality gives the notation its name “Crow’s Foot”.

Examples of relationships between entities in Crow's Foot Notation.
Fig. 6.20: Examples of relationships between entities in Crow's Foot Notation.
Source: [6]

In the example above, these cardinalities are immediately recognizable:

  1. A student can enroll in many courses. Even in none. Conversely, each course has many students, but at least one.

  2. A customer can place many orders, but there are also customers without orders. Each order belongs to exactly one customer.

  3. An employee works in exactly one department. A department has many employees, but at least one.

For clarification of relationships, relationship names can also be added, similar to Chen Notation.

A relationship in Crow's Foot Notation with relationship name.
Fig. 6.21: A relationship between entities in Crow's Foot Notation with relationship name.

In (pure) Crow’s Foot Notation, there is no way to represent relationships with attributes. Instead, in such cases an associative entity is introduced. This associative entity represents the relationship itself as an independent entity with its own attributes.

In Crow’s Foot Notation, generalization/specialization is represented similarly to Chen Notation, but without the diamond. Instead, a triangle or sometimes a semicircle is used, symbolizing the generalization/specialization relationship. The triangle or semicircle is connected with lines to the participating entity types, with the tip of the triangle (or the center of the arc) pointing towards the more general entity type.

A generalization in Crow's Foot Notation.
Fig. 6.23: A generalization in Crow's Foot Notation: Dogs and Cats are both Mammals. Every instance of Dog and every instance of Cat inherit all attributes from the class Mammal: Date of Birth, Name, Gender. Additionally, dogs have a ChipID and cats have a coat color.

However, this notation is not as widely used as Chen Notation, and there are fewer standardized symbols for representing generalizations/specializations.

There are also variants that mark the subsets of the specializations (disjoint vs. overlapping) and whether the generalization is total or partial. However, these are not always used.

  • Partial: Not every instance of the superclass must belong to a subclass. The image above is an example of a partial generalization. Besides dogs and cats, there could be other mammals (e.g., rabbits, guinea pigs, hamsters, …).

  • Total: Every instance of the superclass must belong to at least one subclass. Here an additional crossbar is placed on the triangle or semicircle. In the example below, this means there are no other mammals besides dogs and cats in this model. That’s all: dogs and cats are the only mammals that exist. Period.

A total generalization in Crow's Foot Notation.
Fig. 6.24: A total generalization in Crow's Foot Notation: Dogs and Cats are both Mammals. Since this is a total generalization, every instance of the superclass (Mammal) must belong to at least one subclass (Dog or Cat). There are therefore no other mammals besides dogs and cats in this model.
  • Overlapping: The subclasses can coexist. An instance of the superclass can belong to multiple subclasses. In the example below, this means that a mammal can be both a dog and a cat (e.g., a “cat-dog”… if such a thing existed).

  • Disjoint: The subclasses are exclusive. An instance of the superclass can only belong to one subclass. In the example below, this means that a mammal is either a dog or a cat, but not both at the same time. For this, a cross is usually drawn in the triangle or semicircle.

A disjoint generalization in Crow's Foot Notation.
Fig. 6.25: A disjoint generalization in Crow's Foot Notation: Dogs and Cats are both Mammals. Since this is a disjoint generalization, every instance of the superclass (Mammal) can only belong to one subclass (Dog or Cat). There are therefore no mammals that could be both dogs and cats at the same time.

Summary:

All types of generalization in Crow's Foot Notation.
Fig. 6.26: All types of generalization in Crow's Foot Notation: 1) Partial and Overlapping, 2) Total and Overlapping, 3) Partial and Disjoint, 4) Total and Disjoint.

The steps to create the ER diagram in Crow’s Foot Notation are conceptually identical to those in Chen Notation. The only difference lies in the graphical representation.

ER diagram of a small animal clinic in Crow's Foot Notation.
Fig. 6.27: ER diagram of a small animal clinic in Crow's Foot Notation. Compared to Chen Notation, the diagram is significantly more compact and clearly presented.

Learning Outcomes: What you should be able to do after this chapter

Section titled “Learning Outcomes: What you should be able to do after this chapter”

After completing this chapter, students should be able to:

  • Name: name the components of an ER model (entities, attributes, relationships, cardinalities).
  • Describe: describe the steps of data modelling and the abstraction mechanisms (classification, aggregation, generalization).
  • Apply: translate a given real-world scenario into an ER diagram in Chen or crow’s-foot notation.
  • Compare: contrast the Chen notation with the Information Engineering (crow’s-foot) notation.
  • Design: independently design a conceptual data model for a given task.