Data and architecture design
Data architecture in Information Technology is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations. A Data Architecture is often the design of data for use in defining the target state and the subsequent planning needed to achieve the target state. It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.
A data architecture should set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems. Data integration, for example, should be dependent upon data architecture standards since data integration requires data interactions between two or more data systems. A data architecture, in part, describes the data structures used by a business and its computer applications software. Data architectures address data in storage and data in motion; descriptions of data stores, data groups and data items; and mappings of those data artifacts to data qualities, applications, locations etc.
Essential to realizing the target state, Data Architecture describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system.
The Data Architect is typically responsible for defining the target state, aligning during development and then following up to ensure enhancements are done in the spirit of the original blueprint.
During the definition of the target state, the Data Architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The Data Architect breaks the subject down by going through 3 traditional architectural processes:
- Conceptual - represents all business entities.
- Logical - represents the logic of how entities are related.
- Physical - the realization of the data mechanisms for a specific type of functionality.
The "data" column of the Zachman Framework for enterprise architecture –
Layer | View | Data (What) | Stakeholder |
1 | Scope/Contextual | List of things and architectural standards important to the business | Planner |
2 | Business Model/Conceptual | Semantic model or Conceptual/Enterprise Data Model | Owner |
3 | System Model/Logical | Enterprise/Logical Data Model | Designer |
4 | Technology Model/Physical | Physical Data Model | Builder |
5 | Detailed Representations | Actual databases | Subcontractor |
In this second, broader sense, data architecture includes a complete analysis of the relationships between an organization's functions, available technologies, and data types.
Data architecture should be defined in the planning phase of the design of a new data processing and storage system. The major types and sources of data necessary to support an enterprise should be identified in a manner that is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data entities, not to specify computer hardware items. A data entity is any real or abstracted thing about which an organization or individual wishes to store data.
Data Design
Designing data is about discovering and completely defining your application's data characteristics and processes. Data design is a process of gradual refinement, from the coarse "What data does your application require?" to the precise data structures and processes that provide it. With a good data design, your application's data access is fast, easily maintained, and can gracefully accept future data enhancements.
The process of data design includes identifying the data, defining specific data types and storage mechanisms, and ensuring data integrity by using business rules and other run-time enforcement mechanisms.
Data describes a real-world information resource that is important to your application. Data describes the things, people, products, items, customers, assets, records, and — ultimately — data structures that your application finds useful to categorize, organize, and maintain.
Identifying data is an iterative process. At first you may just know some vague, high-level details about how the application must handle its information. As you keep expanding your knowledge of the application's intended business processes, you continue filling in more details.
As you begin documenting the data requirements for your application, the description for each item of data typically includes:
- Name.
- General description (what is it?).
- Ownership (who is responsible for it?).
- Data characteristics (how is it measured and how big or small can it be?).
- Logical events, processes, and relationships (how and when is it created, modified, and used?).
It is worth noting that data has many different characteristics. Part of the process of data design is to specify how to quantify each data item. Some typical data characteristics are:
- Location attributes (address, country, warehouse bin).
- Physical attributes (weight, dimension, volume, color, material, texture).
- Conceptual attributes (name, rank, serial number).
- Relational attributes (assemblies consist of sub-assemblies, authors write multiple books).
- Value attributes (currency, goodwill, esteem).
The process of identifying data requires interviews, analysis of existing data structures, document preparation, and peer reviews. The eventual result is a documented, conceptual view of your application's information that answers the data questions of "What, where, when, and why?" Generally, this is an early-stage exploration of how the various departments, organizations, and your application need to use data.
The general approach to defining data includes the following:
- Defining the tables, rows, and columns.
- Inserting index keys.
- Creating table relationships.
- Assigning data types.
Defining the Tables, Rows, and Columns
Regardless of how your application's data is physically stored, the data is typically organized into multiple tables (or files), each having a set of rows (or records) and columns (or fields), similar to the rows and columns of a spreadsheet. Each row in the table contains all of the information about a particular thing, person, product, item, customer, or asset.
For example, the following Authors table stores the name, address, and phone number for several authors.
Row | Name | Address | Phone |
---|---|---|---|
1 | Arnie Baldwin | xxxxx xxxxxxxx | (xxx) xxx-xxxx |
2 | Don Hall | xxxxx xxxxxxxx | (xxx) xxx-xxxx |
3 | Teresa Atkinson | xxxxx xxxxxxxx | (xxx) xxx-xxxx |
4 | David Simpson | xxxxx xxxxxxxx | (xxx) xxx-xxxx |
Inserting Index Keys
A key is a special field that provides an index for fast retrieval. A key can be unique or non-unique, depending on whether duplicates are allowed. A key can be designated as the primary key, making it the unique identifier for each row of the table. You should use keys where your application needs direct access to specific rows.
For example, in the following table the author's identification number (au_id) is added as the table's primary key, so that au_id uniquely identifies one and only one author. A query using a value for au_id will provide very fast retrieval of that author's information.
Authors table |
---|
au_id (key) |
au_name |
au_address |
au_phone |
Creating Table Relationships
A database is usually composed of more than one table, and the tables are often related to one another in various ways, often by using foreign keys. For instance, a Titles table might list the International Standard Book Number (ISBN), title, and year the book was published. It would also be useful to identify the publisher for each title. Rather than repeating all of the publisher information for each title in the Titles table, you could create a Publishers table and establish a relationship with the Titles table by including the publisher's identification (pu_id) in the Titles table as a foreign key.
In the following example, the Publishers table is related to the Titles table:
Titles table | Publishers table |
---|---|
ti_isbn (key) | pu_id (key) |
ti_title | pu_name |
ti_yearpublished | pu_address |
pu_id (foreign key) | pu_phone |
This is called a one-to-many relationship because a single row in the Publishers table can be related to one or more rows in the Titles table (because a publisher will handle many titles). You can also create relationships that are one-to-one and many-to-many.
It is worth noting that you have only identified that a relationship exists between the Publishers and Titles tables — you have made no commitment to how that relationship will be managed. Depending on your final implementation, you might use the table joins and foreign key constraints available with SQL Server, or you might write custom code to read the file structures directly and handle referential integrity inside your application.
Assigning Data Types
A data type is a named category of data characterized by a set of values, a way to denote the values, and some implied operations that can interpret and manipulate the values. Data types can be either intrinsic or derived.
An intrinsic data type is one that is provided by your database. For example, SQL Server provides intrinsic data types such as integer, datetime, bit, char, and varchar.
Derived data types are defined using the Data Modeling Language (DML) provided by your database. A derived data type is built from the available intrinsic data types or previously defined derived data types. You typically provide a name and a structure for the derived data type. With derived data types, you can assure consistent use of special data types for selected columns, variables, and parameters.
Data types are important because they assure that the assigned data value is of the correct type and within the acceptable range of values. Different data storage technologies and development languages support a variety of data types, including the following.
- Boolean.
- Integer.
- Float.
- Datetime.
- Tinyint.
- DISPLAY.
- COMP-3.
- Binary.
- String.
- Character.
While assigning data types, you want to be sure that the range provided by the data type fits the data which will be stored and — hopefully — anticipates future changes. For example, if you choose tinyint as a data type for customer identification, your application can handle a maximum of only 255 customers. On the other hand, if you choose the integer data type, you can have over two billion customers. As another example, if you use a single character to indicate customer service code, a later expansion to two characters will require difficult changes throughout your application.
You can save space in your database and improve join operations by choosing appropriate data types for fields. As a general rule, you should choose the smallest data type that is appropriate for the data in the field.
While you are assigning data types, some things to think about are:
- Allowable maximums and minimums.
- Default values.
- Empty (or NULL) values.
- Anticipated growth.
- Expected and (to the extent possible) unexpected changes.
In a relational database environment, data types help enforce the business rules. For example, you cannot add dollars and colors and get a useful answer. While you would never program such an addition process intentionally, a relational database will identify the data type mismatch and automatically deny your query.
BI architecture Design
Every business intelligence (BI) deployment has an underlying architecture. The BI architecture is much like the engine of a car – a necessary component, often powerful, but one that users, like drivers, don’t always understand. For some companies new to business intelligence, the BI architecture may primarily be the operational systems and the BI front-end tools. For more mature BI deployments and particularly for enterprise customers, it will involve ETL (extract, transform, and load) tools, a data warehouse, data marts, BI front-end tools, and other such components.
When IT discusses BI with users, we readily fall into techno babble, and senseless acronyms abound. Most car drivers know that cars have a battery, a transmission, a fuel tank – an adequate level of knowledge for having a conversation with a mechanic or salesperson but arguably not so much expertise to begin rebuilding an engine.
Conceptual BI Architecture
The Conceptual View provides a broad overview of the entire BI vision.
A conceptual BI Technical diagram is useful for illustrating how all the BI technology fits together, and assists with both the strategy definition and subsequent implementation planning.
Data Architecture View
The data architecture provides understanding as to what data structures are to be implemented, how that data is stored in each and how the data will propagate throughout the warehouse environment.
The following sub-sections may/may not be included:
Data architecture goals and constraints – this may include data integrity items such as “all warehouse-centric data must first be incorporated into the atomic layer, with all subsequent use of that data sourced from this single data structure”. Constraints might also include third-party data or technologies.
Logical data architecture - logical models that support data architecture goals. These will be limited until a specific warehouse iteration has been reached in the BI Roadmap. However, it is possible to provide a subject area model of the enterprise or a series of subject area models that describe core subjects within the enterprise. It may also include rules for mutating raw source data into atomic-level data and even guidelines defining how and when to use star schemas and OLAP cubes.
Architecturally significant design components – this may include items such as: establishment of an atomic layer or an ODS or an enterprise cube farm, or specific data elements such as geospatial data structures or specialized data staging.
Test plans – pre-rollout test plans are a critical part of the BI Roadmap. This section includes a testing and acceptance standard for all warehouse iterations, including: test templates, criteria for enterprise adherence and approval, criteria for test data selection and performance testing (including unit, suite and stress testing).
Data architecture often starts at a high level and evolves details of the architecture with each successive iteration of the warehouse.
Technical Architecture VIew
The technical architecture focuses on tangible components of the BI environment. Components must be described sufficiently, with technical diagrams and related textual detail. At the beginning of the BI program you will only have high level visions of the technical architecture, such as wanting to implement star schemas in a relational database. At this stage you do not know which relational database management system (RMDBS) vendor will be implemented. The technical architecture will evolve as updated versions as necessary.
Including a technical architecture diagram provides a general understanding of the current architecture as well as future architecture as details are confirmed.
The technical architecture section may contain:
- Technical architecture goals and constraints - specific to tangible components.
- Technical architecture - the hardware, software and network/communication components.
- Architecturally significant design components – for instance, you may require a 7x24 implementation with mirroring across two distinct data centers.
Implementation View
The implementation view also typically starts as a high-level perspective, with detail added as it becomes known.
This section is compiled by designers and project planners to:
- Establish the guidelines necessary for building and maintaining the purposed warehouse structures and related technologies.
- Detail the implementation of core processes
- Outline the sequence of establishing data structures.
The implementation view will contain three distinct perspectives: the overall strategy, architecture and process.
Strategy
The strategy subsection includes:
Timelines and Resource availability schedules to help planning and prioritizing of BI iterations – using process flows.
Funding
Architecture
Includes details such as:
- Size and performance requirements
- Data quality issues
- Meta data control and retention policies.
The purpose is to indicate potential areas of impact, such as, retention policies will impact both data architecture [partitioning] as well as technical architecture [disk storage].
Process
Outlines high-level process issues such as:
- Refresh rates
- Backup/recovery
- Archive
- Workflow
- Security