Certified Data Mining and Warehousing Professional

Monitoring and managing data growth
 


The world of data warehousing operates by different rules. Peak performance is not the goal: if performance is deemed acceptable, that is good enough. But there are still things to monitor in data warehouse analytical processing, and the number one thing the analyst needs to be wary of is dormant data. Dormant data is data residing in the data warehouse that is not being accessed. In the early days of a data warehouse there is hardly any dormant data to be found, but as the warehouse grows older, dormant data starts to accumulate.

Dormant data in a data warehouse environment is like cholesterol in the bloodstream: with enough of it, the heart has to pump extra hard just to move blood around. Dormant data also costs money, both in wasted storage and in the processor cycles needed to move data through the system. Dormant data is not good for the health of the data warehouse environment.

Monitor the activities inside the data warehouse once it has been built. The main types of activities that happen within the data warehouse are:

· data loads,

· queries and reports,

· data archives,

· backups/recoveries.

Monitor Data Usage

Determine data usage by identifying which:

· rows of data are and are not being used,

· columns of data are and are not being used.

Data usage must be determined at the row and column level rather than at the table level, because a query may touch only one row in a table, or, even when most rows are accessed, only a few columns may be read.

Monitoring data usage also identifies when data is used, revealing access patterns over time. For example, some data may only be used once a year.

Data that is not used or is infrequently used should be removed from the data warehouse or moved to near-line or off-line storage.
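
As an illustration of row- and column-level usage analysis, the following sketch assumes a hypothetical query-activity log (one record per query, listing the table and columns touched) and a hypothetical catalog of stored columns; columns that never appear in the log over the observation window become candidates for near-line or off-line storage.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical parsed query log: one record per query, listing the table and
# columns it touched. In practice this would come from the DBMS activity monitor.
query_log = [
    {"table": "sales_fact", "columns": ["sale_date", "amount"], "ts": datetime(2024, 3, 1)},
    {"table": "sales_fact", "columns": ["amount"], "ts": datetime(2024, 6, 9)},
    {"table": "customer_dim", "columns": ["customer_id", "name"], "ts": datetime(2024, 1, 15)},
]

# Hypothetical catalog of what the warehouse actually stores.
catalog = {
    "sales_fact": ["sale_id", "sale_date", "amount", "discount", "promo_code"],
    "customer_dim": ["customer_id", "name", "segment", "fax_number"],
}

used_columns = defaultdict(set)
for rec in query_log:
    used_columns[rec["table"]].update(rec["columns"])

# Columns never referenced in the observation window are dormant-data candidates.
for table, all_cols in catalog.items():
    dormant = set(all_cols) - used_columns.get(table, set())
    if dormant:
        print(f"{table}: candidate columns for near-line storage -> {sorted(dormant)}")
```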

Monitor Data Warehouse Users

Monitor usage by user. Identify which users access the data warehouse frequently and which do not. Monitoring usage by user identifies the heavy users of the data warehouse and allows the data warehouse support staff to provide them additional support.
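
A minimal sketch of user-level monitoring, assuming a hypothetical access log captured by the activity monitor; it simply ranks users by query count so heavy and infrequent users stand out.

```python
from collections import Counter

# Hypothetical access records captured by the activity monitor.
access_log = [
    {"user": "analyst_a", "ts": "2024-06-01T09:10"},
    {"user": "analyst_a", "ts": "2024-06-01T09:40"},
    {"user": "exec_b", "ts": "2024-06-03T14:05"},
]

queries_per_user = Counter(rec["user"] for rec in access_log)

# Heavy users first; infrequent users appear at the bottom of the ranking.
for user, count in queries_per_user.most_common():
    print(f"{user}: {count} queries")
```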

Monitor Response Time

Monitor response time. Monitoring response times helps identify problems and ensures that adequate response times are maintained. Response times must be measured over a long period of time to adequately assess what is really happening.
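
The sketch below illustrates one way such long-running measurement could be summarized, assuming hypothetical response-time samples grouped by month; trending the median and worst case per month shows whether response time is drifting as the warehouse grows.

```python
import statistics
from collections import defaultdict

# Hypothetical response-time samples: (month, seconds) pairs collected over a long period.
samples = [
    ("2024-04", 2.1), ("2024-04", 3.4), ("2024-04", 15.0),
    ("2024-05", 2.3), ("2024-05", 4.1), ("2024-05", 22.7),
]

by_month = defaultdict(list)
for month, seconds in samples:
    by_month[month].append(seconds)

# Looking at both the median and the worst case per month shows whether
# response time is drifting as the warehouse grows.
for month in sorted(by_month):
    values = sorted(by_month[month])
    median = statistics.median(values)
    worst = values[-1]
    print(f"{month}: median={median:.1f}s worst={worst:.1f}s over {len(values)} queries")
```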

Monitor Data Warehouse Activities

Monitor the activities within the data warehouse. Identify the makeup of the workload going through the data warehouse. Identifying the types of activities (e.g., large queries) and when they occur (e.g., data loads running during peak query hours) helps identify performance issues.
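
A minimal sketch of workload analysis, assuming hypothetical activity records tagged with a type (load, query, archive, backup) and an hour of day; hours where loads overlap query traffic are flagged as likely performance trouble spots.

```python
from collections import defaultdict

# Hypothetical activity records: type is one of load / query / archive / backup.
activities = [
    {"type": "load", "hour": 9}, {"type": "query", "hour": 9},
    {"type": "query", "hour": 10}, {"type": "query", "hour": 9},
    {"type": "backup", "hour": 2},
]

by_hour = defaultdict(lambda: defaultdict(int))
for act in activities:
    by_hour[act["hour"]][act["type"]] += 1

# A load showing up in the same hour as heavy query traffic is a likely
# cause of performance complaints.
for hour in sorted(by_hour):
    mix = dict(by_hour[hour])
    if mix.get("load") and mix.get("query"):
        print(f"hour {hour:02d}: loads overlap with queries -> {mix}")
    else:
        print(f"hour {hour:02d}: {mix}")
```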

Tips and Hints

An activity monitor that monitors decision support system (DSS) processing (not on-line transaction processing) is required to effectively monitor the data warehouse.

 

A large data warehouse collects data from a variety of external data repositories and sources. The collected raw data bubbles up through multiple processing layers within the data warehouse and is finally transformed into information consumable by decision-makers at multiple levels of responsibility. At each step of data collection and transformation, a variety of business rules, selection criteria and data mappings are applied to the data stream, creating opportunities for data to be lost or corrupted. Another contributing factor to possible inaccuracy of the information coming from a data warehouse is the need to combine data from diverse sources that were never designed for cross-system data integrity.

A complex data warehouse system includes a variety of diverse data sources and information repositories, with multiple data extraction and transformation layers. This environment is prone to data losses and discrepancies. Controlling data completeness and integrity is vital for the accuracy of reports and metrics originating from the data warehouse. Special systems are required to periodically measure data discrepancies and deliver the findings to support organizations in a timely manner. Analyzing these discrepancy measures over time improves the overall accuracy, completeness and timeliness of the resulting reports and metrics, as well as the business processes deployed to ensure underlying data quality.
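
As a sketch of such a periodic discrepancy measurement, the following assumes hypothetical extracts of a source feed and the warehouse layer it loads, and compares a row count and a simple control total; real reconciliation systems run checks like this per feed and trend the findings over time.

```python
# Hypothetical reconciliation check between a source extract and the warehouse
# layer it feeds: compare row counts and a simple control total (sum of amounts).
source_rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 75.5},
    {"order_id": 3, "amount": 10.0},
]
warehouse_rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 75.5},
]

count_gap = len(source_rows) - len(warehouse_rows)
total_gap = sum(r["amount"] for r in source_rows) - sum(r["amount"] for r in warehouse_rows)

# Findings like these would be written to a discrepancy log and trended over time.
if count_gap or abs(total_gap) > 0.01:
    print(f"discrepancy: {count_gap} missing rows, control-total gap {total_gap:.2f}")
else:
    print("source and warehouse layers reconcile")
```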


Because of business needs for data retention and regulatory compliance, enterprises need to manage increasingly large databases ranging from hundreds of gigabytes to many terabytes, or even petabytes, in size. As data continues to grow at an exponential rate, DBAs and IT professionals in these organizations face daunting challenges when designing and operating such large databases. The data must be well organized to effectively cope with data growth and to meet service requirements. Challenges include how best to achieve the following objectives.
  • Designing databases that can accommodate continuous data growth.
  • Keeping database systems lean and high performing.
  • Managing data lifecycles more efficiently and less intrusively to keep operational data highly available (a lifecycle sketch follows this list).
  • Reducing the cost and impact of data maintenance operations, such as backup and restore operations, to keep mission-critical data ready when needed, and reorganization and runstats operations to maximize system performance.
  • Satisfying near real-time requests for transactional data or the complex analytical query requirements for large data sets, often including historical data, while reducing the total cost of ownership (TCO).
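
As a sketch of the lifecycle objective above, the following assumes a hypothetical catalog of range partitions and an assumed two-year retention policy; partitions whose newest data falls before the cutoff are flagged for detachment to near-line storage, keeping the operational tables lean.

```python
from datetime import date

# Hypothetical partition catalog for a range-partitioned fact table, keyed by month.
partitions = [
    {"name": "sales_2021_12", "high_value": date(2021, 12, 31), "size_gb": 310},
    {"name": "sales_2024_05", "high_value": date(2024, 5, 31), "size_gb": 290},
]

RETENTION_YEARS = 2  # assumed retention policy; real policies come from compliance rules
today = date(2024, 6, 30)
cutoff = date(today.year - RETENTION_YEARS, today.month, today.day)

# Partitions whose newest data falls before the cutoff can be detached and
# moved to cheaper near-line storage.
for part in partitions:
    if part["high_value"] < cutoff:
        print(f"archive candidate: {part['name']} ({part['size_gb']} GB)")
```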

The explosion in data, query complexity and low-latency demands has come at a time when another trend has been at work in data warehousing. Tighter IT budgets and pressures to reduce costs are placing a strong emphasis on justifying expenditures through demonstrable returns on investment. How can companies balance these seemingly conflicting demands? Now more savvy about the total costs of data-warehouse operations and aware of the management and database administration challenges, IT buyers are demanding lower prices from vendors and more attention to their specific needs.

Budget pressure is endemic in an age of global competition, but the BI and data-warehousing community has been under special scrutiny from business-side executives. In our experience, as many as 75 percent of data warehouses have fallen short of producing business benefits commensurate with the expenditures. Although vendors spotlight the smaller percentage that has experienced fabulous success, the vast majority are still looking for the big business upside.

Vendors, therefore, face a tall order: solve more difficult problems even as they respond to greater pressure on pricing. The most established data warehouse platform vendors--IBM, Microsoft, Oracle and NCR's Teradata Division--share a focus on four principal objectives:

• Increase the capability to handle scale and complexity

• Simplify operation, management and use

• Incrementally adjust software pricing downward

• Pass along hardware savings

Let's take a closer look at what's most interesting about how each of these vendors is attacking these objectives.

IBM. With DB2 version 9, the company is simplifying the buying process by introducing a preconfigured unit of data warehouse capacity: the balanced configuration unit. The BCU sets out what a customer would need in DB2 software, servers, storage and interconnect technology to meet specific objectives. Design wizards advise users on physical implementation techniques. To improve performance, scalability and manageability, version 9 also offers range partitioning and data compression; these technical features work in concert with existing multidimensional clustering to help customers reduce I/O workload and overall latency.

Microsoft. SQL Server 2005 was a major advance for Microsoft. The largest SQL Server data warehouse identified in the 2005 TopTen Survey contained 19.5 TB of data, which was more than 10 times larger than the biggest SQL Server data warehouse in 2003. The 2005 version introduced range partitioning for both tables and indexes. Microsoft also offers 64-bit support to improve data caching and reduce I/O, plus a new set of data integration services for extract, transformation and loading (ETL).

Oracle. In the latest version of its database, Oracle 10gR2, Oracle offers a grid architecture that uses high-speed interconnect networks to arrange storage and server resources. Oracle 10gR2 promises that applications, databases and workloads can be distributed and managed transparently: that is, without users needing to know about the underlying platforms. Oracle's long-term vision is that it should be possible and practical to expand, reconfigure and adapt the grid dynamically to meet changing requirements and ebbs and flows in traffic; respond to equipment, power and network failures; enable maintenance; and expedite or slow down work in response to changing priorities. While Oracle plans a series of releases to fully realize these goals, 10g's Real Application Clusters (RAC) technology is already in widespread use for data warehouses.

NCR/Teradata. As an extension to its pre-existing toolset, the Teradata Active System Management (TASM) facility is an important advance for Teradata customers trying to gain the upper hand in managing large workloads. In coming releases, the company's goal is to enable users to specify service levels and define priorities and policies--and let the system do almost everything else by itself. This includes supporting both the short, rapid queries necessary for "active," operational data warehousing as well as conventional, long-running, strategically oriented queries and large batch jobs. Advances in parallelism, query processing and optimization are also on Teradata's agenda.

Teradata, which did much to define data warehousing originally, remains the farthest ahead in addressing enterprise data-warehouse requirements: the multiple scalability challenges that are driving growth toward monster systems. Microsoft has the furthest to go. However, all will be major players in the enterprise data warehouse game in the coming years.

 

Getting A Handle On Growth

What should you do to prevent data warehouse requirements from surpassing your ability to manage them? You don't have to be implementing a 500 TB warehouse to encounter challenges in performance, scalability or availability. The complexity of the problems often stumps organizations more than sheer scale does. A company might have less than a terabyte of user data yet face business rules so complex that no database system could handle its workload easily. With just 1 TB of data and a sufficiently complex schema and workload, you'll find yourself up against some tough problems.

The first step is to understand the business drivers. Make sure your business-level requirements and schedules reflect those drivers accurately. Then be sure to define those requirements by year over a strategic time frame (typically three to five years). Third, develop a set of concrete usage scenarios that are characteristic of the workload the business requirements will demand.

With the business drivers and requirements clear, you can focus more closely on the data-warehouse issues, including database size, structure, workload and service-level agreements. Build a margin of safety into your requirements; you'll never have complete certainty about how the system will be used and what might ultimately expand those requirements (such as success!). Then it's time to evaluate the long-term and near-term trade-offs:
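
A minimal sketch of such a projection, with an assumed current size, growth rate and safety margin (all hypothetical figures); the point is to plan capacity year by year over the strategic time frame rather than against today's volume.

```python
# Hypothetical capacity projection over a strategic planning horizon.
current_size_tb = 8.0        # today's user data
annual_growth_rate = 0.45    # assumed 45% compound growth per year
safety_margin = 1.25         # 25% headroom for requirements you cannot foresee
horizon_years = 5

for year in range(1, horizon_years + 1):
    projected = current_size_tb * (1 + annual_growth_rate) ** year
    planned = projected * safety_margin
    print(f"year {year}: projected {projected:.1f} TB, plan for {planned:.1f} TB")
```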

• In the long term, the ability to leverage data from across the enterprise pays huge dividends, and is the key to most of the success stories you hear about. Clearly, the core of your enterprise data must be integrated to provide a platform for rapid and inexpensive implementation of analytical solutions.

• In the near term, you want to identify where you can gain significant cost or performance advantages in a specific application or sandbox with a data-mart or appliance approach. Remember that some benefits may come at the price of fragmenting decision support data and incurring greater overhead for replication, ETL and other data movement and integration.

Testing your intended solution against your detailed requirements is critical before committing to it. This advice applies to everything that will be important to handling scalability: the platform, configuration, database design and more.
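
The skeleton below illustrates the idea of such a test, replaying hypothetical usage scenarios concurrently and recording elapsed times; the run_query stand-in would be replaced by real query execution through the chosen platform's client library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# run_query is a stand-in for a call through the target platform's client library.
def run_query(scenario: str) -> float:
    start = time.perf_counter()
    time.sleep(0.05)  # placeholder for real query execution
    return time.perf_counter() - start

# Hypothetical usage scenarios derived from the business requirements.
scenarios = ["daily_sales_rollup", "customer_churn_scan", "ad_hoc_drilldown"] * 10

# Replay the scenarios with realistic concurrency and keep the timings,
# so the candidate platform is judged against the expected workload.
with ThreadPoolExecutor(max_workers=5) as pool:
    timings = list(pool.map(run_query, scenarios))

print(f"{len(timings)} queries, slowest {max(timings):.3f}s, mean {sum(timings)/len(timings):.3f}s")
```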

Remember, previous experience is not always a good predictor of future success. And don't simply wait and hope that scalability problems will work themselves out. Take the initiative to prove to yourself--not to mention all your stakeholders within and outside the enterprise--that the solution you've chosen will scale and be strong enough to withstand the challenges of many kinds of growth.

 
