|
Testing Data Warehouse DSSHere are my notes on how to manage data in computers within a large enterprise. Take the Brainbench certification test on Data Warehousing Concepts. | Related:
|
|
Performance Testing Decision Support SystemsData warehouses are built with materialized views. Unlike an ordinary view, which does not take up any storage space because they are generated on the fly, materialized views provide indirect access to table data by storing the results of an aggregation query in a separate schema object. A materialized view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX). Materialized summary tables use aggregate keys to define a hierarchy of aggregation. Several dimensions can be contained in each aggregate key. Do retrievals at different level of aggregation exihibt different performance? Materialized views can be stored as summary tables in the same database as their base tables. This can improve query performance within OLTP systems (when Oracle QUERY REWRITE and init.ora COMPATIBLE is enabled). Most databases do not have enough CPU capacity to handle both OLTP and the heavy demands of OLAP processing. So to ensure good performance, data marts generally make materialized views available in a separate schema on another machine than the OLTP system. The two databases are sychronized on a nightly (rather than instantaneous) basis.
Materialized views stored in the same Oracle database as the master table's data use a materialized view log schema object to record changes to be refreshed incrementally.
The TPC-H benchmark software includes a random data generation program that can be handy even if you're not running the benchmark. |
Basic Concepts
|
Manual filing systems |
Inflows
Data from OLTP and legacy systems provide Inflow into staging servers of a data warehouse. For example, in a bank, data is gathered from loan processing, pass book processing, and accounting systems. In a retail store, data is gathered from point-of-sale devices, cash registers, and entry/exit monitors. The first step is typically data cleansing. MMD's (Multi-Dimensional Databases) address the use of data generated by on-line transaction processing (OLTP) systems. Neil Raden's Modeling the Data Warehouse article excerpted in the January 29, 1996 issue of Information Week, identified these differences:
Collected data typically go through cleaning and transformation before being mapped and loaded into the warehouse. Examples of data cleaning include removing inconsistencies, adding missing fields, and cross-checking for data integrity. Examples of data transformation include adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data. Information about data in the warehouse (such as the location, description, and other information) about the data structures is referred to as Metadata. Like a library card catalog, metadata helps the reader determine if an item of information exists in the library and if it does, provides a description of it and points to its location.. The technical description for each item of information include file name, file type, location, data source, rules used in cleaning or mapping, and date of creation or last access. The business catalog enables end users to interpret the contents of a data warehouse by: |
Get this print framed for your wall!
|
Upflows
Data warehouses provide an Upflow of summaries rolled-up from the detailed data. Averages and summaries are pre-calculated and stored as a separate unit to answer common queries raised in decision-making. Companies offering products to perform ETL (Extraction, Transformation, and Loading) include:
These tools are faster and less error prone than manual scripting because they generate SQL code based on metadata about changing sources or targets. NOTE: The OWB repository can be exported as a metadata loader (.mdl) file type. ETL tools also provide a visual record of the process that can be adjusted when sources or targets change. A variant of ETL is ELT (Extract, Load, and then Transform).
The greatest value (payback) from a Data warehouse is in the way it provide quick and flexible
Outflows
of transformed data that leads its users to actionable insights
useful to forecasting and planning decisions.
This is why such systems are called "Business Intelligence".
The heart of a data warehouse is its planning and analysis applications called On-Line Analytical Processing (OLAP). The term "MOLAP" for Multi-dimensional OLAP is also used because, unlike OLTP entity-relationship models consisting of two-dimensional tables, data warehouses use a multi-dimensional model for storing data. Information in a Data Warehouse is organized into various dimensions. For example,
The most common basis for summarization are the dimensions:
To group applications such as Finance, Sales, etc. which can share common dimensions, SAP's Business Planning and Consolidation (BPC) application uses the concept of Application Sets, equivalent to a single MS Analysis Services database. Each dimension, such as time, can be structured in a hierarchy of consolidation levels -- years, quarters, months, weeks, individual days, or other level of data granularity. But "day of the week" is an extended attribute. The lower (finer, more detailed) the level of granularity available for analysis, the more costly it is to store and process the data. Other dimensions depend on business needs:
At the center of the data model, measures (numeric attributes such as sales dollars, Invoice Amount, etc.) are stored in a fact table. To make access information multi-dimensionally, fact tables also contain several foreign keys used to join facts to several dimension tables. Dimension tables organize and index the data stored in a fact table. A visual representation of this connection between fact tables and dimension tables appears as a star. |
Larry Greenfield's Data Warehousing Information Center is the ultimate portal on data warehousing, decision support, and data mining. Included is a "rant and rave" on the definition of data warehousing. Neil Raden offers Star Schema 101, Data, Data Everywhere a white paper excerpted in the October 30, 1995 issue of Information Week on selecting the OLAP technology, and ocsqlug-unsubscribe@yahoogroups.com Ralph Kimball, Data Warehouse Consultant. Chuo-Han's Data Wharehousing resource site The Data Warehousing Institute. Directory of Data Warehouse, Data Mining, and Decision Support Resources Technology Guides for Data Warehousing Professionals Seth Grimes' OLAP Pages has links to resources specifically related to On-Line Analytical Processing (OLAP). TeraCLIN provides the application of star schemas to health care. High-end databases:
|
A surrogate key is used to maintain a hierarchy. For example, the LOCATION table's Store_id consists of 3 hierachical levels: Area, Region, and Store codes.
Population values are summarized at each level in the hierarchy.
The fact table's primary key is a concatenated key containing a concatenated key which consists of the foreign keys from every dimension table.
This star model is the end-user's view of data.
The star model is not "normalized". Normalization would turn the data models into looking like a snowflake where dimension tables are joined to other dimension tables.
Most warehouses end up with a mixed model to balance speed and complexity.
For faster queries, Mini-dimensions contain a subset of a larger dimension. Minidimensions contain just current data, a filtered set of rows, or a subset of attributes.
| Your first name: Your family name: Your location (city, country): Your Email address: |
Top of Page
Thank you! |