Friday, January 04, 2008

fwd: 数据仓库 及 ETL

Kimball and Inmon

Raugh Kimball自下而上,Bill Inmon's approach is to introduce a fully normalized data warehouse into which all the data for a corporation is loaded, before flowing out into dependant data marts or data marts for specific purposes. This approach was previously known as Enterprise Data Warehouse, but more recently has become known by the somewhat awkward term Corporate Information Factory or CIF.

MS的经验是Kimball更合适:it's difficult to justify the effort and expense of creating and maintaining an additional normalized data warehouse, especially because of the difficulty of tying concrete returns on the investment back to this central data warehouse


主要term:
ODS (operational data store)
EDW (enterprise data warehouse)
Staging area
Presentation area
Data warehouse
Data mart

ROLAP/MOLAP传统:MOLAP只适用于低维,小数据量。用batch curve和query curve的曲线之和在processor requirements/degree of compilation坐标上比较得出:见书:Dw and decision support Vol2 Spiral Books (Inmon, also likes inverted index,克服b-tree or hashed inexing:以西部作为key,指向记录号1,3,12作)

2005:用缓慢变化维来抓dimension表

j2ee的模型都是beans,比如MVC设计模式中MailingBeans的实例,由MailingBeanFactory.newInstance()

Dimension tables are the context of a business’ measurements.
The chapter clearly illustrates how to:
Assign surrogate keys
Load Type 1, 2 and 3 slowly changing dimensions
Populate bridge tables for multivalued and complex hierarchical
dimensions
Flatten hierarchies and selectively snowflake dimensions

Fact tables hold the measurements of the business. In most data warehouses,
fact tables are overwhelmingly larger than dimension tables, but at the same
time they are simpler.

Chapter 9 defines the three types of metadata—business, technical,
and process—

Atomic and Aggregate Fact Tables按时间段partition事实表,可以减少可能的查询扫描

The grain of a fact table isis the definition of what constitutes a unique
fact table record. I


The preceding example illustrates the need to make all ETL jobs reentrant so
that the job can be run a second time, either deliberately or in error, without
updating the target database incorrectly.

如今的olap已经可以含和生产数据库一样粒度
维度表与事实表区别:事实表的每列就是一个维度

事实表由表的粒度grain定义

所有事实表都有外键连到纬度,以确定其衡量dimension
one or more numerical measurement fields, which we call facts.

事实表通常有3个及以上facts.early retail databases usually had only three or four dimensions (usually product, market, time, and promotion).几乎每个事实表都有由几个字段组成的pk

We will see that this grain can then later be expressed in terms of the dimension
foreign keys and possibly other fields in the fact table,

维度表It is a superbly flexible, symmetric framework suitable for all classes of queries。Similarly, the best dimensions are verbose descriptions of the dimensional
entities.
Dimensional attributes are mostly textual or are numbers that take on discrete
values. Dimension tables should always be built with a single primary
key field that is a simple meaningless integer assigned by the ETL process.
These keys are called surrogate(替换) keys.


The primary surrogate keys in each dimension are paired with corresponding
foreign keys in the fact table. When this primary-to-foreign key
relationship is adhered to, we say that the tables obey referential integrity.


数据仓库,是在数据库已经大量存在的情况下,为了进一步挖掘数据资源、为了决策需要而产生的,它决不是所谓的“大型数据库”。那么,数据仓库与传统数据库比较,有哪些不同呢?让我们先看看W.H.Inmon关于数据仓库的定义:面向主题的、集成的、与时间相关且不可修改的数据集合。
  “面向主题的”:传统数据库主要是为应用程序进行数据处理,未必按照同一主题存储数据;数据仓库侧重于数据分析工作,是按照主题存储的。这一点,类似于传统农贸市场与超市的区别—市场里面,白菜、萝卜、香菜会在一个摊位上,如果它们是一个小贩卖的;而超市里,白菜、萝卜、香菜则各自一块。也就是说,市场里的菜(数据)是按照小贩(应用程序)归堆(存储)的,超市里面则是按照菜的类型(同主题)归堆的。
  “与时间相关”:数据库保存信息的时候,并不强调一定有时间信息。数据仓库则不同,出于决策的需要,数据仓库中的数据都要标明时间属性。决策中,时间属性很重要。同样都是累计购买过九车产品的顾客,一位是最近三个月购买九车,一位是最近一年从未买过,这对于决策者意义是不同的。
  “不可修改”:数据仓库中的数据并不是最新的,而是来源于其它数据源。数据仓库反映的是历史信息,并不是很多数据库处理的那种日常事务数据(有的数据库例如电信计费数据库甚至处理实时信息)。因此,数据仓库中的数据是极少或根本不修改的;当然,向数据仓库添加数据是允许的。
  数据仓库的出现,并不是要取代数据库。目前,大部分数据仓库还是用关系数据库管理系统来管理的。可以说,数据库、数据仓库相辅相成、各有千秋。
  补充一下,数据仓库的方案建设的目的,是为前端查询和分析作为基础,由于有较大的冗余,所以需要的存储也较大。为了更好地为前端应用服务,数据仓库必须有如下几点优点,否则是失败的数据仓库方案。
  1.效率足够高。客户要求的分析数据一般分为日、周、月、季、年等,可以看出,日为周期的数据要求的效率最高,要求24小时甚至12小时内,客户能看到昨天的数据分析。由于有的企业每日的数据量很大,设计不好的数据仓库经常会出问题,延迟1-3日才能给出数据,显然不行的。
  2.数据质量。客户要看各种信息,肯定要准确的数据,但由于数据仓库流程至少分为3步,2次ETL,复杂的架构会更多层次,那么由于数据源有脏数据或者代码不严谨,都可以导致数据失真,客户看到错误的信息就可能导致分析出错误的决策,造成损失,而不是效益。
  3.扩展性。之所以有的大型数据仓库系统架构设计复杂,是因为考虑到了未来3-5年的扩展性,这样的话,客户不用太快花钱去重建数据仓库系统,就能很稳定运行。主要体现在数据建模的合理性,数据仓库方案中多出一些中间层,使海量数据流有足够的缓冲,不至于数据量大很多,就运行不起来了。


Specifically, the ETL system:
Removes mistakes and corrects missing data
Provides documented measures of confidence in data
Captures the flow of transactional data for safekeeping
Adjusts data from multiple sources to be used together
Structures data to be usable by end-user tools

分为plan&designing stream(需求-〉设计-〉实现-〉测试,部署)和data flow stream(extraction, clean, conform, deliver)

The extract step includes:
Reading source-data models
Connecting to and accessing data
Scheduling the source system, intercepting notifications and
daemons
Capturing changed data
Staging the extracted data to disk
The clean step involves:
Enforcing column properties
Enforcing structure
Enforcing data and value rules
Enforcing complex business rules
Building a metadata foundation to describe data quality
Staging the cleaned data to disk
This step is followed closely by the conform step, which includes:
Conforming business labels (in dimensions)
Conforming business metrics and performance indicators (in fact
tables)
Deduplicating
Householding
Internationalizing
Staging the conformed data to disk

Data delivery from the ETL system includes:
Loading flat and snowflaked dimensions
Generating time dimensions
Loading degenerate dimensions
Loading subdimensions
Loading types 1, 2, and 3 slowly changing dimensions
Conforming dimensions and conforming facts
Handling late-arriving dimensions and late-arriving facts
Loading multi-valued dimensions
Loading ragged hierarchy dimensions
Loading text facts in dimensions
Running the surrogate key pipeline for fact tables
Loading three fundamental fact table grains
Loading and updating aggregations
Staging the delivered data to disk

如果只有一个巨大monster维度表(比如上亿用户),仍有可能快速插入fact。秘密:对维度表的lookup table和插入数据都提前派序 (用natuaral key)。

查找表:含natual key 和surrogate key

为每个维度维护一个special surrogate key lookup table,这样加快查找。每个dimension entity建立或者type 2变化发生时更新

但如果fact纪录迟到,则无法用查找表

type1 没有历史

0 Comments:

Post a Comment

<< Home