Crypto, data analysis and BI (business intelligence, data mining and Bitcoin): August 2008

Sunday, August 31, 2008

Windmill cover letter and CV suggestions

CL:

Use "Dear Forename" when available;
Always type, on one side only.
Use thick, good-quality paper: not the normal 80 g/m2 but 90 or 100 g/m2
Explain how you obtained key requirements and give evidence
Use 1st class mail

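AT THE END of the interview ask "will I be able to get feedback?" to create a positive hook and leave the door open even if rejected

Set an objective for networking, e.g. "I want six more people to be aware of the skills I can offer"; be persistent and try again later (ask for permission?)

Group interview: the teammate taking notes recorded things carelessly and made up conditions that were not in the materials; I should have pointed this out.
Plot risk against likelihood on an x-y chart to show a structured, quantified way of choosing which risks to address.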

Bluetree:

INTERVIEW QUESTIONS & SAMPLE ANSWERS

Questions you may be asked in Interview

This list will help give you that vital edge in interviews. The trick is to find out what your client is looking for. Once you feel you know this your confidence will grow. Below is a list of questions employers often ask (including some difficult ones). After each question we explain what the interviewer is really looking for. Remember to put yourself in the employer’s shoes and think about what lies behind each line of questioning.

Tell me about yourself

Employers are looking for a quick snapshot of you (both your background and your personality) and how well you sell yourself and your capabilities. Don’t ramble on.

Why did you apply for the job?

This looks at your levels of motivation and commitment. Make sure you research thoroughly what the job entails. State the benefits you feel you can offer. Say why you want this job – not why you are leaving your present one.

Tell me what you do in your spare time?

This question has a double purpose. To make sure that you have a fully rounded personality and to ensure that your hobbies won’t interfere with your job. Go over any outside interests quickly, highlighting any job relevance and outlining the skills you have developed through them.

When have you been involved in teams?

Employers want a team player, so give examples of your role within teams (eg creative, promoter, developer, organiser, inspector, maintainer, adviser). Underline what you learned and how it has made you more effective in a team. Link your answer directly to the job you’re after – check if they are looking for a creative, resourceful team member, a detail orientated person who will see tasks through or a positive team leader, and then tailor your answer accordingly.

What are your main strengths and weaknesses?

This revolves around self-awareness. Again, link your strengths to the particular job. Employers want someone who knows what they are good at and where they need to improve. Everybody has a weakness but employers want to know what you are doing to improve. Choose a positive weakness and turn it into a strength, eg ‘I’m a bit of a perfectionist but that’s good for quality’ or ‘my financial skills aren’t as sharp as I’d like but I’m attending an evening class in bookkeeping’.

Why should we employ you?

What skills do you have that could add value to the company? Make brief but telling comparisons between the job description and your ability to meet their needs. State briefly what you can offer and back up anything you say with facts.

What has been your biggest achievement?

This reveals what motivates you (family, work, education or leisure). Choose something that makes you stand out and involves positive characteristics e.g. you developed determination, strength of character etc.

What have you learned from your past work experiences?

This focuses on the skills developed in previous jobs (vacation, part-time, full-time). Think about those jobs. Did you have any responsibility? Pull out the positive elements and focus on benefits to the employer.

When did you last work under pressure or deal with conflict – and how did you cope?

This is aimed at discovering if you can deal with problems quickly and efficiently and confront a situation if you become frustrated. The best technique is to think of an example and explain how the situation arose – then say how you dealt with it. If asked directly if anything made you annoyed or frustrated, be truthful but avoid appearing negative.

What is the biggest problem/dilemma etc you’ve ever faced?

Try to choose something that will show you in a positive light. How did you get over it? What did you learn? Try and keep it work related if possible and not eg about an ongoing dispute with your neighbour. Your answer will not only show how you cope under stress but also your decision making ability and strength of character.

What other career opportunities are you looking at?

This will illustrate how well you have researched and thought through your chosen career area. It will also show an employer how much you really want the job. If you list a long series of unrelated career options, it will cast doubt on your motivation.

Where would you like to be in 5 or 10 years time?

Again, if you have a clear idea, it will show commitment and vision. If you do have some insight into where you are heading, think of some of the functions and responsibilities you would hope to have.

When have you had to…..?

Employers want real evidence that clearly demonstrates you have particular skills. Draw up a list of key skills required for the position (found by dissecting the job ad, job description and personal specification) and highlight at least two situations or achievements that prove you have each skill. Practice talking through each example and present a concise, hard-hitting case. Avoid waffle and keep it sharp.

What would you do in ……..situation?

Situational questions are used to test your overall style and approach. Carefully prepare by listing all the roles you’ll potentially undertake in the new position and think up awkward questions yourself.

So, sell me this product.

Role play questions really make you think on your feet. Once again, do your homework. Be prepared to demonstrate your skills in action.

What salary do you expect?

Work out a salary range you consider reasonable – job ads and job websites will give you an idea. Don’t undersell or oversell yourself. Give a range and indicate that you are prepared to negotiate.

How competent are you at ……?

Many employers now like to assess candidates using scoring grids with a work-based framework. This makes it important to quote practical examples showing your level of competence.

Are you pregnant/gay/etc?

Yes, it’s an outrageous question but always be on the alert for it. It may be designed to shock you and assess your reactions. It may equally reflect the fact that some employers lack formal training in interview techniques and fall back on crude stereotypes. Whatever the reason, it’s vital not to lose your cool – just write it off as ignorance.

You haven’t been much of a success so far, have you?

The aggressive approach may also throw you. The reasons could be the same but this time it is more likely to be a deliberate attempt to unnerve you. Again, keep your composure; it’s probably the reaction they are looking for.

Do you have any questions?

Always expect this one – so prepare a list. Include a few probing questions to show you’ve done your homework. Don’t be afraid to write them down and take them to the interview with you.

Other Questions which may be asked

* What brings you to the job market at this point in your career?
* Why would you like to work for this company in particular?
* What attracts you to this role?
* If you could change anything about your career so far, what would it be?
* How would members of your team describe you?
* What important points came out of your last appraisal?
* Describe your management style.
* What do you look for in a manager?
* Describe your toughest client.
* What do you want from your next role?
* What does success mean to you?
* What are the key things that drive or motivate you?
* Describe a difficult work scenario and how you managed it.

Questions to ask your interviewer

* How has this vacancy arisen?
* How would you describe the firm/company culture?
* What do you see as the key challenges of this role?
* How do you differentiate yourselves from your competitors?
* What are the organisation’s major business objectives in the coming year?
* How are employees measured in terms of performance?
* What processes exist to support employees in their career development?
* How would you describe the firm/company’s values?
* What key issues currently face the organisation?
* What can I expect to be involved in during my first six months of joining?
* What are the department’s priorities during the next six months?

Further Advice

Do not hesitate to ask your Recruitment Consultant for any additional advice; remember, it is their job to assist you in getting your ideal job.

Wednesday, August 27, 2008

fwd: An in-depth look at practical techniques for data warehouse modeling and ETL


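Extended star schema
In data warehouse database design, the star schema is a basic data model. A star schema is a multidimensional data relationship made up of one fact table (Fact Table) and a set of dimension tables (Dimension Table). Each dimension table has a dimension as its primary key, and all of these dimensions together form the primary key of the fact table; in other words, every element of the fact table's primary key is a foreign key to a dimension table. The non-key attributes of the fact table are called facts (Fact); they are generally numeric or otherwise computable values, while dimensions are mostly text, time and similar types of data. Figure 2 shows the extended star schema:
With this extended star schema, the multi-level sub-dimension structure reduces the content of the first-level sub-dimension tables and avoids large amounts of repeated data in them, so that a complex data model stays simple and clear.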

An in-depth look at practical techniques for data warehouse modeling and ETL

( 2008/8/5 10:56 )

This article takes an in-depth look at the methods and principles to follow when building a data warehouse. The details follow below:

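I. Data warehouse architecture

A data warehouse (Data Warehouse, DW) is a relational database in which data is stored according to a specific schema to support multidimensional analysis and presentation from multiple angles. Its data comes from the OLTP source systems. The data in a data warehouse is detailed, integrated and subject-oriented, and exists to serve the analysis needs of OLAP systems.

Data warehouse architecture comes in two models: the star schema (Figure 2: pic2.bmp) and the snowflake schema (Figure 3: pic3.bmp). As the figures show, a star schema has the fact table in the middle and the dimension tables around it, like a star. In a snowflake schema the fact table is also in the middle, but the dimension tables on either side can have their own related sub-tables, which expresses the dimension hierarchy clearly.

Considering both the analysis needs of the OLAP system and the processing efficiency of ETL: the star schema aggregates quickly and is efficient to analyze, while the snowflake schema has an explicit structure and makes it easier to exchange data with the OLTP systems. In real projects we therefore combine the star and snowflake schemas when designing a data warehouse.

Next, let us look at the process of building an enterprise data warehouse.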

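II. Five steps to building an enterprise data warehouse

(1) Determine the subject

That is, determine the subject of the data analysis or front-end presentation. For example, we may want to analyze beer sales in a given region in a given month of a given year; that is a subject. A subject should express the relationship between the angles of analysis (dimensions) and the numeric statistics (measures) for some aspect of the business, and both need to be considered together when determining the subject.

We can picture a subject as a star: the numeric statistics (measures) live in the fact table at the center of the star, the angles of analysis (dimensions) are the points of the star, and we examine the measures through combinations of dimensions. The subject "beer sales in a given region in a given month of a given year" thus requires us to examine the sales measure through the combination of two dimensions, time and region. Different subjects therefore draw on different subsets of the data warehouse, which we can call data marts. A data mart reflects one aspect of the information in the data warehouse, and multiple data marts make up the data warehouse.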
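
As a rough illustration of examining a measure through a combination of dimensions, the sketch below totals a sales measure by time and region. The rows in `fact_sales`, the column names and the helper `sales_by` are hypothetical and are not taken from the article.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension values plus one measure (sales).
fact_sales = [
    {"year": 2008, "month": 7, "region": "North", "sales": 120.0},
    {"year": 2008, "month": 7, "region": "South", "sales": 80.0},
    {"year": 2008, "month": 8, "region": "North", "sales": 150.0},
    {"year": 2008, "month": 8, "region": "North", "sales": 50.0},
]

def sales_by(facts, *dimensions):
    """Sum the sales measure for each combination of the given dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["sales"]
    return dict(totals)

# The same facts viewed through two different dimension combinations.
by_month_and_region = sales_by(fact_sales, "year", "month", "region")
by_region = sales_by(fact_sales, "region")
```

The same fact rows answer both questions; only the set of dimensions used for grouping changes.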

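(2) Determine the measures

Once the subject is determined, we consider the indicators to be analyzed, such as annual sales. These are generally numeric data. We either sum the data, or take counts, distinct counts, or maximum and minimum values; data of this kind are called measures.

Measures are the indicators to be calculated and must be chosen appropriately in advance. Complex key performance indicators (KPIs) and similar calculations can be designed on top of different measures.

(3) Determine the grain of the fact data

After determining the measures, we consider how each measure is summarized and how it aggregates under different dimensions. Because measures are aggregated to different degrees, we adopt the "minimum grain principle": set the grain of the measure as fine as possible.

For example, suppose the data is currently recorded down to the second, i.e. the database records the transaction amount for every second. If we can confirm that future analysis will only ever need time to be accurate to the day, we can summarize the data by day during ETL processing, and the grain of the measure in the data warehouse is then "day". Conversely, if we cannot confirm whether future analysis will need time accurate to the second, we need to follow the minimum grain principle and keep the per-second data in the fact table so that "seconds" can be analyzed later.

When adopting the minimum grain principle, we do not need to worry about the efficiency of summarizing and analyzing massive data volumes, because when we later build the multidimensional analysis model (CUBE) we pre-aggregate the data, which keeps analysis results fast. Questions about building the multidimensional analysis model (CUBE) will be covered in the next installment of this column.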
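
The trade-off in choosing a grain can be sketched as follows: per-second records can always be rolled up to days, but daily totals cannot be split back into seconds. The sample transactions and the function name `roll_up_to_day` are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-second transaction records: (timestamp, amount).
transactions = [
    (datetime(2008, 8, 1, 9, 0, 1), 10.0),
    (datetime(2008, 8, 1, 9, 0, 2), 5.5),
    (datetime(2008, 8, 2, 14, 30, 0), 7.0),
]

def roll_up_to_day(rows):
    """Aggregate second-level amounts into one total per calendar day."""
    totals = defaultdict(float)
    for ts, amount in rows:
        totals[ts.date()] += amount
    return dict(totals)

# Day-grain view derived from the finest-grain data.
daily = roll_up_to_day(transactions)
```

If only `daily` were stored in the fact table, the two separate transactions on 1 August could no longer be distinguished.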

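(4) Determine the dimensions

Dimensions are the angles of analysis. For example, we may want to analyze by time, by region, or by product; time, region and product are then the corresponding dimensions. Based on different dimensions we can see how each measure is summarized, and we can also cross-analyze across all dimensions.

Here we first need to determine the hierarchies (Hierarchy) and levels (Level) of each dimension (Figure 4: pic4.bmp). As the figure shows, in the time dimension "year - quarter - month" forms one hierarchy, in which "year", "quarter" and "month" are the three levels. Similarly, when building the product dimension we can treat "product category - product subcategory - product" as one hierarchy containing the three levels "product category", "product subcategory" and "product".

So in what form do the dimensions used in our analysis exist in the data warehouse?

We can make the three levels three columns of a single table, as for the time dimension; or we can use three tables that store product categories, product subcategories and products respectively, as for the product dimension. (Figure 5: pic5.bmp)

It is also worth mentioning that we should make full use of surrogate keys when building dimension tables. A surrogate key is a numeric ID (for example, the first column of each table in Figure 6) that uniquely identifies each dimension member. More importantly, matching and comparing numeric columns during aggregation makes JOINs efficient and aggregation easy. Surrogate keys also matter a great deal for slowly changing dimensions: when the primary key in the source data stays the same, the surrogate key distinguishes new data from historical data.

At this point we may as well discuss how dimension tables change over time. This is a situation we meet often, and we call it a slowly changing dimension.

For example, when we add a new product, or a product's ID is changed, or a product gains a new attribute, the dimension table is modified or gains new rows. During ETL we therefore have to handle slowly changing dimensions. There are three cases:

1. Slowly changing dimension, type 1:

Historical data needs to be corrected. In this case we use UPDATE to modify the data in the dimension table. For example: a product's ID is 123, it is later found to be wrong and must be changed to 456; during ETL processing we simply overwrite the original ID in the dimension table with 456.

2. Slowly changing dimension, type 2:

Historical data is kept, and the new data is kept as well. The original row is updated and the new row is inserted, so we use UPDATE / INSERT. For example, an employee was in department A in 2005 and moved to department B in 2006. Statistics for 2005 should place the employee in department A, statistics for 2006 should place the employee in department B, and any new data that arrives later is processed under the new department (B). Our approach is to add a flag column to the dimension member list, marking historical rows as "expired" and the current row as "current". Another approach is to timestamp the dimension, i.e. store the period during which each historical row was in effect as one of its attributes, and join on that period when matching against the source table to generate the fact table. The advantage of this approach is that the effective period of each dimension member is explicit.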
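
A minimal sketch of the type 2 handling, combining the "current/expired" flag and the effective-period approach in one in-memory dimension table. The column names, the `OPEN_END` sentinel date and the helper functions are assumptions for illustration; a real implementation would run the equivalent UPDATE and INSERT against the dimension table.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still in effect"

def apply_scd2(dim_rows, employee_id, new_department, change_date):
    """Expire the current row for an employee and append a new current row."""
    for row in dim_rows:
        if row["employee_id"] == employee_id and row["is_current"]:
            if row["department"] == new_department:
                return dim_rows              # no change, nothing to do
            row["valid_to"] = change_date    # UPDATE: mark the old row expired
            row["is_current"] = False
    next_key = max((r["employee_key"] for r in dim_rows), default=0) + 1
    dim_rows.append({                        # INSERT: new row, new surrogate key
        "employee_key": next_key,
        "employee_id": employee_id,
        "department": new_department,
        "valid_from": change_date,
        "valid_to": OPEN_END,
        "is_current": True,
    })
    return dim_rows

def department_on(dim_rows, employee_id, day):
    """Look up the department in effect for an employee on a given day."""
    for row in dim_rows:
        if row["employee_id"] == employee_id and row["valid_from"] <= day < row["valid_to"]:
            return row["department"]
    return None

dim_employee = [{
    "employee_key": 1, "employee_id": "E1", "department": "A",
    "valid_from": date(2005, 1, 1), "valid_to": OPEN_END, "is_current": True,
}]
apply_scd2(dim_employee, "E1", "B", date(2006, 1, 1))
```

The source key `E1` stays the same across both rows; the surrogate key `employee_key` is what tells the historical row from the current one.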

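3. Slowly changing dimension, type 3:

A dimension member gains a changed attribute in new data. For example, a new column is added to a dimension member; historical data cannot be browsed by this column, but current and future data can. In this case we need to change the dimension table's attributes by adding the new column. We then use a stored procedure or program to generate the new dimension attribute, and later data is viewed based on the new attribute.

(5) Create the fact table

Once the fact data and the dimensions have been determined, we consider loading the fact table.

When a company's data has piled up like a mountain and we want to see what is actually in it, we find record after record of production and record after record of transactions. These records are the raw data for the fact table we are about to build, i.e. the table of fact records for a given subject.

Our approach is to join the source table with the dimension tables to generate the fact table (Figure 6: pic6.bmp). Note that when the join columns contain nulls (the data source is dirty), an outer join is required. After the join we take the surrogate key of each dimension and store it in the fact table. Besides the dimension surrogate keys, the fact table also holds the measures, which come from the source table. The fact table should contain the dimension surrogate keys and the measures but no descriptive information; that is, it should follow the "thin and tall" principle: as many rows as possible (finest grain) and as little descriptive information as possible.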
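
The join described above can be sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical, and mapping unmatched rows to a `-1` "unknown" surrogate key is an assumption of this sketch rather than something the article prescribes; the point is only that a LEFT JOIN keeps source rows whose product code is null or has no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_code TEXT);
CREATE TABLE src_sales (sale_date TEXT, product_code TEXT, amount REAL);
CREATE TABLE fact_sales (sale_date TEXT, product_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (-1, 'UNKNOWN'), (1, 'P001'), (2, 'P002');
INSERT INTO src_sales VALUES
    ('2008-08-01', 'P001', 100.0),
    ('2008-08-01', NULL, 40.0),      -- dirty row: missing product code
    ('2008-08-02', 'P999', 25.0);    -- dirty row: code not in the dimension
""")

# The fact table keeps only surrogate keys and measures, no descriptions.
conn.execute("""
INSERT INTO fact_sales (sale_date, product_key, amount)
SELECT s.sale_date, COALESCE(d.product_key, -1), s.amount
FROM src_sales AS s
LEFT JOIN dim_product AS d ON d.product_code = s.product_code
""")

fact_rows = conn.execute(
    "SELECT sale_date, product_key, amount FROM fact_sales "
    "ORDER BY sale_date, amount"
).fetchall()
```

With an inner join, the two dirty rows would silently disappear from the fact table and the sales totals would no longer match the source.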

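If extensibility is a concern, a unique identifier column can be added to the fact table so that the fact can later be used as a snowflake dimension. When this is not needed, however, it is generally not recommended.

The fact table is the core of the data warehouse and needs careful maintenance. The fact table obtained after the JOIN usually has a large number of rows, so we need to give it a composite primary key and indexes to ensure data integrity and to optimize query performance on the data warehouse. The fact table is stored in the data warehouse together with the dimension tables. If the front end needs to connect to the data warehouse to run queries, we also need to build intermediate summary tables or materialized views to make querying easier.

III. What is ETL

In building a data warehouse, ETL runs through the entire project. It is the lifeline of the whole data warehouse and covers data cleansing, integration, transformation, loading and every other step. If the data warehouse is a building, ETL is its foundation. How well ETL extracts and integrates the data directly affects the final results presented, so ETL plays a critical role in the whole data warehouse project and must be given high priority.

ETL stands for Extract, Transform, Load. It means extracting data from the OLTP systems, transforming and integrating data from different sources into consistent data, and then loading it into the data warehouse. For example, the figure below shows the effect of an ETL data transformation.

In this transformation we carried out three operations: correcting data formats, merging data fields, and calculating new indicators. Similarly, we can refine the data in the data warehouse according to other requirements.

In short, ETL lets us generate the data warehouse from the data in the source systems. ETL builds the bridge between the OLTP systems and the OLAP systems.

IV. Practical project techniques

(1) Using a staging area

When building a data warehouse, if the data source is on one server and the data warehouse is on another, and the source server is accessed frequently, holds large volumes of data and is updated continuously, a staging area database can be set up (Figure 8: pic8.bmp). Data is first extracted into the staging area and then processed there. The benefit is that it avoids frequent access to the original OLTP system for calculations, sorting and similar operations.

For example, we can extract data into the staging area day by day and, working from the staging area, transform and integrate the data and make data from different sources consistent. The staging area holds the raw extract tables, intermediate transformation tables, temporary tables, ETL log tables and so on.

(2) Using timestamps

The time dimension is very important for any fact subject, because different times carry different statistics, so information recorded by time plays an important role. Timestamps have a special role in ETL: in the slowly changing dimensions described above, we can use timestamps to identify dimension members; and when recording operations on the database and the data warehouse, we also use timestamps to mark the information. For example, when extracting data we extract from the OLTP system by timestamp: to take the previous day's data at midnight (0:00), we select the rows whose timestamp in the OLTP system falls between GETDATE minus one day and GETDATE, which gives us the previous day's data.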
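
A sketch of the previous-day extraction window in Python. Unlike a plain "now minus one day to now" range, this version truncates the run time to midnight first, so a job that starts a few seconds late still extracts exactly one calendar day; that truncation, the `updated_at` column name and the sample rows are assumptions of the sketch.

```python
from datetime import datetime, timedelta

def previous_day_window(run_time):
    """Return [start, end) covering the calendar day before run_time."""
    end = run_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return end - timedelta(days=1), end

def extract_previous_day(rows, run_time):
    """Keep source rows whose timestamp falls inside the previous day."""
    start, end = previous_day_window(run_time)
    return [row for row in rows if start <= row["updated_at"] < end]

# Hypothetical OLTP rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2008, 8, 26, 23, 59, 59)},
    {"id": 2, "updated_at": datetime(2008, 8, 27, 0, 0, 0)},
    {"id": 3, "updated_at": datetime(2008, 8, 25, 12, 0, 0)},
]

# The job runs five seconds after midnight on 27 August.
extracted = extract_previous_day(source_rows, datetime(2008, 8, 27, 0, 0, 5))
```

The half-open interval means a row stamped exactly at midnight belongs to the next day's run, so no row is extracted twice.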

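(3) Using log tables

Errors inevitably occur during data processing and produce error messages. How do we capture these messages and correct the problems promptly? We use one or more log tables to record them. In the log table we record, for each extraction, the number of rows extracted, the number processed successfully, the number that failed, the data that failed, the processing time and so on. When a data error occurs we can then easily find where the problem lies and correct or reprocess the bad data.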
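
The log record described above might look like the following sketch, where a Python list stands in for the log table. The field names and the `run_step` helper are hypothetical.

```python
from datetime import datetime

def run_step(rows, transform, log_table):
    """Apply transform to each row and append one summary record to the log."""
    started = datetime.now()
    succeeded, failed = [], []
    for row in rows:
        try:
            succeeded.append(transform(row))
        except (ValueError, KeyError) as exc:
            # Keep the failing row and the reason so it can be fixed and rerun.
            failed.append({"row": row, "error": str(exc)})
    log_table.append({
        "started_at": started,
        "finished_at": datetime.now(),
        "extracted": len(rows),
        "succeeded": len(succeeded),
        "failed": len(failed),
        "failed_rows": failed,
    })
    return succeeded

etl_log = []
clean = run_step(
    [{"amount": "10"}, {"amount": "abc"}, {"price": "3"}],
    lambda row: float(row["amount"]),
    etl_log,
)
```

Storing the failed rows themselves, not just the counts, is what makes it possible to correct and reprocess only the bad data.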

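(4) Using scheduling

Incremental updates to the data warehouse must be scheduled (Figure 9: pic9.bmp); that is, the fact table is updated incrementally. Before setting up scheduling, consider the volume of fact data and decide how often an update is needed. For example, if users want to view data by day, it is best to extract daily; if the data volume is small, the data can be updated monthly or every six months. If there are slowly changing dimensions, scheduling has to take dimension table updates into account: the dimension tables must be updated before the fact table.

Scheduling is a key part of the data warehouse and must be thought through carefully. Once the ETL flow is built it has to run on a regular basis, so scheduling is the key step in executing the ETL flow. Besides writing processing information to the log tables, each scheduled run should also send email or use an alerting service. This helps technical staff keep track of the ETL flow and improves safety and the accuracy of data processing.

V. Summary

Building an enterprise data warehouse takes five simple steps, and with these five steps mastered we can build a powerful data warehouse. Each step, however, has a great deal of depth to study and explore, and in real projects everything has to be considered together. For example, if the data source contains a lot of dirty data, we must cleanse the data before building the data warehouse to strip out unneeded information and dirty data.

ETL is the bridge between the OLTP systems and the OLAP systems, the channel through which data flows from the source systems into the data warehouse. In a data warehouse project it determines the data quality of the whole project, so it cannot be treated carelessly. It must be given high priority so that the foundation of the data warehouse building is solid!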

***********


QQ08.net
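Experience with innovative applications of data modeling
2007-09-27 20:38:45 Source: Internet
The author has worked with databases and data warehouses since 1998, nearly eight years now, and has had a good deal of exposure to data modeling. I would not claim any innovation; this article simply summarizes experience from that work for discussion and correction.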

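When it comes to data modeling, one point must be stressed first: a data modeler is quite different from a DBA. For a data modeler, a deep understanding of the business comes first, and the various modeling methods and techniques exist to serve business requirements. This article, however, sets the business aside for the moment and concentrates on summarizing experience with modeling methods and techniques.

Current database and data warehouse modeling methods fall into four main categories.

The first is the familiar third normal form (3NF) modeling of relational databases, which we normally use to build operational database systems.

The second is the 3NF data warehouse modeling advocated by Inmon, which differs somewhat in emphasis from 3NF modeling for operational database systems. Inmon's data warehouse modeling method has three layers. The first is the entity-relationship layer, i.e. the enterprise business data model layer, where the modeling method is the same as for the enterprise's operational database systems. The second is the data item set layer, where the modeling method departs from that of the operational database systems according to factors such as how often data is generated and accessed. The third, the physical layer, is the concrete implementation of the second layer.

The third is the dimensional modeling of data warehouses advocated by Kimball, which we usually also call star schema modeling, sometimes with some snowflake modeling mixed in. Dimensional modeling is oriented to user requirements, easy to understand and efficient to query, and it is the approach the author prefers.

The fourth is a more flexible approach, normally used in the back-end staging area. The modeling is not bound to any one style and aims only to meet the need at hand; the tables built expose no interface to users and are mostly temporary tables.

Below is some experience with the fourth category.

The defining feature of the staging area is that it never faces users directly, so the only people who work with its tables are the ETL engineers. ETL engineers can decide for themselves the scope and the life cycle of the data in each table. Two examples follow:

1) Temporary tables with a narrow data scope

When the volume of data to be integrated or cleansed is too large, we can create a temporary table with the same structure and keep in it only the part of the data we need to process. Both updates and calculations on particular columns then become much more efficient. The processed data is sent to the tables that are ready for loading, and finally loaded into the data warehouse in one pass.

2) Temporary tables with redundant columns

Because the tables in the staging area are used only by ourselves, adding redundant columns can be very useful without carrying any risk.

For example, the author once met the following requirement in a project: a customer table {customer ID, customer net deduction value} and a debt item table {debt item ID, customer ID, debt item balance, debt item net deduction value}, with a one-to-many relationship between customers and debt items. The customer net deduction value and the debt item balance are known, and the debt item net deduction value has to be calculated. The rule is to allocate the customer's net deduction value in proportion to the debt item balances. Here we can add a few redundant columns to the two tables: customer table {customer ID, customer net deduction value, customer balance} and debt item table {debt item ID, customer ID, debt item balance, debt item net deduction value, customer balance, customer net deduction value}. Three SQL statements then complete the whole calculation: sum the debt item balances into the customer balance; copy the customer balance and customer net deduction value redundantly into the debt item table; and calculate the debt item net deduction value directly in the debt item table with the formula (debt item balance × customer net deduction value / customer balance).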
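
The three statements could look roughly like this, run here through Python's built-in `sqlite3` module. The English table and column names and the sample figures are assumptions of this sketch; the author's actual SQL is not shown in the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id TEXT PRIMARY KEY,
    customer_net_deduction REAL,
    customer_balance REAL            -- redundant column
);
CREATE TABLE debt_item (
    debt_item_id TEXT PRIMARY KEY,
    customer_id TEXT,
    debt_balance REAL,
    debt_net_deduction REAL,
    customer_balance REAL,           -- redundant column
    customer_net_deduction REAL      -- redundant column
);
INSERT INTO customer (customer_id, customer_net_deduction) VALUES ('C1', 100.0);
INSERT INTO debt_item (debt_item_id, customer_id, debt_balance) VALUES
    ('D1', 'C1', 300.0), ('D2', 'C1', 100.0);
""")

# Statement 1: sum the debt item balances into the customer balance.
conn.execute("""
UPDATE customer SET customer_balance = (
    SELECT SUM(debt_balance) FROM debt_item
    WHERE debt_item.customer_id = customer.customer_id)
""")

# Statement 2: copy the customer totals redundantly onto each debt item.
conn.execute("""
UPDATE debt_item SET
    customer_balance = (SELECT customer_balance FROM customer
                        WHERE customer.customer_id = debt_item.customer_id),
    customer_net_deduction = (SELECT customer_net_deduction FROM customer
                              WHERE customer.customer_id = debt_item.customer_id)
""")

# Statement 3: allocate in proportion to each debt item's balance.
conn.execute("""
UPDATE debt_item
SET debt_net_deduction = debt_balance * customer_net_deduction / customer_balance
""")

allocated = conn.execute(
    "SELECT debt_item_id, debt_net_deduction FROM debt_item ORDER BY debt_item_id"
).fetchall()
```

With a customer net deduction of 100 and balances of 300 and 100, the items receive 75 and 25. A customer whose total balance is zero would need separate handling, since the division in statement 3 is then undefined.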

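There are many other table designs to be creative with, such as temporary tables without primary keys. In short, precisely because the staging area exposes no interface to users, we should make the most of this and design staging area tables with the aim of making our data processing work as convenient as possible.

Industry experience:

Experience with data warehouse architecture

Different architects have different principles and methods for data warehouse architecture. Here the author summarizes the architectures commonly adopted today and their strengths and weaknesses. These architectures are not specific to any one industry and can be drawn on by all.

First, it should be noted that the broad consensus in the data warehouse field today is that the data warehouse needs to keep atomic-level data that is consistent across the enterprise. The independent data mart architecture (Independent data marts) has no enterprise-wide consistent data and is likely to produce information silos; unless the enterprise is very small or the marts target only fixed subjects, this architecture is not recommended. The federated data warehouse architecture (Federated Data Warehouse Architecture), whether federated by geography or by function, requires separate data warehouses to be built on different platforms first and then integrated through reference data, which easily leads to incomplete integration, unless the federated architecture also adopts something like Kimball's bus architecture (Bus Architecture), i.e. keeps conformed dimensions in the staging area and updates them continuously. These two architectures are therefore outside the scope of this discussion. The remaining three are discussed below.

1) 3NF atomic layer + data marts

The greatest advocate of this architecture is Inmon, the father of data warehousing, and his Corporate Information Factory is the typical example. This architecture is also called the enterprise data warehouse (Enterprise Data Warehouse, EDW). The Corporate Information Factory is implemented by first integrating data across the whole enterprise and building the enterprise information model, i.e. the EDW. Data marts or exploration warehouses are then built for the various analysis needs, with the EDW as their data source. A 3NF atomic layer adds some complexity to building OLAP, but gives better support for more complex applications such as mining warehouses and exploration warehouses. This kind of architecture takes longer to build and costs correspondingly more.

2) Star schema (Star Schema) atomic layer + HOLAP

The greatest advocate of the star schema is Kimball, and his bus architecture is the typical example of this kind of architecture. The bus architecture is implemented by first establishing conformed dimensions and the calculation methods for conformed facts in the staging area, and then building data marts step by step on top of them. Each time a data mart is added, the conformed dimensions are integrated in the staging area and the integrated conformed dimensions are synchronized to all data marts. All the data marts taken together thus form an integrated data warehouse. Because the bus architecture can be built incrementally, its development cycle is shorter than that of the other architectures and its cost is lower. On a star schema atomic layer, aggregates can be built directly, and HOLAP can be built as well. The author leans towards Kimball's star schema atomic layer architecture and has the most experience with it.

3) 3NF atomic layer + ROLAP

This architecture is also called the centralized architecture (Centralized Architecture). The idea is to build ROLAP directly on the 3NF atomic layer; MicroStrategy does this particularly well. Defining ROLAP on a 3NF atomic layer is much more complex than defining it on a star schema atomic layer. This architecture demands extra effort when defining ROLAP, and the ROLAP metadata is not necessarily in a common format, so presentation on top of ROLAP may well be limited by the tool. This architecture is very similar to the first, only without the data marts on top of the atomic layer.

In summary, all three architectures are good choices. Projects that need quick results at low cost can consider the second, the bus architecture; enterprises with ample funding and a mature business data model can consider the first or the third.

Difficult points and techniques:

Experience with change data capture

In a data warehouse system, one very important goal is to keep the history of changes to the data, and change data capture (Change Data Capture, CDC) is a technique created for exactly this purpose. Common methods of change data capture are: 1) full-scan comparison of files or tables, 2) reading DBMS logs, 3) adding triggers to the source system, 4) using timestamps in the source system, 5) using replication, 6) change data capture features provided by the DBMS, and so on. Of these, change data capture provided by the DBMS, where the DBMS itself carries out the capture, is the way things are heading.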
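
The first method in the list, full-scan comparison, is the simplest to sketch: compare yesterday's complete extract with today's, keyed by primary key. The snapshot contents and the function name `diff_snapshots` are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two full extracts keyed by primary key."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: (previous[k], v)
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted

# Hypothetical customer extracts: primary key -> (name, city).
yesterday = {1: ("Alice", "Leeds"), 2: ("Bob", "York"), 3: ("Carol", "Bath")}
today = {1: ("Alice", "Leeds"), 2: ("Bob", "Hull"), 4: ("Dave", "Kent")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
```

This needs no cooperation from the source system, but it reads every row on every run and relies on the primary key staying stable between extracts.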

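In many industries such as banking and telecoms, operational records do not change once created; only information such as customers and products changes slowly over time, so change data capture is usually aimed at dimension tables. Kimball's analysis of slowly changing dimensions and his strategies for them can handle essentially every kind of change to a dimension table.

In some retail businesses, however, values such as the contract amount in a contract table may change after entry, which means that fact table data can change too. Changes to fact table data usually count as error correction, and the fact table can be updated directly in a way similar to slowly changing dimension TYPE 1. The author does not favor handling fact table changes by inserting a new correction record as a snapshot, because this causes too much trouble for downstream presentation and analysis programs.

Next is a rather thorny fact table change problem the author once met: the primary key of the fact table changed when certain data in the table changed. Take one of the contract tables as an example. Its primary key was a smart key generated from "supplier number" + "contract number"; whenever either the "supplier number" or the "contract number" changed, the primary key of the contract table changed with it, which made change data capture very difficult.

In the project, the author used triggers to implement change data capture. The implementation was as follows:

1) Create a new table to hold the captured data, with the columns "original primary key", "modified primary key" and any other columns needed; call it the "contract capture table".

2) Create triggers on Delete and Update of the original contract table. When a delete occurs, the Delete trigger inserts a row into the "contract capture table" with the "modified primary key" column empty, which marks the row as deleted. When an update occurs, the "original primary key", the "modified primary key" and the other columns to be recorded are saved to the "contract capture table", which marks the row as modified; if the "original primary key" and "modified primary key" differ, the primary key was changed, and if they are the same, it was not.

3) Because the primary key in the operational system usually becomes a degenerate dimension of the fact table in the data warehouse and may still act as a primary key, the data load has to examine the rows in the "contract capture table" case by case to decide whether to update the degenerate dimension in the fact table.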
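
The three steps above can be sketched with SQLite triggers via Python's built-in `sqlite3` module. The source DBMS in the project is not named, so SQLite trigger syntax, the table and column names, and the `classify` helper are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contract (contract_key TEXT PRIMARY KEY, amount REAL);

-- Step 1: the capture table.
CREATE TABLE contract_capture (old_key TEXT, new_key TEXT, amount REAL);

-- Step 2: triggers on DELETE and UPDATE of the contract table.
CREATE TRIGGER contract_after_delete AFTER DELETE ON contract
BEGIN
    INSERT INTO contract_capture VALUES (OLD.contract_key, NULL, OLD.amount);
END;

CREATE TRIGGER contract_after_update AFTER UPDATE ON contract
BEGIN
    INSERT INTO contract_capture
    VALUES (OLD.contract_key, NEW.contract_key, NEW.amount);
END;

INSERT INTO contract VALUES ('S01-C100', 500.0), ('S02-C200', 900.0);
UPDATE contract SET contract_key = 'S09-C100' WHERE contract_key = 'S01-C100';
DELETE FROM contract WHERE contract_key = 'S02-C200';
""")

def classify(old_key, new_key):
    """Step 3: decide what the load has to do with a captured row."""
    if new_key is None:
        return "deleted"
    if old_key != new_key:
        return "key changed"   # the degenerate dimension must be updated too
    return "updated"

captured = conn.execute(
    "SELECT old_key, new_key, amount FROM contract_capture ORDER BY rowid"
).fetchall()
actions = [classify(old_key, new_key) for old_key, new_key, _ in captured]
```

Each captured row carries both keys, so the load can find the existing fact row by its old key even after the smart key has changed in the source.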

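It is fair to say that this trigger-based change data capture is not a very good choice. First, it requires extensive privileges on the source system; second, triggers have a heavy impact on the performance of the source system. Unless there is no alternative, this method is not recommended.

Situations like this can be avoided entirely when the operational database system is built. Here are a few suggestions for those building operational database systems: 1) Do not make the primary keys of an operational system smart keys, or at least do not make them keys that can change. 2) Add operator and operation-time columns to the tables of an operational system, or add a timestamp directly. 3) Operational data is best not modified in place on the original value; instead, add new records as reversing ("red ink") entries. Data warehouses built later then need not deal with changing fact table data.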
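
The third suggestion, reversing entries instead of in-place edits, can be sketched as an append-only ledger. The record layout and the helper names are hypothetical.

```python
def correct_amount(ledger, entry_id, new_amount):
    """Append a reversing entry and a corrected entry; never edit in place."""
    original = next(e for e in ledger if e["entry_id"] == entry_id)
    next_id = max(e["entry_id"] for e in ledger) + 1
    ledger.append({"entry_id": next_id, "contract": original["contract"],
                   "amount": -original["amount"], "reverses": entry_id})
    ledger.append({"entry_id": next_id + 1, "contract": original["contract"],
                   "amount": new_amount, "reverses": None})
    return ledger

def contract_total(ledger, contract):
    """Net amount for a contract across all of its entries."""
    return sum(e["amount"] for e in ledger if e["contract"] == contract)

# A contract entered as 500.0 that should have been 450.0.
ledger = [{"entry_id": 1, "contract": "C100", "amount": 500.0, "reverses": None}]
correct_amount(ledger, 1, 450.0)
```

Because existing rows are never modified, an incremental extract only ever has to pick up new rows, and the totals still come out right.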

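Finally, the author hopes that the major database management system vendors will soon provide powerful, easy-to-use change data capture at the DBMS level; Oracle already has this feature. After all, leaving the technically complex work to vendors is the trend, while those of us building applications focus more on the business.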


Dimension Design Best Practices
Good dimension design is the most important aspect of a well designed Analysis Services OLAP database. Although the wizards in Analysis Services do much of the work to get you started, it is important to review the design that is created by the wizard and ensure that the attributes, relationships, and hierarchies correctly reflect the data and match the needs of your end-users.

Do create attribute relationships wherever they exist in the data
Attribute relationships are an important part of dimension design. They help the server optimize storage of data, define referential integrity rules within the dimension, control the presence of member properties, and determine how MDX restrictions on one hierarchy affect the values in another hierarchy. For these reasons, it is important to spend some time defining attribute relationships that accurately reflect relationships in the data.

Avoid creating attributes that will not be used
Attributes add to the complexity and storage requirements of a dimension, and the number of attributes in a dimension can significantly affect performance. This is especially true of attributes that have AttributeHierarchyEnabled set to True. Although SQL Server 2005 Analysis Services can support many attributes in a dimension, having more attributes than are actually used decreases performance unnecessarily and can make the end-user experience more difficult.

It is usually not necessary to create an attribute for every column in a table. Even though the wizards do this by default in Analysis Services 2005, a better design approach is to start with the attributes you know you'll need, and later add more attributes. Adding attributes as you discover they are needed is generally a better practice than adding everything and then removing attributes.

Do not create hierarchies where an attribute of a lower level contains fewer members than an attribute of the level above
A hierarchy such as this is frequently an indication that your levels are in the incorrect order: for example, [City] above [State]. It might also indicate that the key columns of the lower level are missing a column: for example, [Year] above [Quarter Number] instead of [Year] above [Quarter with Year]. Either of these situations will lead to confusion for end-users trying to use and understand the cube.
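
A quick way to spot this problem before building the hierarchy is to count distinct key values per level in the source data. The sketch below is plain Python over hypothetical rows and does not use any Analysis Services API; `check_hierarchy` is an illustrative name.

```python
def check_hierarchy(rows, levels):
    """Return (upper, lower) level pairs where the lower level has fewer members."""
    counts = [len({row[level] for row in rows}) for level in levels]
    problems = []
    for upper, lower, n_upper, n_lower in zip(levels, levels[1:], counts, counts[1:]):
        if n_lower < n_upper:
            problems.append((upper, lower))
    return problems

# Five years of quarters: 5 years, 4 quarter numbers, 20 quarters-with-year.
date_rows = [
    {"Year": year, "QuarterNumber": q, "QuarterWithYear": f"{year}-Q{q}"}
    for year in range(2004, 2009)
    for q in range(1, 5)
]

bad = check_hierarchy(date_rows, ["Year", "QuarterNumber"])
good = check_hierarchy(date_rows, ["Year", "QuarterWithYear"])
```

Here `QuarterNumber` has only four members under five years, which signals the missing key column; `QuarterWithYear` has twenty and passes.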

Do not include more than one non-aggregatable attribute per dimension
Because there is no All member, each non-aggregatable attribute will always have some non-all member selected, even if not specified in a query. Therefore, if you include multiple non-aggregatable attributes in a dimension, the selected attributes will conflict and produce unexpected numbers.

For example, in a time dimension it might not make sense to sum the members of [Calendar Year] or [Fiscal Year], but if both are made non-aggregatable, whenever a user asks for data for a specific [Calendar Year] it will be filtered by the default [Fiscal Year] unless they also specify the [Fiscal Year]. Worse, because [Calendar Year] and [Fiscal Year] do not align but overlap, it is difficult to obtain the full data for either a [Calendar Year] or a [Fiscal Year] because the one is filtered by the other.

Do use key columns that completely and correctly define the uniqueness of the members in an attribute
Usually a single key column is sufficient, but sometimes multiple key columns are necessary to uniquely identify members of an attribute. For example, it is common in time dimensions to have a [Month] attribute include both [Year] and [Month Name] as key columns. This is known as a composite key and identifies January of 1997 as being a different member than January of 1998. When you use [Month] in a time hierarchy that also contains [Year], this distinction between January of 1997 and January of 1998 is important.

It may also make sense to have a separate [Month of Year] attribute that has only [Month Name] as the key. This [Month of Year] attribute contains a single January member that spans all years, which can be useful for comparing seasonal data. However, this attribute should not be used in a hierarchy together with [Year] because there is no relationship between [Month of Year] and [Year].

Similar distinctions between [Quarter] and [Quarter of Year], [Semester] and [Semester of Year], and so on should also be made by setting appropriate key columns.
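
The effect of the key column choice on member uniqueness can be seen by counting distinct keys. This is plain Python over hypothetical rows, not an Analysis Services API; the `members` helper is illustrative.

```python
# Hypothetical rows from a date table.
date_rows = [
    {"Year": 1997, "MonthName": "January"},
    {"Year": 1997, "MonthName": "February"},
    {"Year": 1998, "MonthName": "January"},
]

def members(rows, key_columns):
    """Distinct attribute members as defined by the given key columns."""
    return {tuple(row[c] for c in key_columns) for row in rows}

# [Month] keyed on Year + MonthName: January 1997 and January 1998 differ.
month = members(date_rows, ["Year", "MonthName"])

# [Month of Year] keyed on MonthName only: one January spanning all years.
month_of_year = members(date_rows, ["MonthName"])
```

The composite key yields three members and the single-column key yields two, which is exactly the difference between [Month] and [Month of Year].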

Do perform Process Index after doing a Process Update if the dimension contains flexible AttributeRelationships or a parent-child hierarchy
An aggregation is considered flexible if any attribute included in the aggregation is related, either directly or indirectly, to the key of its dimension through an AttributeRelationship with RelationshipType set to Flexible. Aggregations that include parent-child hierarchies are also considered flexible.

When a dimension is processed by using the Process Update option, any flexible aggregations that the dimension participates in might be dropped, depending on the contents of the new dimension data. These aggregations are not rebuilt by default, so Process Index must then be explicitly performed to rebuild them.

Do use numeric keys for attributes that contain many members (>1 million)
Using a numeric key column instead of a string key column or a composite key will improve the performance of attributes that contain many members. This best practice is based on the same concept as using surrogate keys in relational tables for more efficient indexing. You can specify the numeric surrogate column as the key column and still use a string column as the name column so that the attribute members appear the same to end-users. As a guideline, if the attribute has more than one million members, you should consider using a numeric key.

Do not create redundant attribute relationships
Do not create attribute relationships that are transitively implied by other attribute relationships. The alternative paths created by these redundant attribute relationships can cause problems for the server and are of no benefit to the dimension. For example, if the relationships A->B, B->C, and A->C have been created, A->C is redundant and should be removed.
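
Redundant relationships can be found mechanically by checking, for each relationship, whether its target is still reachable when that relationship is ignored. The sketch below treats relationships as directed edges in plain Python; it is an illustration, not an Analysis Services API, and `redundant_relationships` is a hypothetical name.

```python
def redundant_relationships(edges):
    """Return edges (a, c) for which another path from a to c exists."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    def reachable(start, target, skip_edge):
        # Depth-first search that ignores the edge under test.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return [edge for edge in edges if reachable(edge[0], edge[1], edge)]

chain = [("A", "B"), ("B", "C"), ("A", "C")]
diamond = [("Day", "Month"), ("Month", "Year"), ("Day", "Quarter"), ("Quarter", "Year")]

redundant_in_chain = redundant_relationships(chain)
redundant_in_diamond = redundant_relationships(diamond)
```

The chain reports A->C as redundant, while the Day/Month/Quarter/Year example reports nothing, because two parallel paths with no direct shortcut are not redundant.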

Do include the key columns of snowflake tables joined to nullable foreign keys as attributes that have NullProcessing set to UnknownMember
If tables that are used in a dimension are joined on a foreign key column that might contain nulls, it is important that you include in your design an attribute whose key column is the corresponding key in the lookup table. Without such an attribute, the OLAP server would have to issue a query to join the two tables during dimension processing. This makes processing slower; moreover, the default join that is created by the OLAP server would exclude any rows that contain nulls in the foreign key column. It is important to set the NullProcessing option on the key column of this attribute to UnknownMember. The reason is that, by default, nulls are converted to zeros or blanks when the engine processes attributes. This can be dangerous when you are processing a nullable foreign key. Conversion of a null to zero at best produces an error; in the worst case, the zero may be a legitimate value in the lookup table, thereby producing incorrect results.

To handle nullable foreign keys correctly, you must also set UnknownMember to Visible on the dimension. The Cube Wizard and Dimension Wizard currently set this property automatically; however, the Dimension Wizard lets you manually de-select the key attribute of snowflake tables. You must not deselect the key column if the corresponding foreign key is nullable.

If you do not want to browse the attribute that contains the lookup table key column, you can set AttributeHierarchyVisible to False. However, AttributeHierarchyEnabled must be set to True because it is necessary that all other attributes in the lookup table be directly or indirectly related to the lookup key attribute in order to avoid the automatic creation of new joins during dimension processing.

Do set the RelationshipType property appropriately on AttributeRelationships based on whether the relationships between individual members change over time
The relationships between members of some attributes, such as dates in a given month or the gender of a customer, are not expected to change. Other relationships, such as salespeople in a given region or the marital status of a customer, are more prone to change over time. You should set RelationshipType to Flexible for those relationships that are expected to change and set RelationshipType to Rigid for relationships that are not expected to change.

When you set RelationshipType appropriately, the server can optimize the processing of changes and re-building of aggregations.

By default, the user interface always sets RelationshipType to Flexible.

Avoid using ErrorConfigurations with KeyDuplicate set to IgnoreError on dimensions
When KeyDuplicate is set to IgnoreError, it can be difficult to detect problems with incorrect key columns, incorrectly defined AttributeRelationships, and data consistency issues. Instead of using the IgnoreError option, in most cases it is better to correct your design and clean the data. The IgnoreError option may be useful in prototypes where correctness is less of a concern. Be aware that the default value for KeyDuplicate is IgnoreError. Therefore, it is important to change this value after prototyping is complete to ensure data consistency.

Do define explicit default members for non-aggregatable attributes
By default, the All member is used as the default member for aggregatable attributes. This default works very well for aggregatable attributes, but non-aggregatable attributes have no obvious choice for the server to use as a default member, therefore a member will be selected arbitrarily. This arbitrarily selected member is then selected whenever the attribute is not explicitly included in an MDX query. To avoid this, it is important to explicitly set a default value for each non-aggregatable attribute.

Default members can be explicitly set either on the DimensionAttribute or in the cube script.

Avoid creating user-defined hierarchies that do not have attribute relationships relating each level to the level above
Having attribute relationships between every level in a hierarchy makes the hierarchy strong and enables significant server optimizations.

Avoid creating diamond-shaped attribute relationships
A diamond-shaped relationship refers to a chain of attribute relationships that splits and rejoins but contains no redundant relationships. For example, Day->Month->Year and Day->Quarter->Year have the same start and end points, but do not have any common relationships. The presence of multiple paths can create some ambiguity on the server. If preserving the multiple paths is important, it is strongly recommended that you resolve the ambiguity by creating user hierarchies that contain all the paths.

Consider setting AttributeHierarchyEnabled to False on attributes that have cardinality that closely matches the key attribute
When an attribute contains roughly one value for each distinct value of the key attribute, it usually means that the attribute contains only alternative identification information or secondary details. Such attributes are usually not interesting to pivot or group by. For example, the Social Security number or telephone number may be interesting properties to view, but there is very little value in being able to pivot and group based on SSN or telephone. Setting AttributeHierarchyEnabled to False on such attributes will reduce the complexity of the dimension for end-users and improve its performance.

If you want to be able to browse such attributes, you can set AttributeHierarchyEnabled to True; however, you should consider setting AttributeHierarchyOptimized to NotOptimized and setting GroupingBehavior to DiscourageGrouping. By setting these properties, you can improve performance and indicate to the users that the attribute is not very useful for grouping.

Consider setting AttributeHierarchyVisible to False on the key attribute of parent-child dimensions
Because the members of the key attribute are also contained in the parent-child hierarchy in a more organized manner, it is usually unnecessary and confusing to the end-user to expose the flat list of members contained in the key attribute.

Avoid setting UnknownMember=Hidden
When you suppress unknown members, the effect is to hide relational integrity issues; moreover, because hidden members might contain data, results might appear not to add up. Therefore, we recommend that you avoid use of this setting except in prototype applications.

Do use MOLAP storage mode for dimensions with outline calculations (custom rollups, semi-additive measures, and unary operators)
Dimensions that contain custom rollups or unary operators will perform significantly better using MOLAP storage. The following dimension types will also benefit from using MOLAP storage: an Account dimension in a measure group that contains measures aggregated using ByAccount; the first time dimension in a measure group that contains other semi-additive measures.

Do use a 64 bit server if you have dimensions with more than 10 million members
If a dimension contains more than 10 million members, using an x64 or an IA-64-based server is recommended for better performance.

Do set the OrderBy property for time attributes and other attributes whose natural ordering is not alphabetical
By default, the server orders attribute members alphabetically, by name. This ordering is especially undesirable for time attributes. To obtain the desired ordering, use the OrderBy and OrderByAttributes properties and explicitly specify how you want the members ordered. For time-based attributes, there is frequently a date or numeric key column that can be used to obtain the correct chronological ordering.
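To make the ordering problem concrete, here is a minimal Python sketch (not Analysis Services code). The `months` list and its numeric keys are invented for illustration. It only contrasts sorting by name with sorting by a numeric key column, which is the effect the OrderBy setting described above is meant to achieve.

```python
# Hypothetical month members: (numeric key, member name).
months = [(1, "January"), (2, "February"), (3, "March"), (4, "April")]

# Default behaviour described above: members ordered alphabetically by name.
by_name = sorted(name for _, name in months)

# Desired behaviour: members ordered by the numeric key column.
by_key = [name for _, name in sorted(months)]

print(by_name)  # ['April', 'February', 'January', 'March']
print(by_key)   # ['January', 'February', 'March', 'April']
```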

Do expose a DateTime MemberValue for date attributes
Some clients, such as Excel, will take advantage of the MemberValue property of date members and use the DateTime value that is exposed. When Excel recognizes the value as DateTime, Excel can treat the value as a date type and apply date functions to the value, as well as provide better formatting and filtering. If the key column is a single DateTime column and the name column has not been set, this MemberValue is automatically derived from the key column and no action is necessary. However, in other cases, you can ensure that the MemberValue is DateTime by explicitly specifying the ValueColumn property of the attribute.

Do set AttributeHierarchyEnabled to False, specify a ValueColumn and specify the MimeType of the ValueColumn on attributes that contain images
Because there is no value in browsing the member names of an attribute that contains an image, you should disable browsing by setting AttributeHierarchyEnabled to False. To help clients recognize and display the member property of the attribute as an image, specify the ValueColumn property of the attribute and then set MimeType to an appropriate image type.

Avoid setting IsAggregatable to False on any attribute other than the parent attribute in a parent-child dimension
Non-aggregatable attributes have non-all default members. These default members affect the result of queries whenever the attributes are not explicitly included. Because parent-child hierarchies generally represent the most interesting exploration path in dimensions that contain them, it is best to avoid having non-aggregatable attributes other than the parent attribute.

Do set dimension and attribute Type properties correctly for Time, Account, and Geography dimensions
For time dimensions, it is important to set the dimension and attribute types correctly so that time-related MDX functions and the time intelligence of the Business Intelligence Wizard can work correctly. For Account dimensions, it is similarly important to set appropriate account types when using measures with the aggregate function ByAccount. Geography types are not used by the server, but provide information for client applications.

A common mistake is to set the Type property on a dimension but not on an attribute, or vice-versa. Another common mistake when configuring time dimensions is to confuse the different time attribute types, such as [Month] and [Month of Year].

Consider creating user-defined hierarchies whenever you have a chain of related attributes in a dimension
Chains of related attributes usually represent an interesting navigation path for end-users, and defining hierarchies for these will also provide performance benefits.

Do include all desired attributes of a logical business entity in a single dimension instead of splitting them up over several dimensions
In Analysis Services 2000, each hierarchy was in reality a separate dimension and attributes such as gender and age would also be separate dimensions. In Analysis Services 2005, a dimension can and should contain the complete information about a logical business entity, including multiple hierarchies and many attributes. This does not mean that every piece of information available must be included in the dimension, but rather that any desired information should be included in one dimension instead of split over many dimensions.

There are two exceptions to this guideline:

1. A dimension can only contain one parent-child hierarchy.

2. To model multiple joins to a lookup table within a dimension's schema, you must create a separate dimension based on the lookup table and then use this as a referenced dimension.

Do not combine unrelated business entities into a single dimension
Combining attributes of independent business entities, such as customer and product or warehouse and time, into a single dimension will not only create a confusing model, but also reduce query performance because auto-exist will be applied across attributes within the dimension.

Another way to state this rule is that the values of the key attribute of a dimension should uniquely identify a single business entity and not a combination of entities. Generally this means having a single column key for the key attribute.

Do set NullProcessing to UnknownMember on each attribute that has nulls and is used to join to a referenced dimension
By default, nulls are converted to zeros or blanks when the engine processes attributes. This can be dangerous when processing a nullable foreign key, because if a null is converted to zero when zero is a legitimate value in the reference dimension, the join on the values can produce incorrect results. At best, conversion to zero will produce an error.

To prevent these errors, you must also set UnknownMember to Visible on the referenced dimension.

The Cube Wizard in SQL Server 2005 Analysis Services handles both settings automatically, except when dealing with existing dimensions where UnknownMember is not set to Visible.
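The risk of converting nulls to zero can be illustrated with a small Python sketch. This is not Analysis Services code; the `regions` lookup, the fact rows, and the `totals_by_region` helper are all hypothetical. It shows how a null foreign key silently joins to a legitimate member with key 0, and how routing it to an "Unknown" member keeps the totals honest.

```python
# Hypothetical referenced dimension keyed by region id; 0 is a legitimate key.
regions = {0: "Head Office", 1: "North"}

# Fact rows: (region id, amount). The second row has a null foreign key.
facts = [(1, 10.0), (None, 5.0)]

def totals_by_region(facts, null_to_zero):
    """Join facts to regions, either coercing nulls to 0 or using an Unknown member."""
    totals = {}
    for region_id, amount in facts:
        if region_id is None:
            name = regions[0] if null_to_zero else "Unknown"
        else:
            name = regions[region_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals

coerced = totals_by_region(facts, null_to_zero=True)
unknown = totals_by_region(facts, null_to_zero=False)
print(coerced)  # the null row is wrongly credited to "Head Office"
print(unknown)  # the null row is reported under "Unknown"
```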

Do set NullKeyConvertToUnknown to IgnoreError on the ErrorConfiguration on any measure groups that contain a dimension referenced through a nullable column
By default, nulls are converted to zeros or blanks when the engine processes granularity attributes. This can be dangerous when you are processing a nullable foreign key, because if a null value is converted to zero and zero is a legitimate value in the dimension, the join can produce incorrect results. At best, the conversion will produce errors.

To prevent conversion of nulls, you must also set UnknownMember to Visible on the dimension.

The Cube Wizard in SQL Server 2005 Analysis Services handles these settings automatically, except when dealing with existing dimensions where UnknownMember is not set to Visible.

Consider setting AttributeHierarchyVisible to False for attributes included in user-defined hierarchies
It is usually not necessary to expose an attribute in its own single level hierarchy when that attribute is included in a user-defined hierarchy. This duplication only complicates the end-user experience without providing additional value.

One common case in which it is appropriate to present two views of an attribute is in time dimensions. The ability to browse by [Month] and the ability to browse by [Month-Quarter-Year] are both very valuable. However, these two month attributes are actually separate attributes. The first contains only the month value such as “January” while the second contains the month and the year such as “January 1998”.

Do not use proactive caching settings that put dimensions into ROLAP mode
For performance reasons, we strongly discourage the use of dimension proactive caching settings that may put the dimension in ROLAP mode. To ensure that a dimension with proactive caching enabled will never enter ROLAP mode, you should set the OnlineMode property to OnCacheComplete. You can also prevent use of ROLAP mode by deselecting the Bring online immediately check box in the Storage Options dialog box.

Avoid making an attribute non-aggregatable unless it is at the end of the longest chain of attribute relationships in the dimension
Non-aggregatable attributes have non-all default members that affect the result of queries in which values for those attributes are not explicitly specified. Therefore, you should avoid making an attribute non-aggregatable unless that attribute is regularly used. Because the longest chain of attributes generally represents the most interesting exploration path for users, it is best to avoid having non-aggregatable attributes in other, less interesting chains.

Consider creating at least one user-defined hierarchy in each dimension that does not contain a parent-child hierarchy
Most (but not all) dimensions contain some hierarchical structure to the data which is worth exposing in the cube. Frequently the Cube Wizard or Dimension Wizard will not detect this hierarchy. In these cases, you should define a hierarchy manually.

Do set the InstanceSelection property on attributes to help clients determine the best way to display attributes for member selection
If there are too many members to display in a single list, the client user interface can use other methods, such as filtered lists, to display the members. By setting the InstanceSelection property, you provide a hint to client applications to suggest how a list of items should be displayed, based on the expected number of items in the list.

Surrogate keys should be artificial
In a temporal database, it is necessary to distinguish between the surrogate key and the primary key. Typically, every row would have both a primary key and a surrogate key. The primary key identifies the unique row in the database; the surrogate key identifies the unique entity in the modelled world. These two keys are not the same. For example, the table Staff may contain two rows for "John Smith": one row for his employment between 1990 and 1999, and another for his employment between 2001 and 2006. The surrogate key is identical (non-unique) in both rows, whereas the primary key is unique to each row.
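A minimal Python sketch of the Staff example, following the definitions used in the paragraph above. The field names `row_id` and `staff_key` and the key values are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StaffRow:
    row_id: int      # primary key: unique per row
    staff_key: int   # surrogate key: one value per real-world person
    name: str
    start_year: int
    end_year: int

rows = [
    StaffRow(1, 100, "John Smith", 1990, 1999),
    StaffRow(2, 100, "John Smith", 2001, 2006),
]

def employment_periods(staff_key, rows):
    """Collect every employment period recorded for one person."""
    return sorted((r.start_year, r.end_year) for r in rows if r.staff_key == staff_key)

periods = employment_periods(100, rows)
print(periods)  # [(1990, 1999), (2001, 2006)]
```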

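Microsoft headquarters: Microsoft Marketing data analysis projects, etc.
In my work I have had the chance to deal with massive-data processing problems. Processing such data is an arduous and complex task, for the following reasons:
First, the data volume is too large, and anything can turn up in the data. With 10 records you can simply check each one by hand; with a few hundred it is still feasible; but once the data reaches tens of millions or even hundreds of millions of records, manual work is out of the question and tools or programs are required. In massive data any kind of anomaly may exist: for example, a format problem somewhere in the data lets the program run normally up to that point and then abort.
Second, hardware and software requirements are high, and so is system resource usage. Besides good methods, the most important thing in processing massive data is using tools sensibly and allocating system resources sensibly. Generally, if the data exceeds the TB level, a minicomputer should be considered. An ordinary machine can work with a good method, but its CPU and memory must still be increased. It is like facing a vast army: courage alone, without a single soldier, will hardly win.
Third, demanding processing methods and techniques are required. That is also the purpose of this article. Good processing methods are the accumulation of an engineer's long working experience and a summary of personal experience. There is no universal method, but there are general principles and rules.
So what experience and techniques apply to processing massive data? I list what I know below for reference:
1. Choose an excellent database tool
There are many database vendors today, and massive-data processing places high demands on the database tool used. Oracle or DB2 is generally used; Microsoft's recently released SQL Server 2005 also performs well. In the BI field, tools for databases, data warehouses, multidimensional databases and data mining must also be chosen; a good ETL tool and a good OLAP tool are both essential, for example Informatica and Essbase. In a real data analysis project, processing 60 million log records per day took 6 hours with SQL Server 2000 but only 3 hours with SQL Server 2005.
2. Write good program code
Data processing cannot do without good program code, especially for complex processing, where a program is a must. Good code is crucial: it is not only a matter of processing accuracy but even more a matter of processing efficiency. Good code should include good algorithms, a good processing flow, good efficiency, and good exception handling.
3. Partition massive data
Partitioning massive data is essential. For example, data accessed by year can be partitioned by year. Different databases partition in different ways, but the mechanism is broadly the same. SQL Server, for example, stores different data in different filegroups, and different filegroups on different disk partitions; this spreads the data out, reduces disk I/O and lowers system load. Logs, indexes and so on can also be placed on separate partitions.
4. Build extensive indexes
For massive data, indexing large tables is a must, but the specifics matter. Columns used for grouping and sorting on large tables, for example, should have corresponding indexes, and composite indexes can often be built as well. Be careful when indexing tables that receive frequent inserts. In one ETL flow, I first dropped the index before inserting into a table, rebuilt it once the insert had finished, and then ran the aggregation; before the next insert the index was dropped again. So indexes must be used at the right time, and the fill factor and the choice of clustered versus non-clustered indexes must also be considered.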
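The drop, load, rebuild, aggregate sequence from the ETL flow above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the original work used SQL Server, and the table `log`, the index `idx_log_day`, and the sample rows are hypothetical.

```python
import sqlite3

def bulk_load(conn, rows):
    """Drop the index, insert all rows, then rebuild the index once."""
    cur = conn.cursor()
    cur.execute("DROP INDEX IF EXISTS idx_log_day")
    cur.executemany("INSERT INTO log (day, hits) VALUES (?, ?)", rows)
    cur.execute("CREATE INDEX idx_log_day ON log (day)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (day TEXT, hits INTEGER)")
conn.execute("CREATE INDEX idx_log_day ON log (day)")

bulk_load(conn, [("2008-08-01", 10), ("2008-08-01", 5), ("2008-08-02", 7)])

# Aggregate only after the index has been rebuilt.
totals = conn.execute(
    "SELECT day, SUM(hits) FROM log GROUP BY day ORDER BY day"
).fetchall()
indexes = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()
print(totals)
```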
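5. Set up a caching mechanism
As the data volume grows, ordinary processing tools must take caching into account. How well the cache size is set can decide whether the processing succeeds. For example, when aggregating 200 million records I set the cache to 100,000 records per buffer, which proved workable at that volume.
6. Increase virtual memory
If system resources are limited and out-of-memory errors appear, increasing virtual memory can help. In one project I had to process 1.8 billion records on a machine with 1 GB of memory and one P4 2.4 GHz CPU. Aggregating that much data failed with an out-of-memory error, so I increased virtual memory: I created a 4096 MB virtual-memory area on each of six disk partitions, which raised the total to 4096*6 + 1024 = 25600 MB and resolved the memory shortage.
7. Process in batches
Massive data is hard to process because there is so much of it, so one technique is to reduce the volume. Process the data in batches and then merge the processed results. Tackling the batches one by one keeps each run small and avoids the problems of large volumes. This approach depends on the situation: if the data cannot be split, another way must be found. Data stored by day, month or year, however, can generally be split first and merged afterwards.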
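A minimal Python sketch of the split-then-merge idea described above: aggregate each batch separately, then combine the partial results. The helper names, the batch size, and the sample records are assumptions for illustration.

```python
from collections import Counter
from itertools import islice

def batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def sum_by_key(records, batch_size):
    """Sum values per key one batch at a time, then merge the partial sums."""
    total = Counter()
    for chunk in batches(records, batch_size):
        partial = Counter()
        for key, value in chunk:
            partial[key] += value
        total.update(partial)  # merge step: adds the partial sums into the total
    return total

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = sum_by_key(records, batch_size=2)
print(dict(result))  # {'a': 4, 'b': 7, 'c': 4}
```

The result does not depend on the batch size, which is what makes splitting by day, month or year safe for additive aggregates.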
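8. Use temporary and intermediate tables
As the data volume grows, consider summarizing in advance. The aim is to break the whole into parts: turn a large table into small ones, process them block by block, then merge them by some rule. Temporary tables and saved intermediate results are very important in this process. For ultra-large data, a table too large to process can only be split into several smaller ones. If the processing needs several summarization steps, carry them out one at a time; do not try to do everything in a single statement and swallow it all in one gulp.
9. Optimize SQL queries
When querying massive data, the performance of the SQL statements has a very large impact on query efficiency. Writing efficient SQL scripts and stored procedures is the responsibility of database staff and also a measure of their skill. When writing SQL, it is important to reduce joins, use cursors sparingly or not at all, and design an efficient table structure. I once tried a cursor over 100 million rows; it ran for 3 hours without producing a result, at which point a program had to be used instead.
10. Process data as text
Ordinary data processing can use a database. Complex processing requires a program, and when choosing between a program operating on a database and a program operating on text, one should definitely choose text, because programs process text quickly, processing text is less error-prone, and text storage is not limited. Massive web logs, for example, are usually in text or CSV (also text) format; processing them involves data cleaning, which should be done by a program rather than by importing into a database first.
11. Define strong cleaning rules and error-handling mechanisms
Massive data contains inconsistencies, and flaws are very likely somewhere. For example, a time field in otherwise uniform data may hold non-standard times in some records, caused by application errors, system errors and so on. Strong data-cleaning rules and error handling must therefore be defined when processing the data.
12. Create views or materialized views
The data in a view comes from base tables. For massive data, the data can be spread across several base tables by some rule and queried or processed through a view. This spreads the disk I/O, much like the difference between hanging a pillar from ten ropes and hanging it from one.
13. Avoid 32-bit machines (in extreme cases)
Many computers today are 32-bit, which limits the memory a program can use, while much massive-data processing consumes large amounts of memory. This calls for better-performing machines, and the word size is an important limit.
14. Consider the operating system
Besides the high demands on the database and the processing programs, the operating system also matters. A server is generally required, with high demands on security and stability. The operating system's own caching mechanism and its handling of temporary space need particular consideration.
15. Use data warehouses and multidimensional databases for storage
As data volume grows, OLAP must be considered. A traditional report may take 5 or 6 hours to produce, whereas a cube-based query may take only a few minutes. OLAP multidimensional analysis is therefore a powerful weapon for massive data: build a data warehouse, build multidimensional datasets (cubes), and base reporting and data mining on them.
16. Use sampled data for data mining
Data mining on massive data is gradually emerging. Faced with ultra-large data, mining software and algorithms usually work on samples; the resulting error is not very high, and both efficiency and success rate improve greatly. When sampling, pay attention to the completeness of the data to prevent excessive bias. I once sampled 4 million rows from a table of 120 million rows; the error measured with test software was 0.5% (five per thousand), which the client accepted.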
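The sampling check described above can be sketched in Python as follows. The population here is synthetic and the statistic (the mean) is an assumption; the original project used a 120-million-row table and separate test software, neither of which is reproduced here.

```python
import random

def sample_mean_error(population, sample_size, seed=0):
    """Relative error of the sample mean against the full-population mean."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    true_mean = sum(population) / len(population)
    sample_mean = sum(sample) / len(sample)
    return abs(sample_mean - true_mean) / true_mean

# Synthetic population of 100,000 values; a 10% simple random sample.
population = [i % 1000 for i in range(100_000)]
error = sample_mean_error(population, 10_000)
print(f"relative error of the sample mean: {error:.4%}")
```

Comparing a few such statistics between sample and full data is one simple way to decide whether the sample is acceptable before running a heavier mining algorithm on it.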
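Other methods apply in particular situations, for example using surrogate keys. The benefit is faster aggregation, because aggregating on numeric values is much faster than aggregating on character values. Similar cases need to be handled according to the specific requirements.
Massive data is the trend, and it is increasingly important for data analysis and mining. Extracting useful information from massive data is important and urgent, which demands accurate, high-precision processing, short processing times, and fast delivery of valuable information. Research on massive data therefore has a promising future and deserves broad and deep study.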

Notes on experience in innovative applied data modeling
2007-09-27 20:38:45 Source: Internet
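I have worked in the database and data warehouse field since 1998, nearly eight years now, and have done a good deal of data modeling. I would not dare to call it innovative; this article simply summarizes experience from my work for everyone to discuss and correct.

On data modeling, one point must be stressed first: a data modeler differs considerably from a DBA. For a data modeler, a deep understanding of the business comes first, and the various modeling methods and techniques serve business requirements. This article, however, sets the business aside for now and focuses on modeling methods and techniques.

Current database and data warehouse modeling methods fall into four main categories.

The first is the familiar third-normal-form (3NF) modeling of relational databases, which is normally used to build operational database systems.

The second is the 3NF data warehouse modeling advocated by Inmon, whose emphasis differs somewhat from 3NF modeling of operational systems. Inmon's method has three levels. The first is the entity-relationship level, i.e. the enterprise business data model, where the method is the same as for the enterprise's operational systems. The second is the data item set level, where the method diverges from operational modeling according to factors such as how often data is generated and accessed. The third, the physical level, is the concrete implementation of the second.

The third is the dimensional modeling advocated by Kimball, usually also called star schema modeling, sometimes with some snowflaking added. Dimensional modeling is user-oriented, easy to understand and efficient to query, and it is the approach I prefer.

The fourth is a more flexible approach, usually used in the back-end data staging area. The modeling follows no fixed pattern; the goal is simply to meet the need. The tables built expose no interface to users and are mostly temporary tables.

Below are some notes on the fourth approach.

The defining feature of the data staging area is that it never faces users directly, so only ETL engineers operate on its tables. ETL engineers can decide for themselves the scope and life cycle of the data in those tables. Two examples follow:

1) Temporary tables with a small data scope

When the volume of data to be integrated or cleaned is too large, we can create a temporary table with the same structure and keep in it only the part of the data that needs processing. Both updates and calculations on particular columns then become much more efficient. The processed data is sent to the table that is staged for loading, and finally loaded into the data warehouse in one pass.

2) Temporary tables with redundant columns

Because the tables in the staging area are for our own use only, redundant columns can be very useful without carrying any risk.

For example, in one project I met the following requirement. There was a customer table {customer ID, customer net deduction value} and a debt table {debt ID, customer ID, debt balance, debt net deduction value}; customers and debts are one-to-many. The customer net deduction value and the debt balance are known, and the debt net deduction value must be calculated by allocating the customer's net deduction value in proportion to debt balance. We can add a few redundant columns to both tables: customer table {customer ID, customer net deduction value, customer balance} and debt table {debt ID, customer ID, debt balance, debt net deduction value, customer balance, customer net deduction value}. Three SQL statements then complete the whole calculation: sum the debt balances into the customer balance; copy the customer balance and customer net deduction value into the debt table; and, in the debt table, compute the debt net deduction value directly as (debt balance × customer net deduction value / customer balance).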
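The three-step allocation above can be sketched in Python. This is an illustration only: the original used three SQL statements on staging tables, and the function name, the sample IDs, and the zero-balance fallback are assumptions.

```python
from collections import defaultdict

def allocate_net_deduction(customers, debts):
    """Allocate each customer's net deduction value across debts by balance.

    customers: {customer_id: customer net deduction value}
    debts: list of (debt_id, customer_id, debt balance)
    """
    # Step 1: sum debt balances into a per-customer balance.
    customer_balance = defaultdict(float)
    for _, customer_id, balance in debts:
        customer_balance[customer_id] += balance

    # Steps 2 and 3: carry the customer figures to each debt and apply
    # debt balance * customer net deduction value / customer balance.
    allocated = {}
    for debt_id, customer_id, balance in debts:
        total = customer_balance[customer_id]
        # Assumption: a customer with zero total balance gets zero allocation.
        allocated[debt_id] = balance * customers[customer_id] / total if total else 0.0
    return allocated

customers = {"C1": 100.0}
debts = [("D1", "C1", 300.0), ("D2", "C1", 100.0)]
allocation = allocate_net_deduction(customers, debts)
print(allocation)  # {'D1': 75.0, 'D2': 25.0}
```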

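There are many other table designs you can make use of, such as temporary tables without primary keys. In summary, precisely because the staging area exposes no interface to users, we should take full advantage of this and design its tables for the greatest convenience of our data processing work.

Lessons from industry:

Notes on data warehouse architecture

Different architects follow different principles and methods for data warehouse architecture. I summarize here the commonly used architectures and their pros and cons. They are not tied to any industry and can be used as a reference in all of them.

First, the prevailing view in the data warehouse field is that the warehouse should keep enterprise-wide consistent atomic-level data. The independent data marts architecture has no enterprise-wide consistent data and is likely to create information silos; it is not recommended except in very small enterprises or for a fixed subject. The federated data warehouse architecture, whether federated by geography or by function, requires separate data warehouses to be built on different platforms first and then integrated through reference data. This easily leads to incomplete integration, unless the federated architecture also adopts something like Kimball's bus architecture, i.e. keeping conformed dimensions in the staging area and updating them continuously. These two architectures are therefore not discussed here. The remaining three are discussed below.

1) Third-normal-form (3NF) atomic layer + data marts

The greatest advocate of this architecture is Inmon, the father of data warehousing, and his Corporate Information Factory is its typical representative. This architecture is also called the Enterprise Data Warehouse (EDW). The Corporate Information Factory is implemented by first integrating data across the whole enterprise and building the enterprise information model, i.e. the EDW. Data marts or exploration warehouses are then built for the various analysis needs, sourced from the EDW. A 3NF atomic layer adds some complexity to building OLAP, but gives better support for more complex applications such as mining warehouses and exploration warehouses. This kind of architecture takes longer to build and costs correspondingly more.

2) Star schema atomic layer + HOLAP

The greatest advocate of the star schema is Kimball, and his bus architecture is typical of this class. The bus architecture is implemented by first establishing conformed dimensions and calculation methods for conformed facts in the staging area, and then building data marts step by step on that basis. Each time a data mart is added, the conformed dimensions are integrated in the staging area and the integrated conformed dimensions are synchronized to all data marts. All the data marts together thus form an integrated data warehouse. Because the bus architecture can be built incrementally, its development cycle is shorter than that of other architectures and its cost is lower. Aggregates can be built directly on the star schema atomic layer, and so can HOLAP. I lean toward Kimball's star schema atomic layer and have the most experience with it.

3) 3NF atomic layer + ROLAP

This architecture is also called the centralized architecture. The idea is to build ROLAP directly on the 3NF atomic layer; MicroStrategy does this particularly well. Defining ROLAP on a 3NF atomic layer is much more complex than on a star schema atomic layer. This architecture requires extra effort when defining ROLAP, and the ROLAP metadata is not necessarily in a common format, so presentation on top of ROLAP may be limited by the tool. This architecture closely resembles the first, only without the data marts on top of the atomic layer.

In summary, all three architectures are good choices. Projects that need quick results at low cost can consider the second, bus architecture; enterprises with ample funding and a mature business data model can consider the first or the third.

Difficult application techniques:

Notes on change data capture

A very important purpose of a data warehouse system is to keep the history of data changes, and change data capture (CDC) is a technique created for that purpose. Common CDC methods are: 1) full scan and comparison of files or tables; 2) reading the DBMS log; 3) adding triggers in the source system; 4) using timestamps in the source system; 5) using replication; 6) change data capture provided by the DBMS. Of these, DBMS-provided capture, where the DBMS itself performs the capture, is the clear trend.

In many industries such as banking and telecoms, operational records do not change once created; only information such as customers and products changes slowly over time. CDC therefore usually targets dimension tables, and Kimball's analysis of slowly changing dimensions and the corresponding strategies can handle essentially every kind of dimension change.

In some retail businesses, however, values such as the contract amount in a contract table may change after entry; that is, fact table data can change too. Changes to fact data are usually corrections and can be handled like a Type 1 slowly changing dimension by updating the fact table directly. I do not favor handling fact changes by inserting a new correction record as a snapshot, because that causes too much trouble for downstream presentation and analysis programs.

Next is a rather thorny fact-table change problem I once met: the primary key of the fact table changed when certain data in the table changed. Take one contract table as an example. Its primary key was an intelligent key generated from "supplier number" + "contract number"; whenever either of them changed, the primary key changed too, which made change data capture very difficult.

In that project I used triggers to implement change data capture, as follows:

1) Create a new table to hold the captured data, with columns "original primary key", "modified primary key" and any other columns needed; call it the "contract capture table".

2) Create triggers on Delete and on Update of the original contract table. On a delete, the Delete trigger inserts a record into the contract capture table with "modified primary key" empty, marking the record as deleted. On an update, "original primary key", "modified primary key" and the other required columns are saved to the contract capture table, marking the record as modified; if the two keys differ, the primary key was changed, and if they are the same, it was not.

3) Because the primary key in the operational system usually becomes a degenerate dimension in the warehouse fact table, it may still act as a primary key. During loading, the data in the contract capture table must therefore be examined case by case to decide whether to update the degenerate dimension in the fact table.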
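The case analysis in steps 2) and 3) can be sketched in Python. This is a hedged illustration, not the original trigger or load code: the function name, the label strings, and the sample keys are hypothetical, and an empty "modified primary key" is represented as `None`.

```python
def classify_capture(original_key, modified_key):
    """Classify one row of the contract capture table."""
    if modified_key is None:
        return "deleted"        # Delete trigger leaves the modified key empty
    if modified_key != original_key:
        return "key_changed"    # degenerate dimension in the fact table must be updated
    return "updated"            # key unchanged; only other columns changed

# Hypothetical captured rows: (original primary key, modified primary key).
captured = [
    ("S01-C100", None),
    ("S01-C101", "S02-C101"),
    ("S01-C102", "S01-C102"),
]
actions = [classify_capture(old, new) for old, new in captured]
print(actions)  # ['deleted', 'key_changed', 'updated']
```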

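It is fair to say that trigger-based change data capture is not a very good choice. First, it requires extensive privileges on the source system; second, triggers have a large impact on source system performance. Unless there is no alternative, this method is not recommended.

Such situations can be avoided entirely when the operational database system is built. Some suggestions for builders of operational systems: 1) Do not make primary keys intelligent, or at least do not make them changeable. 2) Add operator and operation-time columns to tables, or add a timestamp directly. 3) Preferably do not modify operational data in place; instead add new records as reversing ("red-ink") entries. The data warehouse built later then need not deal with changes to fact data.

Finally, I look forward to the major DBMS vendors providing powerful, easy-to-use change data capture at the DBMS layer as soon as possible; Oracle already has this feature. Leaving technically complex matters to vendors is the trend, while those of us building applications focus more on the business.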

Tuesday, August 26, 2008

shame

"rebel state"

Why are Kosovo, Bosnia and Herzegovina not "REBEL STATES"???

SHAME BBC. Shame western media this time

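Nine steps to becoming a master speaker

Speaker: a salesperson at an IT consulting firm
Audience: potential executive-level clients
Goal: describe your company to them
Joe Friday (乔•弗莱德) style
As experts in systems integration technology, we can help you improve operational efficiency. We will use our mature portfolio of new technologies to help you lower your existing debt-to-asset ratio and bring you higher returns, while helping you strengthen team spirit. We firmly believe we are the partner who can take you to the next level, and we will help you succeed with skilled tactics and strategy. Our world-class professionals make extensive use of the best technical methods in the industry.
Storytelling style
Let me tell you the story of Allison, one of our senior managers; it is the best way for you to get to know our company. Just last week she met a client from the West Coast. The company had just completed a merger, and the systems of the two original companies looked very hard to reconcile, so it seemed the old systems would have to be scrapped and replaced entirely. Allison's advice was: "Don't buy a single new computer yet." After several days observing and analyzing on site, she found that with some specialized client software the two existing systems could work together perfectly well. The client ended up spending only $600,000 on all of it, against the $3 million originally planned. The company's president could hardly believe his ears. He was so delighted with the solution that he even invited Allison and her family to spend a week at his mountain lodge this summer.
We have many more stories like this. If you choose us, you choose outstanding professionals like Allison to serve you. We understand deeply that for your company to achieve good results every quarter, your computer systems must serve your customers and your business processes well. We look forward to working with you.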

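This has to stop. I will no longer tolerate this careless attitude toward workplace safety. From this moment on, if I find you not following the safety rules, I will send you away immediately and stop paying you. I don't care how busy we are: either these accident reports come back down to a level we can accept, or we will see what happens.
Storytelling style
Look around. There are safety signs everywhere here, but I am sure they have not really worked. If we don't follow the safety rules, we are gambling with our lives. Frank (gesturing to Frank), I know you coach that baseball team. If you broke your wrist, and I mean if, throwing the ball to an infielder would be very hard. Michael (gesturing to Michael), I know you and your wife like to go fishing at weekends. If you were limping around with a bandaged foot, you couldn't row your boat freely around the lake, control your float and catch fish. Once you get hurt, it means I have to call the people you love and tell them to come to the clinic or hospital to see you; tell them you are ill and you need them. I hate making those calls. I hate bringing bad news to your families. From today I will not do it again. Because if I see any of you not following the safety precautions we all agreed on, I will send you home at once. Then you can explain to your family yourself why you lost your job. Our safety record will be rewritten starting today, because I want you to get home safely tonight and every night.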

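This is simply a disaster. We need to expand our service facilities by 50%. According to our facilities manager, that will cost $400,000. We need to expand the service facilities right away so that we can fulfill our mission.
Storytelling style
It is getting harder and harder to fulfill our organization's mission. I know this because last month I hardly saw Ronald. When he finally turned up, I asked where he had been. He said that last time he queued for an hour before he got a meal here, and that foraging in rubbish bins takes less time than waiting here. The sad thing is, he also said he would actually rather be with us. I want him here too, and I know you do as well. This is a safe, decent environment, not a slum. We should give Ronald at least one chance every day to enjoy our welcome and respect.
But Ronald says he is afraid of queuing for more than an hour only to find there is no food left when his turn comes. For all of us here, learning that Ronald would rather search the rubbish bins than be with us is very hard to hear.
Ronald is the person we are here to serve, yet we lack the capacity to serve him and others well. We need more space for our preparation work, more stoves, more serving windows, and more clean, safe places for people like Ronald. I want to see Ronald having dinner with us, not searching through rubbish. He deserves better service, as do all our other guests. We need to spend $400,000, and I will tell you why that is a sound investment in improving our facilities and our community.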

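fwd: SPSS Clementine Scripts basic syntax

A good, fairly deep book (Yuan): 实用数据挖掘 (Applied Data Mining), by the Italian author P. Giudici

Running multiple branches at once is less efficient than enabling a cache and then running the branches one by one

If you use CSV, remember to set the storage to STRING or REAL as appropriate

Question: how do you refer to a node created by a script? Use ^yourname

Preferably do not give nodes duplicate names; otherwise a script cannot delete them (error: ambiguous nodes). The following may help: #Gets a reference to an existing node. This can be a useful way to ensure non-ambiguous references to nodes.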
var mynode
set mynode = get node flag1:derivenode
position ^mynode at 400 400

Distribution node: normalize by color makes the bars equal length

However, blanks (user-defined missing values) do contribute to the aggregate summaries, and these values should be replaced with $null$

We will also sort the data to make the data aggregation more efficient??

Distinct does not work well on large datasets. You can try an alternative: sort the data on the key fields, and then use the CLEM expression @OFFSET with a Select node to select (or discard) the first distinct record from each group
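The sort-then-compare alternative to Distinct can be sketched in Python. This is not Clementine code: comparing each record's key with the previous record's key stands in for what an @OFFSET expression in a Select node would do, and the function name and sample records are hypothetical.

```python
def first_per_key(records, key):
    """Sort by key, then keep a record only when its key differs from the previous one."""
    ordered = sorted(records, key=key)   # stable sort keeps original order within a key
    kept = []
    previous = object()                  # sentinel that equals no real key
    for record in ordered:
        current = key(record)
        if current != previous:
            kept.append(record)
        previous = current
    return kept

records = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
distinct = first_per_key(records, key=lambda r: r[0])
print(distinct)  # [('a', 2), ('b', 1)]
```

Only the previous key has to be remembered, so memory use stays constant once the data is sorted, which is the reason this approach suits large datasets.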

If the Keys are contiguous check box is selected, values for the key fields will only be treated as equal if they occur in adjacent records

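Columns with too much missing data should be deleted.

Whether something counts as an outlier depends on the purpose of the analysis: looking for wealthy people versus ordinary customers (where outliers are unwelcome). Categorical fields usually have no outliers or anomalies.

If a category value occurs too rarely, it has no statistical significance (e.g. running to work): infrequent behaviour.

Anomaly detection looks for abnormal records across many columns;
an outlier may also arise from the combination of two columns, e.g. low income with high luxury spending.

Linear regression can be affected by a small proportion of outliers, whereas decision trees and the like are less sensitive; use a histogram overlay or the mean to see the effect on the distribution. Demographic data is usually not used for clustering itself but for validating the clusters. If highly correlated columns are used together, they effectively receive unnecessary extra weight.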

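SPSS Clementine Scripts basic syntax
1. Author: bolow (cnjm)

2. Most of this article was compiled by the author in spare time from the official SPSS documentation.
* It is divided into sections by syntax topic; most are illustrations of syntax elements with little explanation.
* When reading, you may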
Standard script: stored in a file

BE AWARE of ^ (variable reference, but not very useful?) and \ (line continuation character; possibly also used to escape embedded quotes)
If you use quotation marks within a CLEM expression, make sure that each quotation mark is preceded by a backslash (\)—for example:

set :node.parameter = "BP = \"HIGH\""

Script Syntax section: Variable names, such as ^mystream, are preceded with a caret (^) symbol when referencing an existing variable whose value has already been set. The caret is not used when declaring or setting the value of the variable. See Referencing Nodes for more information

• You can specify nodes by name--for example, DRUG1n. You can qualify the name by type--for example, Drug:neuralnetnode refers to a Neural Net node named Drug and not to any other kind of node.

• You can specify nodes by type only—for example, :neuralnetnode refers to all Neural Net nodes. Any valid node type can be used—for example, samplenode, neuralnetnode, and kmeansnode. The node suffix is optional and can be omitted, but including it is recommended because it makes identifying errors in scripts easier.

• You can reference each node by its unique ID as displayed on the Annotations tab for each node. Use an "@" symbol followed by the id, for example @id5E5GJK23L.custom_name = "My Node". See Annotating Nodes and Streams for more information.

@MEAN(BALANCE, 5): mean over the last 5 records that have flowed through; @SUM(field): sum over all records
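A Python sketch of what a windowed sequence function such as @MEAN(BALANCE, 5) computes as records flow past. This is an illustration, not Clementine code; the assumption that the mean is taken over however many records have arrived when fewer than 5 are available is mine.

```python
from collections import deque

def running_mean(values, window):
    """Mean over the last `window` values seen so far, current value included."""
    recent = deque(maxlen=window)   # old values fall off automatically
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means

balances = [10, 20, 30, 40, 50, 60]
means = running_mean(balances, window=5)
print(means)  # [10.0, 15.0, 20.0, 25.0, 30.0, 40.0]
```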

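I. Variables and types

1 Field names and variable names begin with a letter and may contain letters, digits and underscores.
Names that do not follow this rule must be enclosed in single quotes

2 Data type notation
String --"c1", "Type 2", "a piece of free text"
Integer --12, 0, -189
Real --12.34, 0.0, -0.0045
Date/time --05/12/2002, 12/05/2002, 12/05/02
Character --`a` or 3
List --[1 2 3], ['Type 1' 'Type 2']

3 Quoting rules
Strings --double quotes are preferred. Single quotes also work but can sometimes be confused with field names
Characters --use the backquote ` (the key below ESC)
a number can also be used
so can an index into a string, e.g. lowertoupper("druga"(5)) -> "A"
Field names --usually need no quotes, but must be enclosed in single quotes if they contain spaces or other special characters
if an undefined field name is quoted, it may be treated as a string
Parameter names --must use single quotes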

II. Syntax

1 Operator precedence order
function arguments
function calls
**
* / mod div rem
+ -
> < >= <= /== == = /=

2 Control structures
a if..then..else
if EXPR then STATEMENTS 1
else STATEMENTS 2
endif

Example
if ^param = 24 then
create derivenode
else exit 2
endif

b for loops
× for PARAMETER in LIST
STATEMENTS
endfor

× for PARAMETER from N to M
STATEMENTS
endfor

× for PARAMETER in_models
STATEMENTS
endfor
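Iterates over the models on the generated models palette; each model's name is passed to PARAMETER

× for PARAMETER in_fields_at NODE
STATEMENTS
endfor
Operates on each field downstream of NODE (the fields at the node)

× for PARAMETER in_fields_to NODE
STATEMENTS
endfor
Operates on each field upstream of NODE (the fields coming into the node)

× exit
for PARAMETER in_streams
STATEMENTS
endfor
Iterates over the currently open streams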

3 Assignment examples
× set :balancenode.directives = [{1.3 "Age > 60"}]
set :fillernode.condition = "(Age > 60) and (BP = \"High\")"
set :derivenode.formula_expr = "substring(5, 1, Drug)"
set Flag:derivenode.flag_expr = "Drug = X"
set :selectnode.condition = "Age >= '$P-cutoff'"
set :derivenode.formula_expr = "Age - GLOBAL_MEAN(Age)"
set nodename.tablename="mytable"
set :databasenode.table="atablename"
set my_node = get node :plotnode
set :samplenode {
max_size = 200
mode = "Include"
sample_type = "First"
}
The full form is: set nodename:NODETYPE.prop=value
In a standalone script, node references must be prefixed with ^

× Setting SuperNode parameters
set mySuperNode.parameters.minvalue = 30
set :process_supernode.parameters.minvalue = 30
set :process_supernode.parameters.minvalue = ""
set mySuperNode:process_supernode.parameters.minvalue = 30
set mySuperNode.parameters.'Data_subset:samplenode.rand_pct' = 50
set :source_supernode.parameters.'Data_subset:samplenode.rand_pct'= 50
When defining a SuperNode parameter, the short name must be used

4 Setting the position of a node icon
position nodename at 450 50

5 Executing a node
execute :exe_node_name

6 Creating a node and a stream
× Creating nodes
var x
set x = create typenode
rename ^x as "mytypenode"
position ^x at 200 200
var y
set y = create varfilenode
rename ^y as "mydatasource"
position ^y at 100 200
connect ^y to ^x

set node = create typenode
rename ^node as "mytypenode"
position ^node at 200 200
set node = create varfilenode
rename ^node as "mydatasource"
position ^node at 100 200
connect mydatasource to mytypenode

× Creating a stream
create STREAM DEFAULT_FILENAME

7 Accessing output results
×value RESULT at ROW COLUMN

×set num_rows = :tablenode.output.row_count

×set table_data = :tablenode.output
set last_value = value table_data at num_rows num_cols

8 File operations
× Opening a file
open MODE FILENAME
MODE create/append

× Closing a file
close FILE

× Example
set file = open create 'C:/Data/script.out'
for I from 1 to 3
write file 'Stream ' >< I
endfor
close file

9 Connecting nodes
create tablenode
create variablefilenode
connect :variablefilenode to :tablenode
set :variablefilenode.full_filename = "C:\Program Files\Clementine\8.1\demos\DRUG1n"
execute 'Table'
set param = value :tablenode.output at 1 1
if ^param = 24 then
create derivenode
else exit 2
endif

Using CLEM expressions in scripts:
You can use CLEM expressions, functions, and operators within Clementine scripts; however, your scripting expression cannot contain calls to any @ functions, date/time functions, and bitwise operations. Additionally, the following rules apply to CLEM expressions in scripting:

• Parameters must be specified in single quotes and with the $P- prefix.

• CLEM expressions must be enclosed in quotes. If the CLEM expression itself contains quoted strings or quoted field names, the embedded quotes must be preceded by a backslash (\). See Scripting Syntax for more information.

You can use global values, such as GLOBAL_MEAN(Age), in scripting; however, you cannot use the @GLOBAL function itself within the scripting environment.

Examples of CLEM expressions used in scripting are:

set :balancenode.directives = [{1.3 "Age > 60"}]
set :fillernode.condition = "(Age > 60) and (BP = \"High\")"
set :derivenode.formula_expr = "substring(5, 1, Drug)"
set Flag:derivenode.flag_expr = "Drug = X"
set :selectnode.condition = "Age >= '$P-cutoff'"
set :derivenode.formula_expr = "Age - GLOBAL_MEAN(Age)"

Node properties are documented here: Scripting, automation, and CEMI
Properties Reference

derivenode Properties
The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, set, state, count, and conditional. See Derive Node for more information.
derivenode property   Data type                                           Property description
new_name              string                                              Name of new field. See the example below for usage.
mode                  Single | Multiple                                   Specifies single or multiple fields.
fields                [field field field]                                 Used in Multiple mode only to select multiple fields.
name_extension        string                                              Specifies the extension for the new field name(s).
add_as                Suffix | Prefix                                     Adds the extension as a prefix (at the beginning) or as a suffix (at the end) of the field name.
result_type           Formula | Flag | Set | State | Count | Conditional  The six types of new fields that you can create.
formula_expr          string                                              Expression for calculating a new field value in a Derive node.
flag_expr             string
flag_true             string
flag_false            string
set_default           string
set_value_cond

Special characters: Literal text blocks that include spaces, tabs, and line breaks can be included in scripts by setting them off in triple quotes. Any text within the quoted block is preserved as literal text, including spaces, line breaks, and embedded single and double quotes. No line continuation or escape characters are needed.

A CLEM expression used in a script must be enclosed in double quotes: set :fillernode.condition = "(Age > 60) and (BP = \"High\")"

Kohonen: the default 7x10 grid gives too many clusters; 3x4 is better. Settings: exponential decay tends to produce one especially large cluster; on a small grid, Neighborhood (Phase 1) should not be larger than Neighborhood (Phase 2). The number of clusters found tends to approach the specified upper limit.

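Most of the time when using Apriori or GRI, we will either not define data as blanks, or we will define data as blanks and then remove the records with blank values. Either one of these approaches will lead to no confusion in the association rules created. In short: either do not define blanks, or, if you do, remove the records that contain them. Apriori treats defined missing values as valid values. GRI finds rules involving blanks but excludes the blanks when computing rule statistics, so it produces rules with confidence = 0. CARMA is largely unaffected (it drops blank values normally).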

The Carma node works only with fields of storage type string; moreover, it uses only true values and does not recognize false.
Override…True from the context menu
Right-click again and select Set Storage…String

Carma allows the user to specify rule support, leading to simpler rules.
Carma has the limitation that, with tabular data, the fields must be of type flag (and storage type string); if the data are changed to transactional format, Carma can use fields of type set. To obtain negative rules, swap Carma's true and false values.
is decreased to the value of Neighborhood (Phase 2) +1. By default, this value is equivalent to that for Neighborhood (Phase 1), so no decrease occurs in phase 1

In small grids, the value of Neighborhood (Phase 1) shouldn’t be set higher than the default of Neighborhood (Phase 2); otherwise the whole grid will be affected.
• Don't change the Cycles settings unless you get an odd-looking solution; they should be large enough for almost all circumstances, except for unusual data or a large number of clusters.
• The Initial Eta settings are the most likely place to begin to modify a network, and you would normally set Initial Eta (Phase 1) a bit higher, and perhaps Initial Eta (Phase 2) as well.
• If you expect to find one dominant cluster, leave the Learning decay rate as Exponential. If not, you can try using a linear decay.

In the Kohonen node, blanks are handled by substituting “neutral” values for the missing ones. For range and flag fields with missing values (blanks and nulls), the missing value is replaced with 0.5 (for range fields, this is done after the original values have been transformed into a 0-1 range). Range field values below the lower bound (in the last Type node) will be set to the lower bound and values above the upper bound will be set to the upper bound value. For set fields, the derived indicator field values are all set to 0.0. This is the same missing data handling as we found in the K-Means node.
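The range-field handling described above can be sketched in Python. This is only an illustration of the stated rules (clamp to the bounds, rescale to 0-1, substitute 0.5 for missing values); the function name and the example bounds are hypothetical, and missing values are represented as `None`.

```python
def encode_range(value, lower, upper):
    """Scale a range field to 0-1, clamping out-of-bounds values; missing becomes 0.5."""
    if value is None:
        return 0.5                             # neutral value for blanks and nulls
    clamped = min(max(value, lower), upper)    # values outside the bounds are pulled in
    return (clamped - lower) / (upper - lower)

# Hypothetical Age field with bounds 0 and 100 taken from the last Type node.
encoded = [encode_range(v, 0, 100) for v in (25, None, 150, -5)]
print(encoded)  # [0.25, 0.5, 1.0, 0.0]
```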

K-Means: The Encoding value for sets can be set between .0001 and 1.0, inclusive. Values below .70711 will decrease the importance of set and flag fields, while values above that will do the reverse.
Flag fields are not encoded in this manner, as doing so may distort the distances between records. They are given values of 0 and 1.
If symbolic fields are included in a clustering solution, we recommend leaving the encoding value at its default setting unless you have a good reason to change the influence of these fields.

solution compared to numeric fields. Accordingly, the value of .70711 is used instead (the square root of 1/2).
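One reading of why the default is the square root of 1/2 (an interpretation, not something the notes above state in full): a set field is expanded into one indicator column per category, so two records in different categories differ in two columns at once. With an encoding value of 1.0 their Euclidean distance would be the square root of 2, more than the maximum distance of 1 on a 0-1 numeric field. The Python sketch below checks that an encoding value of .70711 brings that distance back to 1; the two-category example is hypothetical.

```python
import math

def set_field_distance(encoding):
    """Euclidean distance between two records that fall in different set categories."""
    record_a = [encoding, 0.0]   # indicator columns for category A, category B
    record_b = [0.0, encoding]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(record_a, record_b)))

default_encoding = math.sqrt(0.5)
print(round(default_encoding, 5))                        # 0.70711
print(round(set_field_distance(1.0), 5))                 # 1.41421
print(round(set_field_distance(default_encoding), 5))    # 1.0
```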

TwoStep's assumptions
It should be noted that the likelihood function assumes that numeric predictors follow normal distributions and the symbolic predictors follow multinomial distributions; the former assumption is not. TwoStep also does not support missing data: records with blank, null, or missing values are removed.
All three clustering methods depend on the input order of the records.

Monday, August 25, 2008

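Reject decorative renovation

What if we produced high-quality flat surfaces (wallpaper and the like) to replace renovation work, and saved up the resources to drive social progress?

"Go to Sohu, watch the Olympics": this is the heading on Sohu's home page for October 28... How sad. A company that pays no attention to detail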

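Gao Jianhua: 《不战而胜》 (Winning Without Fighting)

Strategy is doing the right things.

Tactics is doing things right.

Sales funnel: an effective management tool (for estimating sales volume and managing sales channels and customers).

A salesperson's responsibilities: sales volume, payment collection, profit, market, customer satisfaction, and the sales funnel.

Do not invest or expand blindly; first work out where your own advantage lies (product, price, niche market).

How can a first mover avoid being sacrificed, or losing control of the market and sliding into disorderly competition?

1. Either burn the bridge behind you: set up barriers and monopolize as quickly as possible.

2. Or use licensing and cooperation (technology transfer) to grow the pie and your own influence as quickly as possible.

I once asked Microsoft this question, and the company I worked for went through it too. All I can say is that you have to widen your technical lead to hold your ground.

Quick windfall profits (small investment, fast returns, and so on) only attract a crowd of competitors.

The money earned can be used for defense. Good companies spend it on market research, technology development and staff training; Chinese companies tend to spend it on advertising, and run out of steam once the advertising stops.

Why has Chinese industry been stuck in cut-throat competition and duplicated investment for 20 years (home appliances, the move into PCs)? 1. No competitive advantage. 2. No grasp of the first-mover strategies above. 3. No grasp of market segmentation and of finding the right target customers.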

Tuesday, August 19, 2008

Leading zeros are important in programming

Everything seemed to work in our AJAX code: we tested with some user IDs and it worked. However, when our customers logged in, they got an error indicating that the correct flag, which IS in the database, was not being picked up.

It was neither an http/https difference nor AJAX failing.

When I wrote the code, the sMU_id returned from the database was a varchar2, so it can be 01112222. However, when it was assigned to mu_id, the leading 0 (if there is one) was lost, because the value was automatically typed as a number!

Because of this automatic typing in ASP, the assigned string (which may start with 0) became a number and lost its 0; when the value went back to the database in a SQL query, nothing was found. All the IDs we tested with started with a non-zero digit, so we never noticed the problem.

Response.Write("var mu_id; mu_id="" "";mu_id="""&sMU_id & """;"& vbCrLf )
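The same pitfall and the same fix can be shown outside ASP. The sketch below is in Python rather than VBScript and is only an illustration: `to_js_assignment` is a hypothetical helper, and `json.dumps` is used simply because it produces a correctly quoted string literal.

```python
import json

def to_js_assignment(name: str, value: str) -> str:
    """Emit a JavaScript assignment that keeps the value a quoted string."""
    return f"var {name} = {json.dumps(value)};"

db_id = "01112222"  # varchar ID as returned by the database

# Treating the ID as a number silently drops the leading zero.
lost = str(int(db_id))

# Quoting the value keeps every character, so a later SQL lookup still matches.
kept = to_js_assignment("mu_id", db_id)
```

Here `lost` is `"1112222"`, which no longer matches the stored ID, while `kept` is `var mu_id = "01112222";`. IDs are identifiers, not quantities, so they are best kept as strings end to end.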

Thursday, August 14, 2008

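fwd: IDMER's recommended statistics books

http://blogger.org.cn/blog/more.asp?name=idmer&id=38752

Gao Jianhua (《不战而胜》): this book is grounded in China's actual conditions, with real analysis and insight. Not bad.

The main content comes from http://www.cos.name/bbs/read.php?tid=5362

Recommended reading in statistics

I. Foundations of statistics
1. 《统计学》 (Statistics), David Freedman et al., translated by Wei Zongshu, Shi Xiquan et al., China Statistics Press. Said to be the best book on statistical thinking. I read some chapters and gained a lot. The whole book has almost no formulas, yet it conveys the essence of statistical thinking.
2. Mind on Statistics (English edition), China Machine Press. Needs only high-school mathematics; a primer for statistical beginners. One sentence left a deep impression: "Mathematics as to statistics is something like hammer, nails, wood as to a house, it's just the material and tools but not the house itself."
3. Mathematical Statistics and Data Analysis (English edition, 2nd edition), China Machine Press. You notice at once how different it is from domestic mathematical statistics books. The philosophy is very good; it covers a lot of new material and teaches the very popular bootstrap method alongside traditional statistics. There are reviews on Amazon.
4. Business Statistics: A Decision Making Approach (reprint edition), China Statistics Press. Very useful in practice, though teachers of mathematical statistics often look down on it.
5. Understanding Statistics in the Behavioral Sciences (reprint edition), China Statistics Press. From the same series as the previous one. Foreign authors' books are all quite interesting.
6. 《探索性数据分析》 (Exploratory Data Analysis), China Statistics Press. From the same series as the first book. Read the preface by the late Chen Xiru carefully; it is a reflection on mathematical statistics in China.
7. 《数理统计引论》 (Introduction to Mathematical Statistics), by Chen Xiru, Science Press; also 《数理统计学简史》 (A Brief History of Mathematical Statistics), by Chen Xiru.
8. 《概率论与数理统计教程》 (A Course in Probability and Mathematical Statistics), by Wei Zongshu.

II. Regression
1. 《应用线性回归》 (Applied Linear Regression), China Statistics Press. Another title from the well-known blue-cover series. It has some depth and explains the reasoning thoroughly. The explanation of partial regression coefficients is a real eye-opener. An excellent book.
2. Regression Analysis by Example (3rd edition, reprint). The first original-language statistics book I read from cover to cover; a great read. The chapter on dummy variables is more gripping than a novel. There are hardly any derivations; it even says "assume you have statistical software that can compute the results". It is mainly about analysis: how to read the plots and how to read the output. Only after finishing it did I find regression really fun.
3. Logistic Regression Models: Methods and Applications (《Logistic回归模型》), by Wang Jichuan and Guo Zhigang, Higher Education Press. One of the few classic domestic statistics textbooks. Both authors come from sociology and stress application over derivation. Every chapter has detailed SAS and SPSS programs with analysis of the output. The book is written in Chinese but clearly in the style of a foreign textbook.

III. Multivariate analysis
0. 《多元统计分析引论》 (Introduction to Multivariate Statistical Analysis), by Zhang Yaoting and Fang Kaitai, Science Press.
1. 《应用多元分析（第二版）》 (Applied Multivariate Analysis, 2nd edition), by Wang Xuemin, Shanghai University of Finance and Economics Press. This seems to be the textbook in use now. Note that its strength is not the derivations but the later parts that tie in with SAS, and some of its ideas (for example, p. 99 on the effect of n on hypothesis testing, which gives a real feel for statistics that pushing formulas around cannot). A very good domestic multivariate textbook.
2. Analyzing Multivariate Data (English edition), by Lattin et al., China Machine Press. Full of intuition and explanation, and very interesting. It demands little mathematics and the proofs are not great, but it is truly a "statistics book", not a mathematics book.
3. Applied Multivariate Statistical Analysis (5th edition, reprint), by Johnson & Wichern, China Statistics Press. In my view the best multivariate statistics book you can buy in China. Amazon reviewers rate it highly. According to Wang Xuemin, though, some of its proofs are still not very clear: foreign authors are fine on practice but not so good on proofs.

IV. Time series
1. 《商务和经济预测中的时间序列模型》 (Time Series Models for Business and Economic Forecasting), by Franses. A five-star recommendation on Amazon; it covers many new things and is very practical. Only after reading it did I learn that time series is more than AR(1) and MA(1).
2. Forecasting and Time Series: An Applied Approach (3rd edition), by Bowerman & O'Connell. Mainly covers the Box-Jenkins (ARIMA) method, with SAS and Minitab programs.

V. Sampling
1. 《抽样技术》 (Sampling Techniques), by Cochran, translated by Zhang Yaoting. Without question the most authoritative classic in the field. Wang Xuemin says the book is not easy to understand: mathematics students may follow every formula and still miss its meaning (and if you are not a mathematics student, better not to read it).
2. Sampling: Design and Analysis (reprint edition), by Lohr, China Statistics Press. Discusses many new methods: non-response, non-sampling error, resampling. Also very hard going. I read it alongside Advanced Microeconomic Theory, which many people consider a nightmare, yet compared with this book it was much easier. The main gap is conceptual: our statistical thinking and feel for data need strengthening.

VI. Software and other topics
1. 《SAS软件与应用统计分析》 (SAS Software and Applied Statistical Analysis), edited by Wang Jili and Zhang Yaoting. A great book!
2. 《SAS V8基础教程》 (SAS V8 Basic Tutorial), edited by Wang Jiagang, China Statistics Press. Mainly about programming, with little statistics. Worth considering if you want to strengthen your SAS programming.
3. 《SPSS11统计分析教程（基础篇）（高级篇）》 (SPSS 11 Statistical Analysis Tutorial, basic and advanced volumes), by Zhang Wentong, Beijing Hope Press. When I first read it I could hardly understand any of it, especially the advanced volume; now I finally have it figured out :)
4. 《金融市场的统计分析》 (Statistical Analysis of Financial Markets), by Zhang Yaoting, Guangxi Normal University Press. Zhang is a true master: a thin, concise book that explains all the main financial models clearly. After reading it you realize that analysing finance with mathematical models alone is armchair theory; it takes statistical models and methods to apply them. The multivariate statistics (algebra) used in the book is fairly deep.

Other
Common Errors in Statistics (and How to Avoid Them)
Good P.I., Hardin J.W.
John Wiley & Sons; 2003; 240 pp.; ISBN: 0471460680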

Wednesday, August 13, 2008

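Must-read classic marketing e-books for marketers, shared (repost)

Don't extend your brand name into a new product line (use a new brand instead); doing so dilutes the strong link between the brand and its power product.

Brand extension is like alcohol: a short-term stimulant and a long-term depressant.

Positioning (《定位》), The New Positioning (《新定位》), The Book of Five Rings (《五轮书》), On War (《战争论》), Jack Trout's Marketing Warfare (《营销战》), The Prince (《君王论》), Gao Jianhua's 《不战而胜》, 《鬼谷子全书》, 《非常营销》, Kotler's Marketing Management (《营销管理》), Lu Changquan's 《切割销售》 ...

Innovation and Entrepreneurship (Peter Drucker), 《请给我结果》, 《比强者更强》, Winning (Jack Welch) ...

The passing grade for a business novel is that it must not violate the most basic logic of business.
For example, the opening of 《圈子圈套》 has the customer travelling to Hong Kong to negotiate the price; even a vegetable farmer knows that would never happen.

By that standard, I recommend three books.

First, Hardball (《硬球》, on how politics is played), by Chris Matthews, a former US presidential speechwriter.

It is about politicians, but the principles carry over. A "hardball" is both hard and round, a metaphor for politicians' tricks and maneuvers. It is a classic and worth keeping.

Second, 《赢得激情》, the autobiography of Viacom chief Sumner Redstone. Unbeatably good, and fairly positive too.

Third, 《我不得不杀人》, the autobiography of a former female agent of Mossad, the Israeli intelligence service.

The analysis of 9/11 at the end is both professional and classic. It lets you understand what real competition ...
Now on to the sharing.

In 2001 the American Marketing Association voted on the idea that has had the greatest influence on American marketing in history. The winner was not Rosser Reeves's USP, David Ogilvy's brand image, Philip Kotler's framework of marketing management and customer "delivered" value, or Michael Porter's competitive value chain, but the "positioning" theory put forward by Al Ries and Jack Trout. This bible of management strategy, the most influential marketing work of all time, changed the rules of the market game; the old era of advertising and marketing is gone for good!

Ever since Jack Trout and Al Ries put forward the idea of "positioning", it has become one of the greatest words in business. But masters are masters: they never stop exploring. While people were still enthusing over the revolutionary changes that "Positioning" brought to marketing planning, the author was already exploring "Re-Positioning" or "New-Positioning". In this book the author dissects the structure and function of the human brain, cites a wealth of psychological findings, and argues convincingly for the elements, process and pitfalls of positioning and repositioning.

As practising marketing strategists, the authors differ from academic masters such as Philip Kotler and Michael Porter in that they always stay close to cases: starting from classic cases in marketing history, they distil concrete principles of business warfare that others can learn from and apply. That is true of Positioning and equally of The New Positioning. Moreover, many of these cases were created by the authors in their own consulting practice, such as the segmentation of the beverage market, the marketing plan for "莲花组件" (Lotus), and the brand positioning of Southwest Airlines ...

Positioning, The New Positioning and Kotler's Marketing Management are large files and will be sent as attachments; leave your email address if you want them.

The other books can be downloaded directly from http://www.wzdyt.com/gongxiang.asp, which is very convenient.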

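The experience curve was put forward by the American aviation industry in the 1930s. At first it was used mainly to set labour-hour quotas and estimate costs, and it worked well in production and financial management. From the 1960s, as theories of business strategy developed, the experience curve became a tool for evaluating corporate strategy and came to play a growing role in management.
1. The experience curve applied to industry cost analysis
Within an industry, when all firms sit on the same experience curve, their relative cost positions depend on their market shares. In real competition, firms pursue different basic strategies, so their experience curves differ; but firms that pursue the same strategy, such as overall cost leadership, can still be compared. In the lithium bromide central air-conditioning industry, the low-cost firm usually has the higher market share, for example Jiangyin Shuangliang; or the firm with low costs and low prices gains share faster, for example Yantai Ebara. As a Sino-Japanese joint venture, Yantai Ebara has a well-known brand and a high technical level, and sits on essentially the same experience curve as Sanyo Refrigeration, the industry leader that follows a differentiation strategy. Yet it has consistently competed on low prices with market share as its main goal, and by 2004 its sales revenue was close to that of Sanyo Refrigeration, which it had set as its main competitor. Sanyo Refrigeration, meanwhile, built its annual plans around profit targets and treated Jiangyin Shuangliang and Changsha Yuanda as its main competitors, competing with them on different experience curves. It overlooked the potential threat, and Yantai Ebara reaped the benefit.
Growth in cumulative output lowers unit cost, which is why market share is such a prominent factor in determining a firm's strategic position in its industry. The causal chain is: high market share → high cumulative output → low unit cost → high profit. In lithium bromide central air conditioning, Jiangyin Shuangliang entered the industry first, so it has the highest cumulative output, a high market share and a low unit cost, and it is the most profitable of the firms pursuing overall cost leadership; because their strategies are essentially the same, they share the same or similar experience curves. But if two competitors have different product technologies, or can reach different technical levels, their experience curves differ and so do the results. Sanyo Refrigeration, for example, entered the market in 1992 with Sanyo Japan's world-leading lithium bromide products and technology. Jiangyin Shuangliang held the cost advantage at the time, but Sanyo Refrigeration came in with new technology, that is, on a different experience curve. Although it started at a disadvantage in market share, its better price-performance ratio let it expand share rapidly, gain a firm foothold and grow quickly.
2. The experience curve applied to estimating a firm's cost trend
In a firm with a strong experience effect, the experience curve can be used to estimate costs when quoting for a tender or a large order. Sanyo Refrigeration usually sets its quotes and bids from standard cost without considering the experience curve, mainly because the firm finds it hard to determine the curve, or does not know how, and so cannot analyse it quantitatively. It therefore bids on a static, relatively fixed cost rather than a dynamic, simulated one, and may lose orders because its quotes are too high; slow order growth then leaves it further behind on the experience curve, a vicious circle. So even where precise quantification is difficult, a firm should still use its historical experience to make an approximate forecast of its own and its competitors' cost trends, and respond accordingly.
3. The experience curve applied to choosing a business strategy
Most firms in an industry would like to be distinctive, but objective and subjective constraints usually lead them to pursue overall cost leadership, using the experience curve to win on cost. What distinguishes them is their position on the curve, which determines how much they earn or lose. This is not the only route: a differentiation strategy that plays up a product's distinctive features, or a focus strategy aimed at one part of the market, can succeed as well. In lithium bromide central air conditioning, Jiangyin Shuangliang pursues overall cost leadership. As the first entrant it has a large cumulative output, so it holds a big advantage on the experience curve over later entrants with the same strategy. It can use that cost advantage to cut prices proactively, raising entry barriers, keeping out new competitors and maintaining its leadership. Faced with such a cost advantage, how do other firms challenge the leader? Sanyo Refrigeration competes with Sanyo Japan's world-leading lithium bromide technology and so has a completely different experience curve. It began with a small market share, but its price-performance advantage let it earn a high profit margin on that small share and quickly join the industry's leading group. Changsha Yuanda follows a focus strategy, putting its main resources into the direct-fired chiller segment. It too has a distinctive experience curve and has succeeded, and because direct-fired chillers make up the largest share of lithium bromide products, it has a very large influence in the industry.
As this analysis shows, the experience curve can serve as an auxiliary tool for business analysis and can inform business decisions. But overemphasizing the experience effect can cost a firm its flexibility: the experience curve stresses higher output and larger market share while neglecting technical progress and product innovation. In a globalized buyer's market in particular, low-cost, small-batch customized production is becoming a new trend, and balancing diverse customer needs against cost reduction is a new challenge. The experience curve therefore needs to be combined with the latest management techniques to shape a competitive strategy suited to the firm's own development.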
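The text does not give a formula, but the textbook form of the experience curve is easy to sketch: each time cumulative output doubles, unit cost falls to a fixed fraction of its previous level. The code below assumes that form; the 80% rate and the cost figures are illustrative assumptions, not data from the article.

```python
import math

def unit_cost(first_unit_cost: float, cumulative_units: float,
              learning_rate: float = 0.8) -> float:
    """Unit cost after `cumulative_units`, assuming cost falls to `learning_rate`
    times its previous level each time cumulative output doubles."""
    exponent = math.log2(learning_rate)  # negative for learning_rate < 1
    return first_unit_cost * cumulative_units ** exponent

# Hypothetical bid: the first unit cost 100; what might unit 4 cost on an 80% curve?
dynamic_cost = unit_cost(100.0, 4)  # 100 -> 80 -> 64 over two doublings
static_cost = 100.0                 # a standard-cost quote that ignores the curve
```

A bidder quoting from `static_cost` would price well above one quoting from `dynamic_cost`, which is the vicious circle described under point 2.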

Friday, August 08, 2008

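Weekend story: the Olympics and war

British sports administrators replaced the director in charge of amateur athletes because British amateurs won too few medals at the 2008 Olympics.

The hundred veteran generals have really lost face.

I hope the Olympics can bring good changes, for disadvantaged groups too.

Xinhua News did report the stabbing in which two American tourists were killed or injured. The Chinese man must have suffered hugely unfair treatment and had no better target (if only he could have reached higher officials!).

The UK MoD has lost more than 600 laptops in the last 4 years.

Telepresence will REDUCE business travel and is FAR BETTER than video conferencing.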

Wednesday, August 06, 2008

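fwd: The balanced scorecard and corporate competitiveness

How to measure a customer's total value: future value is tied to friction; the greater the friction, the smaller the value.

Competitiveness: 1. management (product cost control); 2. marketing capability (selling costs and the ability to push prices up); 3. technological innovation.

Types of buying behavior: complex buying behavior; simple; habitual buying behavior; variety-seeking buying behavior.

A market leader can expand the total size of the market in the following three ways:
Finding new users
When a product has the potential to attract new buyers, finding new users is the simplest way to expand the total market.
The main strategies for expanding the total market are:
● New-market strategy: target a group of users who have not used the product (a new segment) and persuade them to adopt it, for example persuading men to use cosmetics.
● Market-penetration strategy: target customers in existing segments who have not used the product, or use it only occasionally, and use price cuts, persuasion and heavier promotion to get them to adopt it or use more of it. For example, marketers of oral tonics stress the product's everyday health benefits, so that customers do not think it is only for when they are ill. If it is also used day to day, consumption rises.
● Geographic-expansion strategy: that is, ...
Finding new uses for the product

Also, I took a quick look at what the original poster covered; it amounts to data-driven marketing and curbing commercial bribery.
Data marketing is a very broad field and the poster only covered part of it. In fact, data marketing has to be tailored to the positioning of each retail store. Most retail outlets need to study basic figures such as sales per unit of floor area ("坪效"), customer traffic and average transaction value; how to collect and analyse these figures is where a manager's ability shows. The poster could give more real operational cases and share them in handbook-style language.
Anti-bribery also matters much more than the poster suggests. Commercial bribery now takes more varied forms, and the state's fundamental aim in fighting it is not only to protect the investors in retail stores but also to regulate the retail business environment and protect suppliers and manufacturers.

If you want to learn about retail business, I suggest visiting 联商网 often.

Philip Kotler even says: you are not selling a product by means of price, you are selling the price.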

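The models most often seen in purchase decision research are as follows.

(1) Engel-Kollat-Blackwell Model (EKB model)
The EKB model was proposed by Engel et al. (1978). Its emphasis is that the consumer's decision process is one integrated procedure rather than a series of disconnected actions. It is centred on the consumer decision process as a way of solving the problem the consumer faces, and passes through five stages:
1. Need recognition: the first stage, in which the buyer recognizes that a problem or need exists.
2. Information search: the second stage, in which a consumer whose desire to buy has been aroused gathers more information. The consumer may simply pay closer attention to information, or may search actively.
3. Alternative evaluation: (a) brand image: the beliefs held about a particular brand; (b) evaluation procedure: forming an attitude toward each brand. Consumers usually use one or more evaluation procedures to assess products, for example on quality, size or price.
4. Purchase decision: the fourth stage, in which the consumer actually buys the product. It covers whether to buy, when to buy, what to buy, where to buy and how to pay.
5. Post-purchase behavior: the fifth stage, the follow-up behavior that results from the outcome of the purchase, including satisfaction and post-purchase dissonance. Both experiences usually enter the consumer's memory and influence later purchase decisions, feeding into the next purchase process.

(2) Engel-Blackwell-Miniard Model (EBM model)
The EBM model was advocated by Engel et al. (1993). It covers all activities and opinions related to the consumer's purchase of a product or to the purchase process: that is, all the activities in which consumers are directly involved in obtaining, consuming and disposing of products or services, including the decision processes that precede and follow these activities.

(3) Kotler Model
Kotler (1994) holds that external marketing stimuli and environmental stimuli pass through the consumer's "black box" to produce a purchase decision, and that differences in personal characteristics and decision processes lead to different purchase responses. The task of marketing is to understand what happens in the consumer's mind between stimulus and response. The whole process covers three kinds of factors: environment, individual differences and psychological processes.

(4) Howard-Sheth Model
Howard and Sheth (1969) proposed the Consumer Decision Model (CDM). The model is built from six basic variables: information, brand recognition, attitude, confidence, purchase intention and purchase (purchase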

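Only when there is demand can marketers sell their products to the market. In this sense, marketing is the activity of creating demand.
In this sense, the essence of marketing is the activity and process of turning potential exchange into actual exchange.
Buying a brand-name product is a typical routinized transaction. In such transactions both marketer and customer save effort, time and expense. Hence the concept of relationship marketing. By contrast with relationship marketing, the earlier kind of marketing is transaction marketing.

But customers' needs differ in every market, and producing a single product cannot satisfy all of them; yet giving every customer a product tailored to their own requirements is not economical. A market that is simply an aggregate of buyers therefore cannot be marketed to effectively, in the marketer's eyes. So customers with different needs have to be separated out, and this is the concept of market segmentation. Once the market is segmented, a firm can choose, according to its resources, technical strengths and competitive ability, some of the segments in which to offer products and services. The segments a firm chooses to serve are its target market. Market segmentation and the target market are core concepts of modern marketing.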

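Selling emphasizes the needs and interests of the seller; marketing emphasizes the needs and interests of the buyer.

The Boston matrix method, or simply the BCG method, conventionally uses a 10% growth rate as the dividing line between high and low growth.
When growth rate and market share are the only factors to be examined and all others can be ignored, you have the BCG method; the BCG method is therefore simply a special case of the GE method.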
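A minimal sketch of the BCG classification under the 10% growth cut-off mentioned above. The share cut-off of 1.0 (relative to the largest competitor) and the quadrant names are the conventional ones and are assumptions here, since the note does not state them.

```python
def bcg_quadrant(market_growth: float, relative_share: float,
                 growth_cutoff: float = 0.10, share_cutoff: float = 1.0) -> str:
    """Place a business unit in one of the four BCG quadrants.

    market_growth: annual market growth rate, e.g. 0.15 for 15%.
    relative_share: own share divided by the largest competitor's share
                    (the 1.0 cut-off is a conventional assumption).
    """
    high_growth = market_growth >= growth_cutoff
    high_share = relative_share >= share_cutoff
    if high_growth and high_share:
        return "star"
    if high_growth:
        return "question mark"
    if high_share:
        return "cash cow"
    return "dog"
```

For example, `bcg_quadrant(0.15, 0.4)` gives `"question mark"`: a fast-growing market in which the unit is still a small player.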
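Main differences between transaction marketing and relationship marketing

Transaction marketing | Relationship marketing
Averaged customers | Individualized customers
Anonymous customers | Named customers
Standardized products/services | Customized products/services
Mass distribution | Individual distribution
Mass promotion | Individual incentives
One-way information | Two-way information
Economies of scale | Economies of scope
Market share | Customer share
All customers | Profitable customers
Attracting customers | Retaining customers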

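The balanced scorecard is guided by organizational strategy. It looks for the key success factors that drive strategic success and builds a system of measures closely and causally linked to them, in order to track the state of strategy execution and make the changes needed to keep the strategy succeeding. It works by striking a balance among four often conflicting measurement perspectives, tying the strategy set by management to activity at the operational level.

This shows in the "four perspectives" and the causal relationships among them.
First, measures are set from four perspectives according to the organization's strategy.
1. Financial perspective: the aim is to answer "How do shareholders see us?" It mainly asks whether managers' efforts have had a positive effect on the firm's economic returns, and is therefore both the starting point and the destination of the other three perspectives. Financial measures mainly include revenue growth measures such as sales and profit, cost reduction or productivity measures, and capital utilization or investment strategy measures. Because financial data are so important to managing a firm effectively, financial goals are usually the ones managers consider first.

2. Customer perspective: the aim is to answer "How do customers see us?" "The level of customer satisfaction is the key to a firm's success or failure", so a modern firm's activities must start from customer value. From the customer's point of view, the firm is judged on time (delivery lead time), quality, service and cost, with attention to market share and to customers' needs and satisfaction. Customer measures reflect how the firm responds to changes outside it, and mainly include market share, customer retention, customer acquisition rate, customer satisfaction, customer profit contribution, on-time delivery rate, product return rate and number of cancelled contracts.

3. Internal business process perspective: the aim is to answer "What do we excel at?" It reflects the firm's internal efficiency, focusing on the processes, decisions and actions that improve overall performance, especially those with a major effect on customer satisfaction. The main measures are: (1) measures of innovation capability, such as new product development time, the share of total sales from new products, and the ratio of development spending to operating profit; (2) measures of production and operating performance, such as production time and operating cycle time, the quality of products and services, and the cost of products and services; (3) measures of after-sales service performance, such as response and handling time for product faults, the first-time fix rate of after-sales service, and customer payment time.

4. Learning and growth perspective: the aim is to answer "Are we improving?" It directs attention to the foundations of the firm's future success, involving people, information systems and market innovation, and assesses the firm's capacity for sustained development. It mainly includes: (1) measures of employee capability, such as employee satisfaction, employee retention, employee productivity and number of training sessions; (2) measures of information capability, such as information coverage, information system response time, and the ratio of information currently available to information needed; (3) measures of motivation, empowerment and cooperation, such as the number of suggestions made by employees, the number adopted, and the degree of cooperation between individuals and departments. The measures are linked by their causal relationships into a mutually reinforcing chain, and the "balance" across all four perspectives is used to pursue the organization's overall performance and healthy development.

Although each balanced scorecard measure has its own specific content, the measures are not isolated or cut off from one another; they often conflict yet are inseparable. As Kaplan put it: "The four dimensions of the balanced scorecard are not a mere list. A scorecard made up of the learning, process, customer and financial dimensions contains both outcome measures and the leading measures that drive those outcomes, and there are causal relationships among these measures." At the root of this internal logic is the financial perspective that investors care about, but investment returns come out of a value-creating process. Employees' innovation and learning come first, which makes better internal management possible; better internal management allows better service to customers; customers who accept the firm's products and services then buy; only then is the firm's value realized and a return earned on investment. Once the firm has moved forward a step, new situations arise and employees need to innovate and learn again, starting the next cycle. The result is a complete, balanced system of linked measures. At the same time, to ensure the strategy is executed effectively, the BSC uses causal chains to combine financial measures with non-financial strategic measures in its evaluation system, including both outcome measures and driver measures, which makes it a feed-forward management control system. When the measures are in balance they interact positively; when one drifts away from its target and conflicts arise, the mechanisms of coordination, communication and evaluation work to restore balance between financial and non-financial measures, leading and lagging measures, long-term and short-term measures, and external and internal measures.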

Sunday, August 03, 2008

The development of the Tesco Clubcard

Get a VMware certification and you get a job!

VDI (Citrix infrastructure)

Spam bot welcomed (Graveyard, spammer's email will be listed and available for email harvesters)

Two names: the Likert scale (five-level questionnaire answers) and Osgood’s Semantic Differential Procedures (describing a thing or concept by rating it along several adjective dimensions). Not though, generally speaking, two concepts on the lips of every supermarket employee.

Early on: Out of the 45,000 lines, 8,500 accounted for 90 per cent of sales. Working with that number would inevitably be quicker and easier, and common sense suggested that it could yield almost as much insight as if the other 36,500 lower-sales-contributing lines were also included.

To the team’s excitement, when they examined the list of products in each cluster, they seemed to make sense. The team settled on 27 different clusters, which became its first customer segments. This was given the catchy title of ‘Tesco Lifestyles’.

ways; for example, there was a ‘Snacking and Lunch Box’ Bucket.
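The "8,500 of 45,000 lines account for 90 per cent of sales" shortcut is a cumulative-share cut. A minimal sketch follows, assuming sales are available as a mapping from product line to sales value; the function name and sample figures are invented for illustration and are not Tesco data.

```python
from typing import Dict, List

def top_lines_by_sales(sales_by_line: Dict[str, float],
                       target_share: float = 0.90) -> List[str]:
    """Return the best-selling lines that together reach `target_share` of sales."""
    total = sum(sales_by_line.values())
    kept: List[str] = []
    running = 0.0
    ranked = sorted(sales_by_line.items(), key=lambda item: item[1], reverse=True)
    for line, sales in ranked:
        if running >= target_share * total:
            break  # the lines kept so far already cover the target share
        kept.append(line)
        running += sales
    return kept

sample_sales = {"milk": 50, "bread": 30, "olive oil": 15, "saffron": 5}
core_lines = top_lines_by_sales(sample_sales)
```

With the sample figures, milk and bread cover only 80% of sales, so olive oil is added as well and saffron is left out.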

‘Why not turn Lifestyles upside down?’ they reasoned. Take each
product, and attach to it a series of appropriate attributes, describing
what that product implicitly represented to Tesco customers. Then by
scoring those attributes for each customer based on their consistent
shopping behaviour, and building those scores into an aggregate measurement
per individual, a series of clusters should appear that would
create entirely new segments.
New method:
They then set about imagining 50 things that our shopping baskets might say
about customers. What does it mean if we buy a lot of ready meals? A lot
of fresh produce? No meat? Did we like to try out new products, or
exotic ingredients? Are we motivated by price promotions?
Measuring customers on a number of these criteria could start to create
distinct profiles
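The "turn Lifestyles upside down" idea, scoring customers through the attributes of what they buy, can be sketched as a quantity-weighted average. The attribute names, scores and the averaging rule below are assumptions made for the example; the book excerpt does not say how the aggregate measurement was actually computed.

```python
from typing import Dict

def customer_profile(basket: Dict[str, int],
                     product_attributes: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each product attribute over a customer's purchases, weighted by quantity.

    Assumes every product carries scores for the same set of attributes.
    """
    totals: Dict[str, float] = {}
    items = 0
    for product, quantity in basket.items():
        attributes = product_attributes.get(product)
        if attributes is None:
            continue  # product not yet scored, so it is skipped
        items += quantity
        for name, score in attributes.items():
            totals[name] = totals.get(name, 0.0) + quantity * score
    if items == 0:
        return {}
    return {name: total / items for name, total in totals.items()}

# Hypothetical attribute scores on a 0-1 scale.
attributes = {
    "ready meal": {"ready_to_eat": 1.0, "fresh": 0.0},
    "lettuce": {"ready_to_eat": 0.0, "fresh": 1.0},
}
profile = customer_profile({"ready meal": 3, "lettuce": 1}, attributes)
```

The resulting per-customer vectors (here 0.75 for `ready_to_eat` and 0.25 for `fresh`) are what a clustering step would then group into segments.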


Building an Osgood profile for 45,000 products? By creating 20 scales on which to judge the attributes of every
product in the store, it could then create 20 numerical measures. Turning
numbers into insight was becoming a Clubcard speciality.
But what scales to choose? ‘Low fat’ against ‘high fat’, ‘big carton’
against ‘small carton’, ‘needs preparation’ against ‘ready to eat’, and
‘low price’ against ‘high price’ are just a few of the two-tailed Likert
scales that they ended up choosing. There were also single-tailed
measures, such as ‘Is it a promotion?’ and ‘Is it a Major Brand product?’
With 20 scales agreed as a way of grading every product on its shelves,
all that the team had to do was to produce the Osgood Profiles. That is,
45,000 Osgood profiles, one for every product from anchovies to
asparagus, whisky to washing powder. But judging 45,000 products on
20 different scales would mean agreement on 1.2 million individual
ratings before the segmentation could be used.

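Using a snowball method to derive each product's Osgood attributes: first pick the most "adventurous" products, then look at similar products in the same baskets and pick up further, less reliably "adventurous" products, until "Fresh" becomes the better-fitting category.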

had ever tried to distinguish how ‘adventurous’
every product in a supermarket is. Tinned fish probably isn’t; extra
virgin olive oil is. Is Brie adventurous? How adventurous is it? More
than decaffeinated coffee? Less than a red pepper?
They set about devising a way to allocate attributes for every item.
The process created was known as the Rolling Ball. To create a Rolling
Ball categorization, Pavey and his team started with a small set of
products that definitely have the quality you seek: so if you want to find
out which products are adventurous, start with extra virgin olive oil and
ingredients for Malaysian curries, and see which customers bought
those products.
Then look at what else these customers have in their shopping basket.
Discard items that show up in everyone’s basket (bananas or milk, for
example), and keep looking, building bigger and bigger groups of
products. When can the process stop? This is where the rolling ball idea
came in.
The products that are picked up early will have a high ‘adventurous’
rating. As the ball gets bigger, those ratings are probably lower, and
certainly less reliable. So how to stop the ball? Well, the basic idea was
that each of the major attributes were large dips in a huge surface. When
the ball starts to roll into an adjacent hole, then the ball should stop. For
example, you might start off trying to predict adventurous products, but
after 400 or 500 products are coded, you start to find a lot of products
that are more ‘Fresh’ than ‘Adventurous’, and so the ball has started to
roll down an adjacent hole. The mathematics to solve this problem were
challenging, but the method created groups of products that intuitively
seem right.
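A rough sketch of the Rolling Ball idea under strong simplifying assumptions: baskets are plain sets of product names, "items that show up in everyone's basket" are filtered with a lift threshold, ratings simply decay with each round, and the ball stops after a fixed number of rounds. The real stopping rule (detecting when the ball rolls into an adjacent attribute) is described as mathematically challenging and is not attempted here.

```python
from collections import Counter
from typing import Dict, List, Set

def rolling_ball(baskets: List[Set[str]], seeds: Set[str],
                 rounds: int = 2, min_lift: float = 1.5) -> Dict[str, float]:
    """Grow a group of products from `seeds`; later pickups get lower ratings."""
    n = len(baskets)
    base = Counter(product for basket in baskets for product in basket)
    rating = {product: 1.0 for product in seeds}
    group = set(seeds)
    for round_no in range(1, rounds + 1):
        hits = [basket for basket in baskets if basket & group]
        if not hits:
            break
        inside = Counter(product for basket in hits for product in basket)
        new = set()
        for product, count in inside.items():
            if product in group:
                continue
            # Lift near 1 means the product is in everyone's basket: discard it.
            lift = (count / len(hits)) / (base[product] / n)
            if lift >= min_lift:
                new.add(product)
        if not new:
            break
        for product in new:
            rating[product] = 1.0 / (round_no + 1)  # picked up later, rated lower
        group |= new
    return rating

sample_baskets = [
    {"olive oil", "brie", "milk"},
    {"olive oil", "brie", "bananas", "milk"},
    {"milk", "bananas", "tinned fish"},
    {"milk", "tinned fish"},
    {"milk", "bananas"},
    {"milk", "tinned fish", "bananas"},
]
adventurous = rolling_ball(sample_baskets, {"olive oil"})
```

In this toy data, brie is picked up because it appears far more often alongside olive oil than in baskets overall, while milk and bananas are discarded as everyday items.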

Each time a cluster became apparent, fewer shoppers remained lost in
20-dimensional space. After six months, 13 well-defined and tested
groups had been identified. But the 14th made no sense.
To tell apart the sub-groups within this cluster, an extra dimension was added: shopping habits.
To make the segmentation work well, an extra
segmentation was born, Shopping Habits, which used not just what
people bought, but when people shopped.

Saturday, August 02, 2008

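Slow delivery

If you don't want to go out to buy things, post an ordinary letter, or collect something from someone, you could consider a slow-delivery service.

About 3 days?

The post office or taxi companies could offer this as a value-added service.

The six principles of social engineering: human weaknesses

BBC - BBC Three Programmes - The Real Hustle - Previous episodes: selling one car to several buyers
A hollow big bag used to steal a small bag
Using Bluetooth to dial premium-rate lines
Posing as a postman to dial premium-rate lines
A girl staging a collision scam
Burning the barman's signed banknote (get someone to watch for the swap, buy drinks), then bet another £20?

What have you done that for? You came out from nowhere.

battery is flat

fiver: £5

bigger bag catch small bag

charge: wrong car park. Just stick up a note saying the machine is broken and it is broken; collect the money.

1. authority
2. liking (being likeable)
3. reciprocation (the obligation to return a favour)
4. consistency (people subconsciously keep their words and actions consistent)
5. social validation (subconsciously conforming to others)
6. scarcity (strongly wanting scarce, even useless, resources)

For example, research by American scholars holds that the core values of the average American are: work, independence, marriage, willingness to support charity, and honesty.
Research by Chinese scholars holds that the core values of Chinese people are: labour, fame, thrift, marriage, deference to superiors, and caution in speech.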

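The consumer goods market has the following characteristics:
(1) Small scale. Consumer goods are bought by individuals and households, so each transaction is relatively small in quantity and value; purchases are mostly piecemeal and frequent.
(2) Dispersion. Consumers are spread over a wide area, from cities to villages; consumption is everywhere.
(3) Price sensitivity. In any industry, the great majority of buyers want a better product at a lower price. There is "luxury" consumption, but that is generally limited to a small, very wealthy group; there is brand loyalty, but no brand loyalty in the world cannot be cancelled out by a two-cent price cut. Buyers in the consumer market therefore mostly seek the lowest possible transaction price. So only firms that can offer a more valuable product than competitors at the same price, or the same product at a lower price, gain more marketing opportunities.
(4) Mutual influence. Consumers clearly influence one another's purchases, because in

According to Herzberg's two-factor theory, if a person's "hygiene factors" are not satisfied, the result is destructive; if a person's "motivators" are not satisfied, the result is not destructive. But satisfying the "hygiene factors" does not make people more motivated.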
