Sunday, December 16, 2007

fwd: Good books, and notes from reading SQL 2005 DM

Beyond Java, by Bruce Tate, the author of Better, Faster, Lighter Java. He likes to show off his river-rafting skills and then draw a clumsy analogy from them, so I basically skimmed straight past the opening of every chapter. Still, the book is quite interesting: the author analyzes the strengths and weaknesses of Java versus other languages (mainly dynamic languages) and explores the future of programming languages. It is a thin little book, about 190 pages of content.

3 Web Services Essentials, from O'Reilly. The first half of this book is quite well written; the second half can basically be skipped: the tools it covers are simply too old.

4 XML Hacks, also from O'Reilly. One hundred hacks about XML, very practical.

5 Semantic Web Principles and Technologies (语义网原理与技术), a book published in China. After glancing at the table of contents I expected yet another collection of generalities, but after finishing it I found it quite good: many of my earlier doubts and guesses found their answers in this book. The author clearly has a thorough understanding of the Semantic Web. Although many passages have long been worn out by other writers, including that extremely clichéd Semantic Web layered-architecture diagram, it still counts as a good book.

6 Pragmatic Unit Testing, the second part of The Pragmatic Starter Kit. Its examples use JUnit, so it reads quickly. If you don't know what unit testing is about, this is a good book.



Data.Mining.with.SQL.Server.2005: I couldn't make sense of this book, and it is clearly not detailed enough to serve as a reference manual either. Don't read it.



A case table is usually a dimension table, whereas a nested table is a fact table.

The concept of nested cases proposed in OLE DB for Data Mining is extremely important. It helps you model complicated cases with one-to-many relationships, and it adds a lot of expressive power for model building. Without the nested-case concept, you would need to pivot the nested table into case-level attributes during the data transformation stage.

A model built with nested cases:

Create Mining Model MemberCard_Prediction
(
    CustomerId Long Key,
    Gender Text Discrete,
    Income Long Continuous,
    MemberCard Text Discrete Predict,
    Purchase Table (
        ProductName Text Key,
        Quantity Long Continuous
    )
)
Using Microsoft_Decision_Trees
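To train a model that has a nested table, DMX uses Insert Into together with a Shape statement that relates the case rowset to the nested rowset. A rough sketch against the MemberCard_Prediction model above; the source tables Customers and Purchases and the data source name MyDataSource are made-up examples:

```sql
Insert Into MemberCard_Prediction
    (CustomerId, Gender, Income, MemberCard,
     Purchase (Skip, ProductName, Quantity))
Shape
    { OpenQuery(MyDataSource,
        'SELECT CustomerId, Gender, Income, MemberCard
         FROM Customers ORDER BY CustomerId') }
Append
    ( { OpenQuery(MyDataSource,
        'SELECT CustomerId, ProductName, Quantity
         FROM Purchases ORDER BY CustomerId') }
      Relate CustomerId To CustomerId )
    As Purchase
```

Skip marks the nested rowset's foreign-key column, which is needed only to relate the two rowsets and is not an attribute of the model.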

In DMX, a mining model is considered the same as a relational table. Conceptually,
a trained mining model can be considered a truth table.
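Since a trained model can be queried like a table, you can also browse what it learned through its content rowset; for a decision tree model each returned row is a tree node. A minimal sketch against the MemberCard_Prediction model above, using a few of the standard content columns:

```sql
Select NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
From MemberCard_Prediction.CONTENT
```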


That is, a single cell can itself be a table selected by TopCount.

As the result of TopCount is a table column, we can apply a sub-select clause to the result of the TopCount function (or of any prediction function that returns a table column). The following is an (abbreviated) example:

Select CustomerId, Gender,
    (Select MemberCard, $Probability As Proba
     From TopCount(PredictHistogram(MemberCard), $Probability, 2)
    ) As ProbabilityHistogram

Using Natural Prediction Join to save specifying column names

If the column names of the input case match the column names of the mining model, you don't need to specify the On clause in the Prediction Join query. Instead, you can use Natural Prediction Join. It works for both the batch prediction query and the singleton prediction query. For example:

Select CustomerId, MemberCard
From MemberCard_Prediction Natural Prediction Join
(Select 'Male' As Gender, '35' As Age, 'Engineer' As Profession,
 '60000' As Income, 'Yes' As HouseOwner) As customer
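A singleton input can also carry a nested table, which is how you supply a nested case for prediction without staging it in a real table. A sketch, again assuming the MemberCard_Prediction model above; the product rows are made-up values:

```sql
Select Predict(MemberCard)
From MemberCard_Prediction
Natural Prediction Join
(Select 'Male' As Gender, 60000 As Income,
    (Select 'Milk' As ProductName, 2 As Quantity
     Union
     Select 'Bread' As ProductName, 1 As Quantity) As Purchase) As t
```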


SQL Server Data Mining is a server-based solution: its data sources must be accessible from the server side. In general, when mining local data, you should move the data into a SQL Server database using SQL Server Integration Services (SSIS) before building your models.

Mining structures and mining models can be edited in two modes, immediate and offline (the latter allows source control such as CVS, but changes must be published to the server).

A snowflake schema is the general case of a star schema: star schemas can be considered a special kind of snowflake schema in which there are no lookup tables.

Measures are the numeric values to be aggregated by the cube. Each measure specifies an aggregate function, such as Sum, Min, Max, Average, Distinct Count, and so on.

How a cube is built: a cube contains a set of dimensions and measures. There are two steps to processing a cube: dimension processing (if a dimension hasn't been processed previously) and cube processing. Dimension processing reads dimension data from the underlying dimension tables, builds the dimension structure, creates hierarchies, and assigns members to the proper levels of the hierarchy. After all the dimensions are processed, cube processing can start.

The main task of cube processing is to precalculate aggregations based on the dimension hierarchies. When there are many dimensions and each dimension contains several levels and many members, the total number of aggregations can be exponential.

One of the challenges of cube processing is to
choose the optimal number of aggregations to precalculate. Other aggregated
values can be derived from those precalculated measures efficiently. For example,
if the monthly Store_Sales values are preaggregated, quarterly and yearly
Store_Sales values can be derived easily.

Unified Dimensional Model (UDM): you can think of a UDM as a combination of cubes and dimensions.

OLAP is useful for mining: in many cases, patterns can be found only in aggregated data; it is difficult to discover patterns directly from the fact table.


What is a data mining dimension? For decision trees, the dimension members represent the tree nodes. The Model_Content schema rowset is like a parent-child dimension, where each node has a parent node. If you ask to create a data mining dimension in the Data Mining Wizard (DMW), then after processing the mining model, Analysis Services creates a new dimension based on the mining model content.

The ADO libraries wrap the OLE DB
interfaces into objects that are easier to program against.


data, much as ADO was
created for native languages. The philosophy of ADO.NET is somewhat different
from that of ADO in that ADO.NET is designed to work in a “disconnected”
mode, where data can be accessed and manipulated without
maintaining an active connection

You should use ADOMD.NET when writing data mining client applications
except when .NET is not available.

The ability you gain by using the ADOMD.NET object model is that you can iterate mining model content in a natural, hierarchical manner using objects, instead of trying to unravel the flat schema rowset form.

Singleton (single-case prediction)

ADO.NET can do everything in DM, but everything must be done manually.

Stored procedures let you avoid shipping a result from server to client in cases like 1,000 trees × 1,000 nodes where only a few nodes match the filter condition.

The clear advantage of ADOMD+ is that all of the content is available on the server, and you can return only the information you need to the client. You can call UDFs by themselves, using the CALL syntax, or as part of a DMX query. For example:

CALL MySprocs.TreeHelpers.FindSplits('Generation Trees','HBO')


Decision trees can easily overlook highly correlated attributes.

In many cases, the Microsoft Decision Trees algorithm performs well for association tasks. However, decision trees may ignore patterns in some cases. For example, The Godfather, Godfather II, and Godfather III are highly correlated. In the tree for Godfather III, the first split is on Godfather II. The Godfather and Godfather II are highly correlated: those who like Godfather II also like The Godfather. Since the split on Godfather II is almost equivalent to splitting on The Godfather, after the first split there aren't any further splits on The Godfather. The dependency network is generated based on the top three levels of decision tree splits. As a consequence, no link exists between The Godfather and Godfather III in the dependency network. For associative prediction queries given The Godfather, the predicted results won't contain Godfather III, since this pattern is covered by Godfather II.

This phenomenon is due to the nature of decision trees: they may hide information if some input attributes are highly correlated. The general recommendation is to try several different algorithms.
