Sunday, December 30, 2007
Friday, December 28, 2007
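Remarkably Aggressive Direct Selling
Selling All Across Yunnan: How "书写王" Made Marketing Dance on the Lecture Podium (谭霁刚, 中国营销传播网, 2004-12-10)
Marketers, especially those of us who run direct-response marketing companies, are always traveling from place to place, and spend more than half of each year on the road and in hotels. Inevitably we come into contact with people we never knew before, and a lot of so-called business information reaches us through these strangers. Information obtained this way may take a person to heaven, or may push a person into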
Thursday, December 27, 2007
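New Year Outlook: The Airplane Topic Continues
I touched on this topic two years ago. 张斗三 still puts it best: make some money first, then build airplanes.
Farmer friends, keep going! People with dreams need not fear ridicule. Aren't people who buy lottery tickets far more laughable?
This year is the 100th anniversary of the invention of the helicopter, and I discovered that the tail rotor is really useful: without it the fuselage would spin round and round. Why, when I imagined building a helicopter from a motor or an electric fan as a child, did I never think of this anti-torque problem? A helicopter is not a bamboo-copter (竹蜻蜓), after all.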
Wednesday, December 26, 2007
Tuesday, December 25, 2007
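Java Reading Notes
How to build a .jar in NetBeans: 1. make sure the main class is set; 2. clean and build
J2EE Design Patterns by Crawford
Purpose of EJB: abstract the persistence mechanism
Message types: document, event, command
Causes of antipatterns: inexperience; unreadable code; copy-and-paste development. Rather than pasting, abstract the shared code so that it can be inherited
Prefer stateless beans where possible, to save container resources. Session facade: (a few stateful bean layer hiding more stateless bean?)
Round Tripper antipattern (misusing the network slows calls down). How to reduce it: merge the many calls into one by using a DataTransferObject and a facade. The facade handles a single DTO transfer and, in place of the network calls, performs the n individual method calls (getPersonaldata(0), gPd(1)...) itself
MagicServlet AntiPattern: too much code in a single doGet()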
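import java.io.Serializable;
import java.util.List;
import java.util.Vector;
// DTO that carries the whole list of people across the network in one call
public class PeopleDTO implements Serializable {
    private List people;
    public PeopleDTO() { people = new Vector(); }
    public List getPeople() { return people; }
    public void addPerson(Person person) { people.add(person); }
}
In the facade SessionBean, loop over people.iterator(): LocalEJBPerson ejbPerson = (LocalEJBPerson) i.next();
dto.addPerson(new Person(ejbPerson.getFirstName(), ejbPerson.getLastName())); // it is already a collection; people.add is used just as it is without the facade, ONLY changed by being wrapped in PeopleDTO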
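Golden hammer: overusing EJB simply because one is used to it
MVC (model, view, controller tier)
Optimistic locking pattern: copy the object and update the copy; on commit, report an error if the version number has changed in the meantime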
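The optimistic locking note above can be sketched in plain Java. This is only an illustration under assumptions: the names VersionedRecord, OptimisticStore, read and commit are hypothetical and do not come from the book, and an in-memory map stands in for the persistent store, where a real J2EE application would more likely keep the version number in a database column.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value object: the data plus the version it was read at.
final class VersionedRecord {
    final String value;
    final int version;

    VersionedRecord(String value, int version) {
        this.value = value;
        this.version = version;
    }
}

// Hypothetical in-memory store standing in for the persistence tier.
final class OptimisticStore {
    private final Map<String, VersionedRecord> rows = new HashMap<String, VersionedRecord>();

    synchronized void insert(String key, String value) {
        rows.put(key, new VersionedRecord(value, 1));
    }

    // Hands the caller its own copy to work on; no lock is held while it edits.
    synchronized VersionedRecord read(String key) {
        VersionedRecord current = rows.get(key);
        return new VersionedRecord(current.value, current.version);
    }

    // The commit succeeds only if nobody else committed since the copy was read.
    synchronized void commit(String key, VersionedRecord copy, String newValue) {
        VersionedRecord current = rows.get(key);
        if (current.version != copy.version) {
            throw new IllegalStateException("stale version " + copy.version
                    + ", current version is " + current.version);
        }
        rows.put(key, new VersionedRecord(newValue, current.version + 1));
    }
}
```

If two callers read the same row and both try to commit, the first commit bumps the version and the second one fails with an error instead of silently overwriting the first.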
Domain object model: the business tier and the center of the app. It defines entities that would otherwise live only in the database, and processes that would otherwise be scattered. Step 1: vision; 2: gather use cases; 3: find object candidates (for the use cases)
Serialized Entity pattern: turns the entity into a binary format, so future changes to the class are hard. Enhanced by the Tuple Table pattern
GET requests in HTTP/1.1 can be cached
Process (business) logic is for verbs, and domain logic is for nouns
J2EE: each container can host multiple web applications; each app should have a deployment descriptor, web.xml, in the /WEB-INF directory, which declares and maps its resources
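1. Data mining methods and models (Wiley), Figure 2.25: how to force a linear relationship using the bulging rule. Heuristic rule for the third quadrant of the circle (X down, Y down): move down the ladder of re-expressions
$t^{-3},\ t^{-2},\ t^{-1},\ t^{-1/2},\ \ln(t),\ \sqrt{t},\ t^{1},\ t^{2},\ t^{3}$,
then, after transforming the variables, check whether the relationship has become linear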
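The "transform, then check for linearity" step can be sketched as follows. This is an illustration under assumptions, not the book's procedure: it re-expresses only x, requires x > 0, and scores each rung of the ladder by the absolute Pearson correlation with y, which is one simple way of checking linearity that I am assuming here. The class and method names are hypothetical.

```java
final class LadderOfPowers {
    // Rungs of the ladder; 0 stands for ln(t) and 0.5 for sqrt(t).
    static final double[] LADDER = {-3, -2, -1, -0.5, 0, 0.5, 1, 2, 3};

    // Re-expression of t for one rung of the ladder (t must be positive).
    static double reexpress(double t, double power) {
        return power == 0.0 ? Math.log(t) : Math.pow(t, power);
    }

    // Plain Pearson correlation coefficient between x and y.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Returns the rung whose re-expression of x is most linearly related to y.
    static double bestPower(double[] x, double[] y) {
        double best = 1.0;
        double bestAbsR = -1.0;
        for (double power : LADDER) {
            double[] tx = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                tx[i] = reexpress(x[i], power);
            }
            double absR = Math.abs(pearson(tx, y));
            if (absR > bestAbsR) {
                bestAbsR = absR;
                best = power;
            }
        }
        return best;
    }
}
```

In practice the bulging rule only tells you which direction to move on the ladder; a scatter plot of the transformed variables is still the final check.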
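2. Web DM: turning customer data into customer value
Dropping all unprofitable customers in order to raise profit is not a good approach (the cost per remaining customer goes up, because some costs are fixed)
Customer information keeps changing, for example addresses, which makes it hard to retain old information
Four sources of customers: solicited, incentivized, voluntary, and referred (the last kind is usually high-value on both sides, but is hard to record in a database)
The longer a service call lasts, the more likely it is a complaint call? (I suspect it is more likely a support call)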
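Customer value trend (beta value) = f(y)/f(x), computed as follows:
f(y): multiply the first period's y by 1, the next period's by 2, and so on, then subtract (the sum of x times the mean of y) from that weighted sum: $f(y) = \sum_{i=1}^{N} i\,y_i - \left(\sum_{i=1}^{N} x_i\right)\bar{y}$, where
the sum of x is $\sum_{i=1}^{N} x_i = N(N+1)/2$ (the periods are numbered $x_i = i$)
f(x): the sum of squares of the first N integers minus the product of their mean and their sum: $f(x) = \sum_{i=1}^{N} i^2 - \bar{x}\sum_{i=1}^{N} x_i$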
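The beta value described above is the least-squares slope of y against the period number. A minimal Java sketch, assuming periods are numbered 1..N and at least two periods are supplied (the class and method names are hypothetical):

```java
final class ValueTrend {
    // y[i] is the customer value in period i + 1; periods are numbered x = 1..N.
    static double beta(double[] y) {
        int n = y.length;
        double sumX = n * (n + 1) / 2.0;   // 1 + 2 + ... + N
        double meanX = sumX / n;
        double sumY = 0;
        double weightedY = 0;              // 1*y1 + 2*y2 + ... + N*yN
        double sumSqX = 0;                 // 1^2 + 2^2 + ... + N^2
        for (int i = 0; i < n; i++) {
            int x = i + 1;
            sumY += y[i];
            weightedY += x * y[i];
            sumSqX += (double) x * x;
        }
        double meanY = sumY / n;
        double fy = weightedY - sumX * meanY;
        double fx = sumSqX - meanX * sumX;
        return fy / fx;
    }
}
```

For the values 10, 12, 14, 16 over four periods, f(y) = 140 - 10 * 13 = 10 and f(x) = 30 - 2.5 * 10 = 5, so beta is 2: the customer's value grows by 2 per period.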
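DoubleClick once acquired Abacus, a provider of offline consumer-behavior data
Trackers (internet bugs / clear GIFs) are placed in many locations that carry no ads, so even people who never saw any ad have their movements tracked
How many companies are hiring the best people? (The absolute best, or the best among the applicants?) And are the best people the most beneficial to a company's growth?
Anyone can grasp only part of the truth, and any truth applies only in some places. Taken all together, truths may also call one another into question. Skepticism: it is impossible to describe everything precisely, so optimal control cannot be achieved
Credit Suisse's data mining routine: simple and practical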
The analysis starts with simple statistics of single variables (e.g., mean, std.
deviation). Analysis is then extended to pairs of variables, in order to detect
correlation. One of the variables is typically the target field. There is a tradeoff
in the use of variables appearing with a high correlation to the target variable.
On the one hand, they are suited for building the mining model. On the other
hand, an information leaker might be found that needs to be excluded, because
it is likely to produce bad models. The only way to solve this problem is a first
interaction with business people, discussing the meaning of the high correlation.
Then the modeling process starts. A straightforward way is to build a decision
tree with all non information leaking variables. The data mining suite DARWIN
uses a parallelized implementation of the CART algorithm [1].
Decision trees have the advantage (in contrast for example to artificial neural
networks) of being able to show understandable rules as well as giving a quick overview of the quality of the model
Correct classifications also have a cost; ROC / TQ can be used
Three categories of costs were estimated. Direct monetary costs are easiest to
estimate: design and printing costs of the mailed documents, cost of the address
list, stuffing envelopes, postage. Then, the cost of training the commercial staff
for this very marketing programme, negotiation and case management must also
be estimated. Thirdly, the "psychological" or opportunity cost of a misdirected
targeting must be estimated. A client being offered products that don't interest
him can become dissatisfied or annoyed with the bank, and may tend to disregard
J2EE
Beans were designed to be reusable components in an IDE: draggable, with the list of methods visible without access to the source code
EJB uses RMI
JB serves data requests in an orderly fashion (even if they arrive at the same time)
JavaBeans can be used in 3 ways:
in an IDE
in RMI
serialized (saving property values and state to disk allows moving)
Does a bean need a no-argument constructor?
It can have multiple constructors;
the compiler will create one if there isn't any
Oracle implicitly converts a quoted numeric string to a number on insert, so a quoted number is inserted as a normal number
Java: String str = null; // prints as "null"
Scope (life span): (or page, request, application)
J2EE 1.3
EJB: stateless session beans / stateful session beans. The former can be pooled and hold no client state (the client must provide it on every call); neither can survive a server crash
Entity beans (save state to a persistent store)
J2EE services: RMI, JDBC, JNDI, JTS (Java Transaction Service), JCA (J2EE Connector Architecture)
Access a remote EJB: Object robj = jndi.lookup("catalogEJB");
CatalogHome home = (CatalogHome) PortableRemoteObject.narrow(robj, CatalogHome.class);
catalog = home.create();
Struts and the service locator pattern both serve to reduce client complexity
???building a spider 800??? where is the link reading???
catbog: reads different companies' websites to determine the shipping status of a given shipping tracking number
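Xinwen Lianbo (新闻联播)
I made myself sit through the Xinwen Lianbo broadcast of the 24th, and somehow managed not to throw up.
This outlet really does live up to its name; better still, add one more character and call it 共党中央电视台 ("Communist Party Central Television").
Then again, the two characters 中央 ("Central") already seem to be shorthand for 中共中央 (the CPC Central Committee).
I suddenly saw "青年文盲" ("young illiterate") in the closing credits and got a shock: surely a director can't have a name that bold? On a second look it was actually "上官文青".
Three international news items: developments in Thailand (with no comparison to 他信), and the South China tiger (surely not a deliberate joke?). All the time went to showing how the Party instructs its "close comrades-in-arms", the participating parties.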
Monday, December 24, 2007
The Queen's 1957 Christmas Message
http://uk.youtube.com/theroyalchannel
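A very unobtrusive monarch. She conveys the idea that "we are recognized by our allies because we hold to the ideals we believe to be right." She says plainly that today she no longer makes laws and is only a representative of the people; there are also goodwill gestures that court one side (and hit at another).
This kind of conviction and continuous improvement matters so much. How much we ought to pursue it.
I happened upon some remarks by Merkel and only then realized how important it is to hold the moral high ground when using force, even if the morality is fictitious.
"A government's foreign and defence policy must be based on values and not interests," Merkel told the Bundestag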
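F5 Holds Forth on Network Security
Several topics were quite interesting. They describe their relationship with Cisco as that of competitor and partner at once.
They stress application security, and argue that SSL encryption gives deadly cover to attacks such as SQL injection: the traffic bypasses every security device, so these attacks can only be handled by application logic or by scanning.
(I looked it up and found there is a patent for detecting injection inside SSL-encrypted traffic.)
On which applications should get priority, they avoided the question (a smart strategy) and said instead: do endpoint security first, then (hinting at a portal) give each application different access rights. They consider the earlier approach (carving up VLANs) outdated, which was a useful reminder for me.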
Sunday, December 23, 2007
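易得方舟 and 视美乐
Who still remembers these names, which were all the rage on Beijing campuses in '99?
Today one of the authors of 《清华园的创业启蒙》 has become the CEO of 清科, the largest investment broker in the country, while 视美乐, which the other author left, has vanished without a trace.
It seems one should not go with an inexperienced investment firm..
>
The third contact came just a few days later. 兴业 presented a well-formed investment management advisory plan, proposed a venture capital fundraising model, laid out plans to help with financing, management, and growth, and set out both parties' responsibilities, rights, and interests in a proper manner. Very importantly, there was a side episode here: 兴业 was not 视美乐's only option at the time, but it responded quickly and operated by the book, and above all its idea of building a showcase, of making 视美乐 a "model opera" (样板戏), led 视美乐 to choose, without hesitation, this new company that was taking them on as its first project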
Tuesday, December 18, 2007
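Virtual Currency Exchange
I thought about this in passing last year and felt it ought to be a market..
Now I suddenly find that many websites and individuals have sprung up like bamboo shoots after rain.
Interesting.
There are also the very popular shopping cashback sites.
Would it be possible to build a personal virtual finance portal? A place for discussion, perhaps even offering interest or wealth management, and possibly folding in virtual assets such as website traffic and outsourced projects.
It is said that PayPal, too, originally set out only to solve the problem of kids without credit cards shopping online.
***
Postscript: I found that cn-usa is just such a site, and it even offers a marketplace for digital goods!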
Labels: Entrepreneurship
Sunday, December 16, 2007
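Fwd: Good Books, and Notes on Reading SQL 2005 DM
Beyond Java, by Bruce Tate, who also wrote Better, Faster, Lighter Java. He likes to show off his river paddling skills and then draw a clumsy analogy, so I basically skimmed straight past the opening of every chapter. The book is still quite interesting: the author analyzes the strengths and weaknesses of Java against other languages (mainly dynamic ones) and explores the future of programming languages. It is a thin little book, about 190 pages.
3 The Web services essentials book from O'Reilly. The first half is quite well written; the second half can basically be skipped, because the tools are simply too old.
4 XML Hacks, also from O'Reilly. 100 hacks about XML, very practical.
5 语义网原理与技术 (Principles and Technology of the Semantic Web). A book published in China. From the table of contents I expected more generalities, but after reading it I found it rather good. Many of my earlier puzzles and guesses were largely answered in this book. The author clearly understands the Semantic Web thoroughly. The book does contain many passages that others wore out long ago, including that extremely tired layered-architecture diagram of the Semantic Web, but it should still count as a good book.
6 单元测试之道 (Pragmatic Unit Testing), the second volume of The Pragmatic Starter Kit. It uses JUnit for its examples, so it reads quickly. If you don't know what unit testing is about, this book is quite good.
Data.Mining.with.SQL.Server.2005: this book is incomprehensible, and clearly not detailed enough to serve as a reference manual either. Do not read it.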
A case table is usually a dimension table, whereas a nested table is a fact table.
The concept of nested cases proposed in OLE DB for Data Mining is
extremely important. It helps you model complicated cases with one-to-many
relationships. It adds lots of expressive power for model building. Without the
nested case concept, you would need to pivot the nested table to case level
attributes during the data transformation stage.
Models built with nested cases:
Create mining model MemberCard_Prediction
(
CustomerId long key,
Gender text discrete,
Income long continuous,
MemberCard text discrete predict,
Purchase table (
ProductName text key,
Quantity long continuous
)
)
Using Microsoft_Decision_Trees
In DMX, a mining model is considered the same as a relational table. Conceptually,
a trained mining model can be considered a truth table.
That is, a single cell can itself be a table selected by TopCount
As the result of TopCount is a table column, we can apply a subselect clause
to the result of the TopCount function (or any prediction function that returns a table
column). The following is an example:
Select CustomerId, Gender, (Select MemberCard, $Probability as Proba
From TopCount(PredictHistogram(MemberCard), $Probability, 2) As
ProbabilityHistogram).
Use Natural Prediction Join to avoid specifying the column mappings
If the column names of the input case match the column names of the mining
model, you don't need to specify the On clause in the Prediction Join query.
Instead, you can use Natural Prediction Join. It works for both the batch
prediction query and the singleton prediction query. For example:
Select CustomerId, MemberCard
From MemberCard_Prediction Natural Prediction Join
(Select 'Male' As gender, '35' As age, 'Engineer' As
Profession, '60000' As Income, 'Yes' As HouseOwner) As customer
SQL Server Data Mining is a server-based solution: its data sources must be accessible from the server side.
In general, when mining on local data, you
should move the data to a SQL Server database using SQL Server Integration
Services (SSIS) before building your models.
Mining structures and mining models can be worked on in two modes, immediate and offline (the latter allows CVS, but changes must be published to the server)
The snowflake schema is the general case of the star schema: star schemas can be considered special kinds of snowflake schemas, where there is no lookup table.
Measures are the numeric values to be aggregated by the cube. Each measure specifies an aggregate function, such as Sum, Min, Max, Average, Distinct Count, and so on.
How a cube is built: A cube contains a set of dimensions and measures. There are two steps to processing
a cube: dimension processing (if a dimension hasn't been processed previously) and cube processing. Dimension processing reads dimension data
from underlying dimension tables, builds the dimension structure, creates
hierarchies, and assigns members to proper levels of the hierarchy. After all the
dimensions are processed, cube processing can be started.
The main task of cube processing is to precalculate aggregations based on
the dimension hierarchies. When there are many dimensions and each dimension
contains several levels and many members, the total number of aggregations
could be exponential.
One of the challenges of cube processing is to
choose the optimal number of aggregations to precalculate. Other aggregated
values can be derived from those precalculated measures efficiently. For example,
if the monthly Store_Sales values are preaggregated, quarterly and yearly
Store_Sales values can be derived easily.
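A small Java sketch of that last point: once monthly totals are precalculated, the coarser levels of the time hierarchy can be derived from them without touching the fact table again. The class and method names are hypothetical, and the array of 12 monthly Store_Sales totals is an assumed input, not Analysis Services' actual storage format.

```java
final class SalesRollup {
    // monthly holds 12 preaggregated Store_Sales totals, January first.
    static double[] quarterly(double[] monthly) {
        double[] quarters = new double[4];
        for (int month = 0; month < 12; month++) {
            quarters[month / 3] += monthly[month];   // months 0-2 -> Q1, 3-5 -> Q2, ...
        }
        return quarters;
    }

    // The yearly value is derived from the quarterly values, one level up the hierarchy.
    static double yearly(double[] monthly) {
        double total = 0;
        for (double quarter : quarterly(monthly)) {
            total += quarter;
        }
        return total;
    }
}
```

This works because Sum is additive across levels; a measure such as Distinct Count cannot in general be derived from lower-level aggregates this way.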
Unified Dimensional Model (UDM)
You can think of a UDM as a combination
of cubes and dimensions.
OLAP is useful for mining: in many cases, patterns can be found only in aggregated data. It is difficult to discover patterns directly from the fact table.
What is a data mining dimension?
For decision trees, the dimension members represent the tree nodes. The Model_Content schema rowset is like a
parent-child dimension, where each node has a parent node.
If you ask to create a data mining dimension in the DMW, after processing
the mining model, Analysis Services creates a new dimension based on the
mining model content.
The ADO libraries wrap the OLE DB
interfaces into objects that are easier to program against.
data, much as ADO was
created for native languages. The philosophy of ADO.NET is somewhat different
from that of ADO in that ADO.NET is designed to work in a “disconnected”
mode, where data can be accessed and manipulated without
maintaining an active connection
You should use ADOMD.NET when writing data mining client applications
except when .NET is not available.
ability you gain by using the ADOMD.NET object
model is the ability to iterate mining model content in a natural, hierarchical,
manner using objects instead of trying to unravel the flat schema rowset form.
Singleton (single-case prediction)
ADO.NET can do everything manually in DM
Using stored procedures avoids having the client query the server for, say, 1000 trees x 1000 nodes when only a few nodes match the filter condition
The clear advantage of ADOMD+ is that all of the content is available on
the server, and you can return only the information you need to the client. You
can call UDFs by themselves, using the CALL syntax or as part of a DMX query.
For example, the following query:
CALL MySprocs.TreeHelpers.FindSplits('Generation Trees','HBO')
Decision trees tend to overlook highly correlated attributes
In many cases, the Microsoft Decision Trees algorithm performs well for
association tasks. However, decision trees may ignore patterns in some cases.
For example, The Godfather, Godfather II, and Godfather III are highly
correlated. In the tree of Godfather III, the first split is on Godfather II. The
Godfather and Godfather II are highly correlated. Those who like Godfather II
also like The Godfather. Since the split on Godfather II is almost equivalent to
splitting on The Godfather, after the first split, there aren't any further splits on
The Godfather. The dependency network is generated based on the top three
levels of decision tree splits. As a consequence, no link exists between The
Godfather and Godfather III in the dependency network. For associative prediction
queries given The Godfather, the predicted results won't contain Godfather III,
since this pattern is covered by Godfather II.
This phenomenon is due to the nature of decision trees. It may hide
information if some input attributes are highly correlated. The general
recommendation for this is to try using several different algorithms.
Labels: Reading notes
Saturday, December 15, 2007
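A Share Price Rising at a 45-Degree Angle for 3 Years
That is the power of 200 million internet users..
Tencent, HK:700: the IPO was 158 times oversubscribed, yet the first-day gain was only 18%; the offer price was HK$3.7. Today it trades above HK$70
I think it can reach 100..
Out of personal likes and dislikes, I paid no attention to Baidu ("百毒") around the time of its IPO. Does that count as working against my own wallet and financial plans?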
Sunday, December 09, 2007
周末随想:爆破与价值模型
昨天偶然又一次搜到左拉的博克,对这个可爱的家伙(很像我一个故人)再次敬佩,用他独特的理念,他在行走。稚嫩的肩膀,在做比我多的多得探索。
对于蚂蚁事件,也许只能建议破产绝望要自杀的人,研究一些爆破物之类的东东,想好了。
跳楼服毒有鸟用啊,要爬起来,要抗争
价值模型请参见盖茨的哈佛演讲,所谓英雄所见略同也。
Friday, December 07, 2007
Wednesday, December 05, 2007
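Untitled
The Measures for the Administration of AIDS Prevention and Control at Ports (《口岸艾滋病防治管理办法》) were deliberated and adopted at the executive meeting of the General Administration of Quality Supervision, Inspection and Quarantine on May 30, 2007, and take effect on December 1, 2007
The Measures provide that Chinese citizens who have resided abroad for more than 1 year shall, upon entry, go to a port AIDS monitoring point set up by the inspection and quarantine authority for a health examination, or collect an AIDS test application form and undergo a health examination within 1 month at a port inspection and quarantine authority or a hospital at or above county level. Chinese citizens applying to leave the country for more than 1 year, and Chinese crew working on international conveyances, shall hold a valid health examination certificate, including AIDS test results, issued by an inspection and quarantine authority or a hospital at or above county level. Persons from abroad applying for residence in China shall undergo a health examination at an inspection and quarantine authority, and complete residence formalities with the public security organ on the strength of a valid health examination certificate, including AIDS test results, issued by that authority
It's been a long time since I cursed anyone online.
Do Hong Kong people count?
Notice of the Ministry of Health and the Ministry of Public Security on Chinese Citizens Submitting Health Certificates upon Exit and Entry
卫检字(89)第5号
(October 9, 1989)
To the health departments (bureaus), public security departments (bureaus), and health quarantine offices of all provinces, autonomous regions, and municipalities directly under the central government:
In accordance with the Detailed Rules for the Implementation of the Law of the People's Republic of China on the Control of the Exit and Entry of Citizens and the Detailed Rules for the Implementation of the Frontier Health and Quarantine Law of the People's Republic of China, the following provisions are made on matters concerning the health certificates that Chinese citizens must provide upon exit and entry:
1. When a Chinese citizen goes abroad on private business, the public security organ shall, on issuing the passport, hand the holder the "Health Quarantine Notice for Persons Exiting and Entering the Country" provided by the health quarantine office, to assist with publicity for infectious disease surveillance.
2. Citizens returning after residing abroad for more than three months, and overseas Chinese and compatriots from Hong Kong, Macao, and Taiwan approved to return to settle or work, must on entry present a health certificate from the health quarantine authority or a public hospital of the country or region concerned (the health certificate includes serological tests for AIDS and sexually transmitted diseases). Those without a health certificate shall undergo a health examination at a health quarantine authority after entry; the public security organ of their place of residence shall assist and urge them to be examined.
3. Persons approved to go abroad for labor service, study, or settlement, and others going abroad for more than one year, shall before departure go to a health quarantine authority for a health examination, vaccinations, and the issue of a health certificate. The health quarantine authority shall check the health certificates of such persons on exit and issue one to those who lack it. Those who have not completed these formalities may, depending on the circumstances, be prevented from leaving; where necessary, the frontier inspection station may assist.
4. Health examinations of Chinese citizens shall focus on identifying plague, cholera, yellow fever, AIDS, sexually transmitted diseases, and other infectious diseases; once any is found, the necessary measures must be taken.
5. The health documents of the People's Republic of China for persons exiting and entering the country are prescribed by the Ministry of Health of the People's Republic of China.
6. This notice takes effect on January 1, 1990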