Thursday, June 19, 2008

A very good data mining discussion and resource site

clementine+svm on google.com.cn

KDkeys.net: discussion forum

Agreed. I'm getting a classification result of $null instead of 0 or 1 for a flag target variable. Does anybody know what that means? Does it mean the SVM was unable to classify the record, so it reports "null"?

Well, I am mostly an R and WEKA user, plus I write code for the tricky stuff.

But a few corporate types I have talked to have been muttering Clementine, and I'd like to get a handle on it

a) does it run on a normal PC?

b) can I get an eval?

c) is it GUI, or command driven?

d) if command/language driven then is the syntax like SPSS ?

e) is it aimed at non programmers?

f) is it aimed at non statisticians?

g) does it have any special algorithms in it that I am unlikely to find elsewhere? like ..

h) does it cope with sparse data? (and for instance read Weka's arff format)

i) come to that, does it have a good support vector machine in it?

j) is it really and truly scalable .. if I have 100 million records, and I want to do some bootstrapping or build decision forests, is it going to be up to the task?

Enquiring minds wish to know.

btw, has anyone tried YALE .. a gui front end to weka?


When I was at Bell Canada (just three weeks ago, but it already seems like ages) I had a demo of "Clementine for the Web" circa 2004. Maybe you know that that module was the old NetGenesis, which was all the hype 6 or 7 years ago. Unfortunately for them, going with SPSS slowly pushed them into oblivion. I believe the main reason was price. If you wanted to do Web Analytics, it was WAY TOO expensive (we didn't know anything about Visual Sciences back then). But if you already were into BI and data mining, it WAS darn cool.

I could answer several of your questions, but I will try to convince SPSS to respond themselves. If they don't, I'll tell you what I know (which is what I saw 3 years ago, though).

I’m very glad to hear that we are getting mentioned, even in a mutter! (-:
This is a personal reply, not an official reply from SPSS. I have not checked the technical accuracy of every statement (-:

> a) does it run on a normal PC?
Yes. Also there is an add-on server component which runs on various server platforms.
See http://www.spss.com/clementine/system_req.htm for platforms supported by the most recent release, and also the link on this page to previous releases supporting a wider range of server platforms.

> b) can I get an eval?
We don't do a downloadable eval, but if you're interested please contact your local SPSS office (see http://www.spss.com/worldwide/).

> c) is it GUI, or command driven?
It is GUI driven. The data miner creates an executable diagram of the data mining process they want to perform. These diagrams (we call them "streams") are used interactively, in an iterative fashion, to explore the data and build the correct process.

> d) if command/language driven then is the syntax like SPSS ?
There is a scripting language for automating repetitive processes.
You can also execute SPSS syntax within a Clementine stream.

> e) is it aimed at non programmers?
It is aimed at non-programmers, but programmer/non-programmer is perhaps not the most useful distinction here. Clementine is designed for users who want to focus on solving a problem, and finding useful things in the data, rather than on the technical details of algorithms or data management. You can access advanced algorithm features, and you can do complex data management, but you are not forced to see these things all the time. I used to do data mining by cutting code. Now I do it with Clementine, many times faster.
Our customers also comment that they find this much faster to set up than other methods, and also faster to run. It also makes data mining analyses more re-usable – you can open a stream diagram and modify it to meet today’s requirements or to take into account what you have learned. Many of these comments come from people who would have no trouble cutting code – the way that Clementine organizes things is just a whole lot more convenient.

> f) is it aimed at non statisticians?
It is aimed at non-statisticians, but you can use Clementine in conjunction with SPSS statistical products if you need a rich set of statistical tools. Increasingly, Clementine is including techniques beloved of statisticians – logistic, factor/PCA, discriminant, GLM…

> g) does it have any special algorithms in it that I am unlikely to find elsewhere? like ..
Unique algorithms are not the main point of Clementine, but it does have a few less familiar ones - here's a selection:
TwoStep clustering (good for deciding the right number of clusters)
Anomaly detection (based on TwoStep)
GRI (association rules based on Jason Mallen's CUPID)
Sequence (sequential association)
Decision List (interactive rule-building)
Binary classifier (not really an algorithm, rather an automated way of trying many algorithms and parameter settings in one shot)
I understand that our GLM (generalized linear modeling) is also relatively uncommon in data mining tools.

One other notable point about algorithms: if you have access to the in-database algorithms of Microsoft, Oracle or IBM, Clementine can drive many or most of these algorithms as though they were native to Clementine.

> h) does it cope with sparse data? (and for instance read Weka's arff format)
Interesting question. I don't believe we've ever used anything like a Sparse ARFF file specifically. When dealing with sparse data I tend to use ID,Attribute,Value triplets. Clementine has very rich data manipulation functions, so it's common to switch between different representations as the need arises. This flexibility is one of the things that attracts people to Clementine.
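
To illustrate the ID,Attribute,Value triplet idea in general terms, here is a minimal Python sketch. It is not Clementine code, and the function names (wide_to_triplets, triplets_to_wide), the "id" field name and the choice to treat 0 and None as "absent" are all assumptions made for this example.

```python
from collections import defaultdict


def wide_to_triplets(rows, id_field="id"):
    """Convert wide records (dicts) to (id, attribute, value) triplets.

    Assumption for this sketch: a value of 0 or None means "absent",
    so it is simply not stored.
    """
    triplets = []
    for row in rows:
        for attr, value in row.items():
            if attr == id_field or value is None or value == 0:
                continue
            triplets.append((row[id_field], attr, value))
    return triplets


def triplets_to_wide(triplets, attributes, id_field="id", fill=0):
    """Pivot triplets back to wide records, filling missing cells.

    Note: an ID with no stored triplets does not reappear here,
    because the triplet form has no row to remember it by.
    """
    by_id = defaultdict(dict)
    for rec_id, attr, value in triplets:
        by_id[rec_id][attr] = value
    return [
        {id_field: rec_id, **{a: values.get(a, fill) for a in attributes}}
        for rec_id, values in by_id.items()
    ]


rows = [
    {"id": 1, "a": 0, "b": 3},
    {"id": 2, "a": 5, "b": 0},
]
triplets = wide_to_triplets(rows)
wide_again = triplets_to_wide(triplets, attributes=["a", "b"])
```

The triplet form stores one row per non-empty cell, which is why it suits sparse data. Pivoting back needs the attribute list supplied explicitly, since attributes that are empty everywhere leave no trace in the triplets.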

> i) come to that, does it have a good support vector machine in it?
Clementine does not have its own SVM at present, but some of SPSS's alliance partners provide SVMs that can be driven through Clementine (most notably Oracle's ODM).

> j) is it really and truly scalable .. if I have 100 million records, and I want to do some bootstrapping or build decision forests, is it going to be up to the task?
Yes. In particular Clementine Server allows you to leverage parallel hardware, and also pushes work back to (often highly scalable) database systems, making high-volume scoring very practical. We find that some large organizations are switching to Clementine (and away from other enterprise data mining offerings) because it is scalable in a way that others do not seem able to match. I can't speak for bootstrapping or decision tree forests specifically, but Clementine users find it relatively easy to set up complex analyses where each algorithm or model is just a small part of a larger process.

> Enquiring minds wish to know.
Please keep enquiring!

All the best,
