Create two identical database source nodes (even when merging the same table); it can speed things up.
http://www.kdkeys.net/forums/thread/7255.aspx
As a small Clementine user tip, try using multiple identical database source nodes if you are merging database tables (even different rows of the same database table). Don't read the data through a single database source node and then use Clementine streams to split the data and re-join it later: that prevents SQL pushback and forces Clementine to write temporary data to disk.
Overlay the clustering results with gender, a given status, etc. to see the value of that attribute within each cluster.
As a general tip, I usually do clustering on customer behaviour data only (for me this is mobile or fixed-line phone usage) and then 'colour'/overlay the clusters by socio-economic attributes (age, household income, number of children, marital status, etc.) in order to understand what the customers in each cluster are like. You might also want to apply market research and surveys to samples of customers from each cluster to further enrich your understanding of the clusters - but only after clustering on customer behaviour alone. Well, that's my preference.
How do I check whether a result set contains duplicates: use a Distinct node, or an Aggregate node; a record count > 1 means a duplicate.
How do I get the set difference of two result sets: use an anti-join in the Merge node; note that the first set must be larger than the second for there to be any output.
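A minimal pandas sketch of these last two checks (outside Clementine; the key column `id` is purely illustrative):

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 2, 3]})
b = pd.DataFrame({"id": [2, 3]})

# Duplicate check: aggregate on the key and keep groups with a record count > 1.
counts = a.groupby("id").size().reset_index(name="record_count")
dupes = counts[counts["record_count"] > 1]

# Set difference (anti-join): rows of `a` whose key never appears in `b`.
anti = a.merge(b, on="id", how="left", indicator=True)
anti = anti[anti["_merge"] == "left_only"].drop(columns="_merge")

print(dupes)
print(anti)
```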
Am I under the "Curse of Dimensionality"?
Right now my Clementine is running a process; I've been working 12 hours straight with just a few breaks to eat and go to the bathroom.
My objective is to assign a probability number to 1,750,000 records: the probability of acquiring a product next month. Each month, only 5,000 of those records acquire the product. I have 150 usable fields, with many different kinds of distributions, storage classes, types, and so on. I even have one set field with 100 different values.
One of the things I've done is combine 4 months of history, getting about 500 fields (there is no relevant history for some). I derived new fields (a very, very long task), for example the mean of the 4 months, the delta between the mean of the first 3 months and the last month, and set fields with, say, 8 values describing the historical behaviour of flag fields (patterns like 1-0-0-1 or 0-1-1-0), and I arrived at about 250 fields. Using Feature Selection I could screen out half of those 250 (importance: Important) and ended up with 125 fields for a neural network.
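As a rough sketch of the kind of derived fields described above (not the actual stream; pandas is assumed, with invented monthly columns `usage_m1..usage_m4` and flag columns `flag_m1..flag_m4`):

```python
# Hedged sketch of the derived fields: monthly mean, delta between the mean of
# the first 3 months and the last month, and a flag-history pattern field.
import pandas as pd

df = pd.DataFrame({
    "usage_m1": [10, 0], "usage_m2": [12, 0], "usage_m3": [11, 5], "usage_m4": [20, 6],
    "flag_m1": [1, 0], "flag_m2": [0, 1], "flag_m3": [0, 1], "flag_m4": [1, 0],
})

usage = df[["usage_m1", "usage_m2", "usage_m3", "usage_m4"]]

# Mean of the 4 months.
df["usage_mean_4m"] = usage.mean(axis=1)

# Delta between the mean of the first 3 months and the last month.
df["usage_delta"] = df["usage_m4"] - usage.iloc[:, :3].mean(axis=1)

# Encode the 4-month flag history (e.g. 1-0-0-1) as a single set-valued field.
flag_cols = ["flag_m1", "flag_m2", "flag_m3", "flag_m4"]
df["flag_pattern"] = df[flag_cols].astype(str).agg("-".join, axis=1)

print(df[["usage_mean_4m", "usage_delta", "flag_pattern"]])
```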
I can use the neural network to get 88% accuracy with the 5,000 buyers and 20,000 sampled non-buyers... but of course, when I test the model on the whole database it's just useless: I get many, many non-buyers with a near-to-1 value in the probability field (softmax method).
I really don't know what to do!!! I can't even try to find correlations between fields because there are so many; it's very stressful work. Imagine how anyone would end up after looking at a graph for every pair of 125 fields. And believe me, I've tried, but I've got nothing.
I thought of using PCA/Factor, but using it without any data preparation I get no improvement in the modeling. And if I want to normalize every single variable to a 0-1 scale... remember I told you I have them in all flavours; I would go mad. I can't even bin, because I don't get stable categories and I can't trust that I'd get the same categories for another period of months.
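For reference, the "normalize to 0-1, then reduce dimensionality" idea can be sketched with scikit-learn as a stand-in for Clementine's PCA/Factor node; `X` below is just a placeholder matrix for the 125 screened fields:

```python
# Hedged sketch: scale every field to [0, 1], then keep enough principal
# components to explain ~90% of the variance. X is synthetic placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 10))                      # 10 hidden drivers
X = latent @ rng.normal(size=(10, 125)) + 0.1 * rng.normal(size=(1000, 125))

pipeline = make_pipeline(MinMaxScaler(), PCA(n_components=0.90))
X_reduced = pipeline.fit_transform(X)

print(X_reduced.shape)   # roughly 10 components instead of 125
```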
I've already described the magnitude of my database... am I doing something wrong? What would you do in my place? Even if I managed to reduce the dimensionality of the fields to a number reasonable for a single person (me) to explore... is it really possible to build a good model in the context I described?
I'll repeat it: I have data for every month. About 1,750,000 records. About 5,000 of those buy the product "A" (for example, an insurance policy) every month. I have to assign each record a probability of buying the product next month, a probability good enough that if I say these 100,000 clients have a 50% chance of buying it, about 50,000 actually buy.
If you watch Lost... and remember Hurley needing someone to tell him he was cursed... well, I'm feeling just like him. I can't deal with this.
Thank you very much.
05-23-2007, 18:35 7259 in reply to 7255
TimManns
Re: ¿Am I under the "Curse of Dimensionality"?
Hi,
Always tricky to help with these types of questions...I'll try...
What does the gains or lift chart of the scored neural net model look like?
It sounds like you are doing similar stuff to some of my monthly tasks. Your data processing steps sound exhaustive (in a good way) and everything sounds sensible.
I run prediction models whereby I assign a probability of churn, and also probabilities of churning to each specific competitor (we have maybe just 3 competitors; customers of a certain age, demographic and behaviour profile are more likely to go to certain competitors). I don't get brilliant results for the competitor probabilities, but the gains charts are acceptable. In my gains charts, at the 25-30% customer base point we have a gain of 60% (that's double the random 30%). Our lift charts are pretty good.
To be honest, does it really matter what your classification prediction is? As long as your top n% of scored data has a much higher incidence of the correct outcome. I have some projects where my scored data predicts 20 or 30% incidence of my outcome, but only 3 or 5% incidence actually occurs (I'm talking about churn, btw). The top 5% of my scored data contains a very high proportion of the actual outcomes, so I'm happy. Our marketing campaigns only ever use at most the top 10%. My model accuracy over the whole base is maybe just 65-70% because I'm over-predicting (false positives).
If you are getting a good-looking lift or gains chart, then use it to present your results. Any campaign to target customers should selectively contact the top n% of your base. If your top 3% of probabilities actually contains a lot of the target outcomes, then you are doing fine.
In your case, you don't have to target every customer predicted above a 50% chance. If you know that about 50,000 buys should occur, then simply sort by probability to buy in descending order and target your top 50,000 customers (or 100k to catch any leftovers).
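A small sketch of the "sort by score and take the top n" approach, together with a simple cumulative-gains table; it assumes a scored pandas DataFrame with a hypothetical probability column `p_buy` and, for evaluation only, the eventual outcome `bought` (the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
scored = pd.DataFrame({"p_buy": rng.random(n)})
scored["bought"] = rng.random(n) < scored["p_buy"] * 0.01   # toy outcome for evaluation

# Rank the whole base by predicted probability, highest first.
scored = scored.sort_values("p_buy", ascending=False).reset_index(drop=True)

# Target list: simply the top 50,000 customers by score.
target = scored.head(50_000)
print(len(target))

# Cumulative gains: share of all buyers captured by each top decile.
scored["decile"] = np.minimum(np.arange(n) // (n // 10), 9) + 1
gains = scored.groupby("decile")["bought"].sum().cumsum() / scored["bought"].sum()
print(gains)   # decile 1 capturing well above 10% of buyers indicates lift
```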
Weighting CART and using misclassification costs in C5 might help you use a little more data when building your model, and could help ensure you produce probabilities and predictions near the actual level of incidence. Using slightly less balanced data for a neural net could also help, but watch out for the neural net just giving one outcome.
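Clementine's C5 cost matrix and CART weighting options are not reproduced here, but as a rough analogue the same idea, making a missed buyer cost more than a false positive, can be sketched with scikit-learn's `class_weight` on a CART-style tree (synthetic data):

```python
# Hedged analogue of cost-sensitive tree building: class_weight makes the tree
# treat a misclassified buyer (class 1) as ten times as costly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(25_000, 10))
y = (rng.random(25_000) < 0.2).astype(int)    # ~20% "buyers", as in the balanced sample

tree = DecisionTreeClassifier(max_depth=6, class_weight={0: 1, 1: 10}, random_state=0)
tree.fit(X, y)

print(tree.predict_proba(X[:5])[:, 1])        # probabilities used for ranking
```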
I hope this helps a little...
Tim
05-23-2007, 21:54 7260 in reply to 7259
Arkantos
Re: ¿Am I under the "Curse of Dimensionality"?
Tim, I'm very, very thankful for your help; it's really good to have someone on the other side answering questions.
Before I found this forum (...before I found you) I could just read books and whatever I could find on the net, but it's so, so good to have a teacher.
Again, BIG THANKS. I'll get to work and I'll post again.
07-08-2007, 7:05 7322 in reply to 7260
JeffZanooda
Re: ¿Am I under the "Curse of Dimensionality"?
Did you adjust for the difference in response rate between your training sample and the entire population? Your population response rate is 5,000/1,750,000 = 0.29%, while your sample response rate is 5,000/25,000 = 20%. Otherwise the model will overestimate the probability of response.
For logistic regression this is usually done by adjusting the intercept. Alternatively, if your software allows it, you can attach a weight of 70 to the non-responders.
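A small sketch of that correction applied to the scored probabilities rather than the intercept, using the sample sizes from this thread: all 5,000 responders are kept while non-responders are heavily downsampled, so the sample odds are inflated by the ratio of the two sampling fractions, and dividing that ratio back out recovers a population-scale probability:

```python
# Hedged sketch of the oversampling correction, using the figures quoted in
# this thread: 5,000 of 5,000 responders kept, 20,000 of 1,745,000 non-responders.
def correct_oversampling(p_sample, resp_frac=1.0, nonresp_frac=20_000 / 1_745_000):
    """Map a probability scored on the balanced sample back to the population
    scale (equivalent to shifting the logistic intercept)."""
    odds_sample = p_sample / (1.0 - p_sample)
    odds_pop = odds_sample * (nonresp_frac / resp_frac)
    return odds_pop / (1.0 + odds_pop)

print(correct_oversampling(0.20))   # ~0.0029, i.e. the 0.29% population rate
print(correct_oversampling(0.95))   # a "near-to-1" sample score drops to ~0.18
```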
07-08-2007, 19:47 7323 in reply to 7322
Arkantos
Re: ¿Am I under the "Curse of Dimensionality"?
Jeff: thanks for your answer.
I'm using neural networks, and I can adjust Alpha, Initial Eta, High Eta, Eta Decay and Low Eta. Would adjusting any of these parameters help teach the model to expect much less response in real deployment?
I've been trying several response rates for the modeling step (always keeping the 5,000 true values but changing the number of false values), and I found that the best models come from the lowest response rates (the closest to the real one). Simple logic tells me that if I use even more false values I should get better models. In the future I will try this just to see the result, as there are two drawbacks:
1) I get a very small number of true values in the prediction.
2) The modeling time increases exponentially.
Drawback number one is easy to get around: I shouldn't care whether I get true values in the prediction. As long as my gains charts are better than the others, it's a good model; I just need the confidence value to classify the entries. Drawback number two is also easy to get around: press Execute and go to sleep. So I guess I'll be trying this as soon as possible.
Thanks again.
Best regards.
07-08-2007, 21:21 7327 in reply to 7323
TimManns
Re: ¿Am I under the "Curse of Dimensionality"?
This is an FYI...
I build a predictive model to identify likely churners for the subsequent month. I score a proportion (the higher spenders) of our mobile customer base every month (for example, on June 20th, forecasting churn for the whole month of July). Let's say this is approx 3 million rows. The customers with the highest churn scores are contacted with a retention offer within the next few days. This might be only 10k customers, depending upon the available budget and workload. Contact methods change too: sometimes the customer may be called, other times a letter is sent. Our campaign delivery team are very fast, so we have the luxury of a quick turnaround.
The model (a neural network) was built months ago using a sample of approx 10k churners and 20k active (random) customers. It is important that you balance the data prior to building the model. I normally update (re-build) the predictive model every few months as necessary. My current model has been performing well for 5-6 months now because we have not had any big changes in our market.
My predictive churn model predicts approx 8% churn each month. This is far higher than our actual churn rate, but the churn score is used to order the customers by 'churn probability', and this places the most likely churners at the top of the list. For that reason it doesn't matter too much that the predictive model overestimates churn incidence.
I don't play around with the neural network options much, preferring instead to apply comprehensive data manipulation. The final data set that I present to the predictive model is several hundred columns wide, containing an array of transformed call-usage and account information. The data processing for my analysis takes approx 4 hours and involves accessing tables containing 70 million rows of call usage data per day. After the data transformations are complete, scoring the customer base through the neural network takes approx 10 mins, because I have configured my Clementine stream to run the neural network as SQL so that all the processing load occurs in our Teradata warehouse (we have a huge DB system). I always ensure that any single analysis job can be completed within a working day; otherwise we consider it infeasible.
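As a very loose illustration of the "push the scoring into the warehouse" idea (not Clementine's actual SQL generation), the scoring can be expressed as a single SQL statement run inside the database rather than pulling rows out; the connection string, table and column names, and the toy logistic formula below are all placeholders:

```python
# Hypothetical in-database scoring: the warehouse computes the scores, so only
# one SQL statement crosses the wire. All names and the formula are made up.
from sqlalchemy import create_engine, text

engine = create_engine("teradatasql://user:password@warehouse-host")  # placeholder DSN

score_sql = text("""
    CREATE TABLE churn_scores AS (
        SELECT customer_id,
               1.0 / (1.0 + EXP(-(0.8 * calls_delta_3m - 1.2 * topup_mean_3m + 0.3))) AS churn_score
        FROM   customer_features
    ) WITH DATA
""")

with engine.begin() as conn:   # runs entirely inside the database
    conn.execute(score_sql)
```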
It sounds as though you have a similar process in place :)
I hope this helps
Tim