Data Source
NA
Organizer
Gary Saarenvirta, formerly of The Loyalty Group, now with IBM Canada
Introduction
Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases and using it to support tactical and strategic business decisions.
 

The statistical techniques of data mining are familiar: linear and logistic regression, multivariate analysis, principal components analysis, decision trees, and neural networks. Traditional approaches to statistical inference fail with large databases, however, because with thousands or millions of cases and hundreds or thousands of variables there will be a high level of redundancy among the variables, spurious relationships will appear, and even the weakest relationships will be highly significant by any statistical test. The objective is to build a model with significant predictive power; it is not enough to find which relationships are statistically significant.
 

Consider a campaign offering a product or service for sale, directed at a given customer base. Typically, about 1% of the customer base will be "responders," customers who will purchase the product or service if it is offered to them. A mailing to 100,000 randomly chosen customers will therefore generate about 1000 sales. Data mining techniques enable customer relationship marketing by identifying which customers are most likely to respond to the campaign. If the response rate can be raised from 1% to, say, 1.5% of the customers contacted (a "lift" of 1.5), then 1000 sales could be achieved with only about 66,667 mailings, reducing the cost of mailing by one-third.
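The mailing-cost arithmetic above can be checked directly. This is a minimal sketch using the figures from the example (a 1% base response rate and a target of 1000 sales); the function name is ours, not part of the case study.

```python
def mailings_needed(target_sales, response_rate):
    """Mailings required to reach target_sales at the given response rate."""
    return target_sales / response_rate

base = mailings_needed(1000, 0.01)     # random mailing: 100,000 pieces
lifted = mailings_needed(1000, 0.015)  # targeted mailing: about 66,667 pieces
saving = 1 - lifted / base             # fraction of mailing cost avoided

print(round(base), round(lifted), round(saving, 3))
```

With a lift of 1.5, the targeted campaign needs only two-thirds of the mailings, which is exactly the one-third cost reduction claimed above.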
 

Suggestions for Analysis


We expect that you will be using Splus or SAS for the analysis; however, not all of the methods suggested here are readily available in Splus. If you have a SAS licence, the Enterprise Miner module will conveniently automate many of the analyses, and you may be able to get an evaluation copy inexpensively from SAS. IBM's Intelligent Miner is also recommended, but it is less likely to be available to you.
 

For all the analyses below, you should create a training set and a validation set. Because the data were stratified to 50/50, you should create an unstratified validation set with the original proportion of about 1% "true" for the objective variable. You would, of course, get better validation sets if you had the complete sample of around 100,000 accounts, 99% of them non-responders, but that file is too large for us to distribute conveniently. Validation sets constructed from the 50/50 stratified sample should be adequate for the purposes of this exercise.
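One way to build the unstratified validation set is to keep all non-responders in the holdout and subsample responders until they form roughly 1% of the result. This is a sketch only; the record layout (a list of `(features, response)` pairs) and the function name are our assumptions, not part of the data files.

```python
import random

def make_validation_set(holdout, target_rate=0.01, seed=0):
    """Subsample responders in a 50/50 holdout until they form target_rate of it."""
    rng = random.Random(seed)
    responders = [r for r in holdout if r[1] == 1]
    nonresponders = [r for r in holdout if r[1] == 0]
    # Solve n_resp / (n_resp + n_non) = target_rate for n_resp.
    n_resp = round(target_rate * len(nonresponders) / (1 - target_rate))
    validation = nonresponders + rng.sample(responders, min(n_resp, len(responders)))
    rng.shuffle(validation)
    return validation

# Toy example: a 50/50 holdout of 400 records keeps roughly 2 responders.
holdout = [((i,), 1) for i in range(200)] + [((i,), 0) for i in range(200)]
val = make_validation_set(holdout)
rate = sum(r[1] for r in val) / len(val)
print(len(val), round(rate, 3))
```

Note that a validation set built this way contains very few responders, which is why the complete 100,000-account sample would give more stable validation results.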
 

Your results should be plotted on a gains chart, either tabular or graphical. A gains chart is a plot of the % of the responders reached (ordinate) against the % of the customer base contacted (abscissa). If the campaign is directed at randomly-chosen individuals the plot will be a straight line with unit slope through the origin. If the campaign preferentially targets responders, the gains curve will lie above the diagonal except, of course, at 0% and 100% where it necessarily touches.
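The tabular form of a gains chart can be computed directly from model scores on the validation set: sort customers by score, highest first, and at each cut-off report the cumulative % of customers contacted against the cumulative % of responders reached. A minimal sketch, with names and toy data of our own invention:

```python
def gains_table(scores, responses, steps=10):
    """Return (pct_contacted, pct_responders_reached) pairs at `steps` cut-offs."""
    ranked = sorted(zip(scores, responses), key=lambda sr: -sr[0])
    total_resp = sum(responses)
    n = len(ranked)
    table = []
    for k in range(1, steps + 1):
        cut = round(k * n / steps)
        reached = sum(r for _, r in ranked[:cut])
        table.append((100 * cut / n, 100 * reached / total_resp))
    return table

# Toy example: a perfectly informative score on 10 customers, 2 responders.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2]
responses = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
for pct_mailed, pct_reached in gains_table(scores, responses, steps=5):
    print(f"{pct_mailed:5.1f}% mailed -> {pct_reached:5.1f}% of responders")
```

A random model would give pairs lying on the diagonal (20% mailed, 20% reached, and so on); the perfect score above reaches 100% of responders at the first cut-off.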
 

The performance of a predictive model is measured by the % of responders reached when 10%, 20%, or 30% of customers are mailed. A good model will reach 1.5 to 3.5 times as many responders as random selection over this range; for example, mailing to 10% of the customer base will reach 15% to 35% of the responders. Less than this means the data are not very predictive; more than this likely means that you have overfitted or that there is a strong bias in the data.
 

Some things you could try with these data include:

  1. Try some simple linear correlations, Spearman and Pearson, against the objective variable and reduce the number of variables. With the reduced set of variables, build logistic regression models. Don't forget to remove collinear variables.
  2. Break the variables into blocks of 10-20 and build logistic models on each of the blocks. After all the models are built, pool the variables that were left in the models, create new blocks of 10-20, and repeat until only one block of variables is left. Don't forget to remove collinear variables.
  3. Create PCA factors from the set of variables (don't include the objective variable!). Select a reduced set of factors (using the cumulative % of variation explained) and build a model from them. Compare this result with using all the factors, noting the effect of overfitting. Don't forget to remove collinear variables.
  4. Perform a varclus with all variables. This procedure clusters the variables into hierarchical groups using the PCA factors. Select variables from the bottom level of the hierarchical groups and build a logistic model. Don't forget to remove collinear variables.
  5. Create multiple training and test samples. Use bootstrapping to estimate the error bounds on the model coefficients and on gains chart performance. Try sampling with and without replacement to see how sensitive logistic regression is to the data set configuration.
  6. Use SAS to construct a radial basis function (RBF) regression. Use all the above methods to reduce the variable set and compare the RBF results to logistic regression.
  7. It is possible to implement a decision tree in SAS using the CART algorithm. Run this algorithm against all the variables. Build multiple training sets using sampling with replacement; this should improve the tree performance by a few percent.
  8. Other modeling techniques to try include neural networks and genetic algorithms.
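The bootstrap idea in suggestion 5 can be sketched without any modelling machinery: resample the validation records with replacement and recompute a gains-chart statistic (here, the % of responders reached when mailing the top 10% by score) to estimate its spread. The scoring here is a stand-in for a fitted model, and all names and toy data are our assumptions.

```python
import random

def top_decile_capture(records):
    """% of responders reached by mailing the top 10% of (score, response) records."""
    ranked = sorted(records, key=lambda sr: -sr[0])
    cut = max(1, len(ranked) // 10)
    total = sum(r for _, r in ranked) or 1
    return 100 * sum(r for _, r in ranked[:cut]) / total

def bootstrap_capture(records, n_boot=200, seed=1):
    """Rough 90% percentile interval for the top-decile capture statistic."""
    rng = random.Random(seed)
    stats = sorted(top_decile_capture([rng.choice(records) for _ in records])
                   for _ in range(n_boot))
    return stats[int(0.05 * n_boot)], stats[int(0.95 * n_boot)]

# Toy data: 1000 records, 50 responders, score weakly related to response.
rng = random.Random(0)
records = [(rng.random() + 0.5 * resp, resp)
           for resp in [1] * 50 + [0] * 950]
lo, hi = bootstrap_capture(records)
print(f"90% interval for top-decile capture: {lo:.1f}% to {hi:.1f}%")
```

The same resampling loop, wrapped around an actual model fit, gives error bounds on coefficients as well as on gains-chart performance.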

All model results should be analyzed for gains chart performance with the following measures:

  1. What fraction of the responders in the customer base is reached, compared to random, for campaigns mailed to 10%, 20%, or 30% of all customers? At these points, random mailing would reach 10%, 20%, or 30% of responders, respectively. Most campaigns are mailed to 10% to 30% of the customer base; good models can achieve 1.5 to 3.5 times the random rate in this range.
  2. Monotonicity and smoothness: do the response rates by quantile group form a smoothly decreasing profile? Any waviness is indicative of bias, overfitting or unmodelled effects.
  3. Ease of model explanation. It is very important for prospective clients to understand why the model is working!
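Measure 2 can be checked mechanically: compute the response rate within each score decile (best-scored group first) and verify that the rates decrease smoothly. A sketch, with a toy score of our own invention:

```python
def decile_response_rates(scores, responses):
    """Response rate within each score decile, best-scored decile first."""
    ranked = sorted(zip(scores, responses), key=lambda sr: -sr[0])
    n = len(ranked)
    rates = []
    for k in range(10):
        group = ranked[round(k * n / 10):round((k + 1) * n / 10)]
        rates.append(sum(r for _, r in group) / len(group))
    return rates

def is_monotone_decreasing(rates, tolerance=0.0):
    """True if each decile's rate is no higher than the previous one's."""
    return all(a >= b - tolerance for a, b in zip(rates, rates[1:]))

# Toy example: all responders concentrated in the top-scored decile.
scores = [i / 100 for i in range(100)]
responses = [1 if s > 0.9 else 0 for s in scores]
rates = decile_response_rates(scores, responses)
print(rates[0], is_monotone_decreasing(rates))
```

A wavy, non-monotone profile on real validation data is the warning sign described above: bias, overfitting, or an unmodelled effect.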


Resources

You can get copies of some of Gary Saarenvirta's work online at www.db2mag.com. He has also written The Intelligent Miner for Data Applications Guide, found at www.redbooks.ibm.com.
 

An Internet search on "data mining" will find a number of commercial products similar to Intelligent Miner and Enterprise Miner.
Research Question

NA
 

Variables

Each case represents one account. The account numbers have been removed.

The objective variable is a response variable indicating whether or not a customer responded to a direct mail campaign for a specific product. "True" or "response" is coded 1; "false" or "non-response" is coded 0.

The data were extracted from a much larger data set whose response rate was about 1%. We took the 1,079 responders from that group, together with 1,079 randomly chosen non-responders, for a total of 2,158 cases.

The file contains 200 explanatory variables: v137, v141, and v200 are indicators of gender, "male," "female," or "unknown," respectively; v1-v24, v138-v140, and v142-v144 are recency, frequency, and monetary data for specific accounts; v25-v136 are census variables; and v145-v199 are "taxfiler" demographic variables derived from tax returns. Most of the variables have been normalized.

A table with some variable descriptions is attached. Some product-specific variables have been masked. "p##" denotes a product, "rcy" denotes recency (the number of months since the most recent transaction), "trans" denotes the number of transactions, and "spend" denotes the dollar amount spent. For example, p01rcy is the recency for product 1. Note that a recency of zero means that the account was active for the product in question in the most recent month. "Never active" would be indicated by the largest possible recency value, based on the first month for which the company collected data.

The census and taxfiler variables are summary statistics for the enumeration area in which the account holder's address is located. They generally give totals or averages of the numbers of individuals, families, or dollars in the indicated categories. A table with some taxfiler variable descriptions is attached. You should be able to guess the meaning of most census variables from their names, but tables with longer variable descriptions are attached: Group "a" and Group "b" are presented separately. Do not hesitate to contact us if you are in doubt.

You can download the data as an Excel 97/98 workbook gary.xls (5.9 MB), as an Excel 97/98 workbook compressed into a ZIP archive gary_xls.zip (2.4 MB), or as text files in a ZIP archive gary.zip (1.3 MB).

gary.zip contains two text files. The data are in a fixed-width ASCII file Sasvar.txt, and the description of the data files is in imtrans.txt. If you decide to work with the text files, you MUST use the column positions from imtrans.txt to import the data into SAS or Splus, because some columns are contiguous. Watch out for line endings; if, for example, you unzip the text files under UNIX, you will have to allow for a newline character at the end of each line when you compute the record length.

Data Files