Knowledge Discovery in Databases MIS 637 Professor Mahmoud Daneshmand Fall 2012 Final Project: Red Wine Recipe Data Mining By Jorge Madrazo
Profound Questions • Which basic properties make up the formula for a good wine? – Wine making is believed to be an art, but is there a formula for a quality wine? – The provider of the data set published a paper, “Modeling wine preferences by data mining”. How do my results compare with the paper’s?
Procedure • Follow a data mining process • Use SAS and SAS Enterprise Miner to execute the process • SAS Enterprise Miner is modeled on SEMMA, the data mining process defined by SAS Institute – Sample, Explore, Modify, Model, Assess • SEMMA is similar to the CRISP-DM process
Sample • 1,599 records • Set up a data partition – Training 40% – Validation 30% – Test 30%
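As a rough illustration of the 40/30/30 partition, here is a minimal Python sketch (not the SAS Enterprise Miner Data Partition node). It assumes simple random sampling without stratification; the function name `partition` and the seed value are hypothetical.

```python
import random

def partition(n_records, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Shuffle record indices and cut them into train/validation/test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = round(n_records * fractions[0])
    n_valid = round(n_records * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

train, valid, test = partition(1599)
print(len(train), len(valid), len(test))
```

With 1,599 records this gives 640 training records, which matches the total of the training event classification counts reported later in the deck.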
Explore: Data Background • Data source – UCI Machine Learning Repository. • Wine Quality Data Set. – There are separate red and white wine data sets; I focused on the red wine set only. – There are 11 input variables and one target variable. » fixed acidity » volatile acidity » citric acid » residual sugar » chlorides » free sulfur dioxide » total sulfur dioxide » density » pH » sulphates » alcohol » Output variable (based on sensory data): quality (score between 0 and 10)
Explore: Target=Quality • Quality – People gave a quality assessment of different wines on a scale of 0-10. Actual range 3-8. – An ordinal target
Explore: Inputs
• Correlation Analysis – Some correlation, but not enough to discard inputs
• SAS Code:
    ods graphics on;
    ods select MatrixPlot;
    proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
      var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
    run;
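The quantity behind the correlation matrix is the Pearson coefficient for each pair of variables. A minimal Python sketch of that calculation follows; the two short lists are made-up illustrative values, not rows from the wine data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, for illustration only.
alcohol_demo = [9.4, 9.8, 10.5, 11.2, 12.0]
quality_demo = [5, 5, 6, 6, 7]
print(round(pearson(alcohol_demo, quality_demo), 3))
```

A pair of inputs with a coefficient near 1 or -1 would be a candidate for dropping one of them; the slide reports that no pair was correlated strongly enough for that.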
Explore: Correlation Graphs
Explore: Chi2 Statistics of Inputs
Explore: Worth of Inputs
Explore: Worth Graph • Worth tracks closely with the chi-square statistic
Modify • At this stage, no modifications are done
Model: Selection • Because I want to list the properties that matter most for a quality wine, I chose a Decision Tree • Configuration – The Splitting Rule is Entropy – Maximum Branch is set to 5 • This amounts to a C4.5-style algorithm
Assess: Initial Results • The result is a bushy tree that is too intricate for a simple recommendation. – Over 20 leaf nodes.
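To make the entropy splitting rule concrete, the sketch below computes the entropy of a set of class labels and the entropy reduction from a multiway split. It is a simplified Python illustration, not the Enterprise Miner implementation; note that C4.5 proper normalizes this gain into a gain ratio, which is omitted here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy reduction when `parent` is split into the given branches."""
    n = len(parent)
    weighted = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Toy example: a split that separates the two classes perfectly.
parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))
```

At each node the tree evaluates candidate splits (here with up to 5 branches) and keeps the one with the largest reduction.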
Modify: Target
• Change the target so that it becomes binary.
• New variable in the model called isGood. Any rating over 6 is categorized as isGood.
– SAS Code:
    data wino.xx;
      set wino.red;
      if (quality>6) then isgood=1;
      else isgood = 0;
    run;
    proc print data = wino.xx;
      title 'xx';
    run;
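The same recoding rule, sketched in Python for readers who do not use SAS. The function name `is_good` is hypothetical; the threshold (quality strictly greater than 6) is the one stated on the slide.

```python
def is_good(quality):
    """Return 1 for ratings above 6, otherwise 0 (same rule as the SAS data step)."""
    return 1 if quality > 6 else 0

# The observed quality range in the red wine set is 3 to 8.
scores = [3, 4, 5, 6, 7, 8]
flags = [is_good(q) for q in scores]
print(flags)
```

Only ratings of 7 and 8 map to isGood = 1, so the positive class is a small minority of the records.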
Explore: Target = isGood
Model Strategy for isGood • Model with a Decision Tree in the hope of more descriptive results. • Also model with a Neural Network to aid assessment and provide a comparison
Model: Decision Tree • ProbF splitting criterion at Significance Level 0.2 • Maximum Branch size = 5
Assess: Decision Tree Results • Much simpler Tree
Assess: Decision Tree Results 2 • Leaf Statistics
Assess: Variable Importance

Variable Name         Splitting Rules  Surrogate Rules  Importance   Validation Importance  Ratio of Validation to Training Importance
alcohol               1                0                1            1                      1
density               0                1                0.77055175   0.77055175             1
volatile_acidity      0                1                0.728868987  0.728868987            1
sulphates             1                0                0.671675628  0.477710505            0.711222032
fixed_acidity         0                1                0.553719729  0.393817671            0.711222032
citric_acid           0                1                0.549750361  0.390994569            0.711222032
free_sulfur_dioxide   0                0                0            0                      NaN
pH                    0                0                0            0                      NaN
chlorides             0                0                0            0                      NaN
total_sulfur_dioxide  0                0                0            0                      NaN
residual_sugar        0                0                0            0                      NaN

Event Classification Table

Data Role  Target  False Negative  True Negative  False Positive  True Positive
TRAIN      isgood  53              539            14              34
VALIDATE   isgood  (counts not shown)
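The training counts in the event classification table (53 false negatives, 539 true negatives, 14 false positives, 34 true positives) can be turned into the usual summary rates. This is a small Python sketch; the function name `classification_metrics` is hypothetical, and only the training row is used because no validation counts are listed.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all records classified correctly
        "precision": tp / (tp + fp),     # share of predicted-good wines that are good
        "recall": tp / (tp + fn),        # share of good wines that were found
    }

train_metrics = classification_metrics(tp=34, tn=539, fp=14, fn=53)
for name, value in train_metrics.items():
    print(f"{name}: {value:.3f}")
```

On the training data this works out to roughly 0.895 accuracy, 0.708 precision and 0.391 recall: the tree is right most of the time, but mostly because good wines are rare, and it finds well under half of them.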
Model: Neural Network • Positive – better at predicting • Negative – hard to interpret the model • Configured with 3 Hidden Nodes
Modify: Input Variables to NN • Because of the complexity of the NN, it is recommended to prune variables prior to running the network.
Modify: R² Filter

Variable Name         Role      Measurement Level  Reasons for Rejection
alcohol               INPUT     INTERVAL
chlorides             INPUT     INTERVAL
citric_acid           REJECTED  INTERVAL           Varsel:Small R-square value
density               INPUT     INTERVAL
fixed_acidity         INPUT     INTERVAL
free_sulfur_dioxide   INPUT     INTERVAL
pH                    REJECTED  INTERVAL           Varsel:Small R-square value
residual_sugar        REJECTED  INTERVAL           Varsel:Small R-square value
sulphates             INPUT     INTERVAL
total_sulfur_dioxide  REJECTED  INTERVAL           Varsel:Small R-square value
volatile_acidity      INPUT     INTERVAL
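The idea behind an R-square filter can be sketched as follows: compute the squared correlation of each input with the target and reject inputs that fall below a cutoff. This Python sketch is a simplification and not the Enterprise Miner Variable Selection node; the cutoff `min_r2`, the function names, and the toy columns are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between one input and the target."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column explains nothing
    return cov * cov / (var_x * var_y)

def filter_inputs(columns, target, min_r2=0.05):
    """Split input names into (kept, rejected) by their R-square with the target."""
    kept, rejected = [], []
    for name, values in columns.items():
        (kept if r_squared(values, target) >= min_r2 else rejected).append(name)
    return kept, rejected

# Made-up toy data: input_a tracks the target, input_b does not.
target = [0, 0, 1, 1]
columns = {"input_a": [1, 2, 3, 4], "input_b": [1, 2, 2, 1]}
kept, rejected = filter_inputs(columns, target)
print(kept, rejected)
```

In the project, this kind of screening left seven inputs for the neural network and rejected four.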
Model: NN • Specify 3 Hidden Units in the Hidden Layer
Assess: NN Results • The results are hard to interpret for formulating a recipe

The NEURAL Procedure
Optimization Results
Parameter Estimates

N  Parameter                Estimate   Gradient Objective Function
1  alcohol_H11              3.679818   -0.001411
2  chlorides_H11            0.520190   -0.000479
3  density_H11              -2.171623  0.000883
4  fixed_acidity_H11        -0.055929  0.000179
5  free_sulfur_dioxide_H11  0.403412   0.000139
6  sulphates_H11            -4.954290  -0.000224
7  volatile_acidity_H11     2.686209   0.000205
8  alcohol_H12              -0.313005  0.001209
9  chlorides_H12            0.200973   0.000759
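To see why these estimates are hard to read as a recipe, it helps to look at how one hidden unit combines them. The Python sketch below uses the seven H11 weights listed in the output. Everything else is an assumption: a tanh activation, standardized inputs, and a bias of 0 (the bias estimate is not shown in the listing).

```python
from math import tanh

# Input-to-hidden weights for unit H11, copied from the parameter estimates.
H11_WEIGHTS = {
    "alcohol": 3.679818,
    "chlorides": 0.520190,
    "density": -2.171623,
    "fixed_acidity": -0.055929,
    "free_sulfur_dioxide": 0.403412,
    "sulphates": -4.954290,
    "volatile_acidity": 2.686209,
}

def hidden_unit(inputs, weights, bias=0.0):
    """Activation of one hidden unit: tanh of the weighted input sum.

    `bias=0.0` and the tanh activation are assumptions for illustration.
    Inputs missing from `inputs` are treated as 0 (the mean, if standardized).
    """
    z = bias + sum(w * inputs.get(name, 0.0) for name, w in weights.items())
    return tanh(z)

# A wine one standard deviation above average in alcohol, average otherwise.
print(hidden_unit({"alcohol": 1.0}, H11_WEIGHTS))
```

Each input feeds all three hidden units, and the hidden units are combined again at the output, so the sign of a single weight does not say whether raising that property raises or lowers the predicted probability of a good wine.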
Assess: Comparative Results • Receiver Operating Characteristic (ROC) Chart for NN vs Decision Tree
Assess: Comparative Results • Cumulative Lift for NN vs Decision Tree
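Cumulative lift compares the response rate among the highest-scored records with the overall response rate. A minimal Python sketch of the calculation follows; the scores and labels are made-up toy values, not output from either model, and the function name `cumulative_lift` is hypothetical.

```python
def cumulative_lift(scores, labels, fraction):
    """Lift in the top `fraction` of records when ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# Toy predicted probabilities of isGood and the true isGood flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for depth in (0.2, 0.5, 1.0):
    print(depth, round(cumulative_lift(scores, labels, depth), 3))
```

A model with better ranking keeps its lift higher for longer as the depth increases; at a depth of 100% every model has a lift of 1.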
Assess: Comparison with Reference Paper • The authors used R-Miner • Support Vector Machine (SVM) and Neural Network models were used • They applied techniques to extract the relative importance of variables • They attempted to predict every quality level • They noted the importance of alcohol and sulphates: “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”
Assess: Paper Variable Importance
Overall Project in SAS EM
References • UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Wine • P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. • Paulo Cortez et al. Modeling wine preferences by data mining from physicochemical properties. http://www3.dsi.uminho.pt/pcortez/wine5.pdf