Data Preparation: Get the Most out of Your Data, Big and Small IEEE San Diego Chapter Lecture Tamara B. Sipes, Ph.D. August 17th, 2016
Agenda • Introduction • Data Preparation: Why? • Data Preparation: How? • Challenges and Solutions • Conclusions & Future Directions
Introduction o Background o Evolution o Paradigms and Tools o The Process
Best Job in America?
Data Science • Extraction or “mining” of knowledge and actionable insights from data • Data-driven discovery and modeling of hidden patterns (we never knew existed) in large and/or complex data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
Multidisciplinary Field [Figure: Data Science and its contributing fields: Artificial Intelligence, Statistics, Machine Learning, Predictive Analytics, Big Data, Databases, High Performance Computing. Sipes (2016)]
History and Jargon • Emerged late 1980s • Flourished in 1990s • Multidisciplinary field • Roots traced back along three family lines: o Classical Statistics o Artificial Intelligence o Machine Learning
• Pattern Recognition • Database Mining • Knowledge Discovery from Databases (KDD) • Data Mining • Knowledge Extraction • Knowledge Mining • Predictive Analytics • Advanced Analytics • Intelligent Analytics • Business Intelligence • Most recently: Data Science
Major Paradigms • Data Type o Supervised o Unsupervised o Semi-Supervised • Data Availability o Incremental o Non-incremental • Output Type o Descriptive o Predictive o Prescriptive
Basic Methods • Clustering: k-means, EM, hierarchical • Association Rules • Bayesian Learners • Instance-Based Learning • Regression • Classification Rules: List, Table, Decision Trees • Numeric Prediction: Regression and Model Trees • Random Forests • Hidden Markov Model (HMM) • Artificial Neural Networks (ANN) • Support Vector Machines (SVM)
Advanced Methods • Semi-Supervised Learning • Anomaly Detection Methods • Ensemble Learning • Deep Learning
CRISP-DM Methodology • Cross Industry Standard Process for Data Mining o http://www.crisp-dm.org/
• Six Phases: o Business Understanding o Data Understanding o Data Preparation o Modeling o Evaluation o Deployment
Data Preparation - Why? o State of Affairs o Motivation and Importance o Definition and Goal
Data Preparation – 80%
www.forbes.com, #1418443d7f75, March 23, 2016
The Least Favorite Part
www.forbes.com, #1418443d7f75, March 23, 2016
Motivation and Importance • Data in the real world is far from clean o incomplete: lacking values, lacking attributes of interest, or aggregated o noisy: containing errors or outliers o inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results: “Garbage in, garbage out” • Need to know what to do with the “unclean” data • A crucial step of any predictive analytics project • Could be the difference between a successful project and a failure
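The three kinds of unclean data named on this slide (incomplete, noisy, inconsistent) can each be checked mechanically before any modeling starts. What follows is a minimal sketch in plain Python, not a method from the lecture. The records, the field names `age` and `state`, and the outlier rule (distance from the median in units of median absolute deviation, with an arbitrary threshold of 5) are all illustrative assumptions.

```python
from statistics import median

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"age": 34, "state": "CA"},
    {"age": None, "state": "ca"},      # incomplete value, inconsistent code
    {"age": 29, "state": "Calif."},    # inconsistent code
    {"age": 31, "state": "CA"},
    {"age": 420, "state": "CA"},       # noisy: likely a data-entry error
    {"age": 38, "state": "NY"},
]

def count_missing(rows, field):
    """Incomplete data: count rows where the field has no value."""
    return sum(1 for r in rows if r.get(field) is None)

def flag_outliers(rows, field, threshold=5.0):
    """Noisy data: flag values far from the median, measured in MAD units."""
    values = [r[field] for r in rows if r.get(field) is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - med) / mad > threshold]

def distinct_codes(rows, field):
    """Inconsistent data: list the distinct spellings used for a code."""
    return sorted({r[field] for r in rows if r.get(field) is not None})

missing_age = count_missing(records, "age")
age_outliers = flag_outliers(records, "age")
state_codes = distinct_codes(records, "state")
print(missing_age, age_outliers, state_codes)
```

A report like this only locates the problems. Whether to impute, drop, or recode the affected values remains a judgment call that depends on the project.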
Definition • Data preparation is the process of cleaning, filtering, and organizing data for successful mining and modeling, by solving or avoiding problems in the data and presenting the data to the modeling schema in the optimal way • When data is properly prepared: o high-quality modeling results are more likely o the quality of the models produced depends primarily on the content of the data, not so much on the modeler’s expertise level • No magic bullet: a quality model can only be produced using adequate data • No general-purpose technique: preparation is half art, half science
Data Preparation - How? o Prerequisites o Managing Variables o Building Mineable Datasets
Prerequisites • Exploring the Problem Space • Defining the Solution Space • Data History and Data Understanding
Exploring the Problem Space • A crucial starting point • Avoids misconceptions about, and unrealistic expectations of, a predictive analytics project • A MUST: identify the right problem to solve • Define the right target variable! • Examples: o Which activity patterns are most likely fraudulent? Binary or categorical target? o Overpayment: binary flag or continuous dollar amount? o Days to next purchase: continuous or binary?
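The overpayment example shows how one raw quantity can yield two different target variables, and so two different modeling problems. This is a minimal sketch in plain Python. The claim records, the field names `paid` and `allowed`, and the zero-dollar cutoff are hypothetical, chosen only to illustrate the choice.

```python
# Hypothetical claims; amounts are invented for illustration.
claims = [
    {"id": 1, "paid": 120.0, "allowed": 120.0},
    {"id": 2, "paid": 300.0, "allowed": 250.0},
    {"id": 3, "paid": 80.0, "allowed": 95.0},
]

def overpayment_amount(claim):
    """Continuous target: dollars paid above the allowed amount (never negative)."""
    return max(claim["paid"] - claim["allowed"], 0.0)

def is_overpaid(claim, cutoff=0.0):
    """Binary target: 1 if the overpayment exceeds the cutoff, else 0."""
    return 1 if overpayment_amount(claim) > cutoff else 0

# Same data, two candidate targets.
continuous_target = [overpayment_amount(c) for c in claims]  # suits regression
binary_target = [is_overpaid(c) for c in claims]             # suits classification
print(continuous_target, binary_target)
```

The continuous target leads toward a numeric prediction model and the binary one toward a classifier. Which is right depends on the business question, which is why the target should be fixed before any data is prepared.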