Data Preparation - Get the Most Out of Your Data

Data Preparation: Get the Most out of Your Data, Big and Small IEEE San Diego Chapter Lecture Tamara B. Sipes, Ph.D. August 17th, 2016

Agenda • Introduction • Data Preparation: Why? • Data Preparation: How? • Challenges and Solutions • Conclusions & Future Directions

Introduction o Background o Evolution o Paradigms and Tools o The Process

Best Job in America?

Data Science • Extraction or “mining” of knowledge and actionable insights from data • Data-driven discovery and modeling of hidden patterns (we never knew existed) in large and/or complex data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

Multidisciplinary Field [Diagram: Data Science shown alongside Artificial Intelligence, Statistics, Machine Learning, Predictive Analytics, Big Data, Databases, and High Performance Computing. Source: Sipes (2016)]

History and Jargon • Emerged late 1980s • Flourished in 1990s • Multidisciplinary field • Roots traced back along three family lines: o Classical Statistics o Artificial Intelligence o Machine Learning

• Pattern Recognition • Database Mining • Knowledge Discovery from Databases (KDD) • Data Mining • Knowledge Extraction • Knowledge Mining • Predictive Analytics • Advanced Analytics • Intelligent Analytics • Business Intelligence • Most recently: Data Science

Major Paradigms • Data Type o Supervised o Unsupervised o Semi-Supervised • Data Availability o Incremental o Non-incremental • Output Type o Descriptive o Predictive o Prescriptive

Basic Methods • Clustering: k-means, EM, hierarchical • Association Rules • Bayesian Learners • Instance-Based Learning • Regression • Classification Rules: List, Table, Decision Trees • Numeric Prediction: Regression and Model Trees • Random Forests • Hidden Markov Model (HMM) • Artificial Neural Networks (ANN) • Support Vector Machines (SVM)

Advanced Methods • Semi-Supervised Learning • Anomaly Detection Methods • Ensemble Learning • Deep Learning

CRISP-DM Methodology • Cross Industry Standard Process for Data Mining o http://www.crisp-dm.org/

• Six Phases: o Business Understanding o Data Understanding o Data Preparation o Modeling o Evaluation o Deployment

Data Preparation - Why? o State of Affairs o Motivation and Importance o Definition and Goal

Data Preparation – 80%

www.forbes.com , #1418443d7f75, March 23, 2016

The Least Favorite Part

www.forbes.com , #1418443d7f75, March 23, 2016

Motivation and Importance • Data in the real world is far from clean o incomplete: lacking values, lacking attributes of interest, or aggregated o noisy: containing errors or outliers o inconsistent: containing discrepancies in codes or names

• No quality data, no quality mining results: “Garbage in, garbage out” • Need to know what to do with “unclean” data • A crucial step of any predictive analytics project • Can be the difference between a successful project and a failure
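The three kinds of unclean data named above (incomplete, noisy, inconsistent) can each be checked for before modeling. The following is a minimal sketch, not a method from the lecture: the records, the field names, the list of valid codes, and the outlier rule (a modified z-score based on the median absolute deviation, with a threshold of 3.5) are all assumptions chosen for illustration.

```python
from statistics import median

# Hypothetical records; ids, field names, and values are invented for illustration.
records = [
    {"id": 1, "state": "CA", "amount": 120.0},
    {"id": 2, "state": "ca", "amount": None},      # missing value, nonstandard code
    {"id": 3, "state": "Calif.", "amount": 95.0},  # nonstandard code
    {"id": 4, "state": "NV", "amount": 110.0},
    {"id": 5, "state": "NV", "amount": 9800.0},    # suspiciously large amount
    {"id": 6, "state": "CA", "amount": 105.0},
]


def audit(rows, numeric_field, code_field, valid_codes, threshold=3.5):
    """Return ids of rows that look incomplete, noisy, or inconsistent."""
    # Incomplete: the numeric value is missing.
    incomplete = [r["id"] for r in rows if r[numeric_field] is None]

    # Inconsistent: the code is not one of the agreed spellings.
    inconsistent = [r["id"] for r in rows if r[code_field] not in valid_codes]

    # Noisy: modified z-score using the median absolute deviation (MAD),
    # which is less distorted by the outliers it is trying to find.
    present = [r for r in rows if r[numeric_field] is not None]
    values = [r[numeric_field] for r in present]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    noisy = [
        r["id"]
        for r in present
        if mad > 0 and 0.6745 * abs(r[numeric_field] - med) / mad > threshold
    ]
    return {"incomplete": incomplete, "noisy": noisy, "inconsistent": inconsistent}


report = audit(records, "amount", "state", valid_codes={"CA", "NV"})
print(report)
```

An audit like this only flags suspect rows. Whether to repair, impute, or drop them remains a judgment call that depends on the problem being solved.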

Definition • Data Preparation is the process of cleaning, filtering, and organizing data for successful mining and modeling, by solving or avoiding problems in the data and presenting the data to the modeling schema in the optimal way • When data is properly prepared: o high-quality modeling results are more likely o the quality of the models produced will depend primarily on the content of the data, not so much on the modeler’s expertise level • No magic bullet: a quality model can only be produced using adequate data • No general-purpose technique: preparation is half art, half science
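As a rough illustration of the three activities in the definition (cleaning, filtering, organizing), here is a small sketch in plain Python. The raw rows, the field names, the `STATE_ALIASES` map, and the choice to drop rows without a target value are hypothetical; a real project would pick its own rules.

```python
# Hypothetical raw rows; field names and values are invented for illustration.
raw = [
    {"state": " ca ", "age": "34", "churned": "yes"},
    {"state": "NV", "age": "", "churned": "no"},
    {"state": "Calif.", "age": "51", "churned": None},  # no target value
]

# Assumed mapping from observed spellings to one canonical code.
STATE_ALIASES = {"ca": "CA", "calif.": "CA", "nv": "NV"}


def clean(row):
    """Cleaning: normalize codes and convert types; empty strings become None."""
    state = STATE_ALIASES.get(row["state"].strip().lower())
    age = int(row["age"]) if row["age"] else None
    return {"state": state, "age": age, "churned": row["churned"]}


def keep(row):
    """Filtering: a row without a target value cannot train a supervised model."""
    return row["churned"] is not None


def organize(row):
    """Organizing: fixed column order with a numeric target for the modeling tool."""
    return (row["state"], row["age"], 1 if row["churned"] == "yes" else 0)


prepared = [organize(r) for r in map(clean, raw) if keep(r)]
print(prepared)
```

The missing age in the second row is deliberately left as `None`: how to treat it (impute, flag, or drop) is one of the "half art" decisions the definition alludes to.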

Data Preparation - How? o Prerequisites o Managing Variables o Building Mineable Datasets

Prerequisites • Exploring the Problem Space • Defining the Solution Space • Data History and Data Understanding

Exploring the Problem Space • A crucial starting point • Avoids misconceptions about, and unrealistic expectations of, a predictive analytics project • A MUST: identify the right problem to solve • Define the right target variable! • Examples: o Which activity patterns are most likely fraudulent? Binary or categorical target? o Overpayment: binary flag or continuous dollar amount? o Days to next purchase: continuous count or binary indicator?
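The target-variable examples above can be made concrete: the same raw fields support either a continuous or a binary target, and the choice changes which modeling task is being solved. The sketch below assumes hypothetical claim records, hypothetical field names (`billed`, `allowed`, `days_to_next_purchase`), and an arbitrary 30-day cutoff. None of these come from the lecture.

```python
# Hypothetical records; field names, values, and the 30-day cutoff are assumptions.
claims = [
    {"claim_id": "A", "billed": 500.0, "allowed": 500.0, "days_to_next_purchase": 12},
    {"claim_id": "B", "billed": 750.0, "allowed": 600.0, "days_to_next_purchase": 45},
    {"claim_id": "C", "billed": 300.0, "allowed": 320.0, "days_to_next_purchase": 30},
]


def overpayment_amount(claim):
    """Continuous target: dollars billed above the allowed amount (never negative)."""
    return max(claim["billed"] - claim["allowed"], 0.0)


def is_overpaid(claim):
    """Binary target: 1 if any overpayment occurred, else 0."""
    return int(overpayment_amount(claim) > 0)


def returns_within(claim, days=30):
    """Binary target derived from a continuous one: next purchase within `days`."""
    return int(claim["days_to_next_purchase"] <= days)


targets = [
    {
        "claim_id": c["claim_id"],
        "overpayment_amount": overpayment_amount(c),  # regression target
        "is_overpaid": is_overpaid(c),                # classification target
        "returns_within_30": returns_within(c),       # classification target
    }
    for c in claims
]

for t in targets:
    print(t)
```

The continuous versions keep more information but are typically harder to predict well. The binary versions discard magnitude in exchange for a simpler classification problem.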