Introduction to Machine Learning
Rebecca Bilbro and Star Ying
7/11/2016
Star Ying ([email protected])
Data Scientist, Commerce Data Service
Researcher and Mentor, International Strategy and Reconciliation Foundation
Tutor, Horton’s Kids
Dr. Rebecca Bilbro ([email protected])
Data Scientist, Commerce Data Service
Board Member, Data Community DC
Faculty, Georgetown School of Continuing Studies and District Data Labs
Commerce Data Academy
● A data education initiative of the Commerce Data Service.
● Launched by CDS to offer data science, data engineering, and web development training to employees of the US Department of Commerce.
● Course schedule and materials (e.g. slides, code, papers) produced for the Commerce Data Academy are available on Github.
● Questions? Feel free to write us at Data Academy ([email protected]).
Goals
Our goals for the class
● Teach you the vocabulary used in machine learning.
● Teach you how to determine if a problem is suited to machine learning.
Your goals for the class
● Become conversant in the terminology of machine learning.
● Begin to be able to identify if a problem is suited to machine learning.
Outline
● Review
● Thought Process
● Process flow
● How it maps to python, pandas, sklearn
● Simple examples
○ Unsupervised Learning (NIST News)
○ Supervised Learning (Titanic)
Prerequisites
1. Create your own Github account
2. Download/install Git
3. Download/install Anaconda's Python distribution (If you have Anaconda, you shouldn’t need anything more for this class. If you are using a different Python distro, make sure you also have Jupyter notebook, NumPy, Pandas, Matplotlib, and Sqlite3 installed.)
4. Verify your access to Terminal (Mac) or Powershell (Windows)
Any challenges? Questions?
Open Source Installations
We use free, open source software, so it should have minimal impact on your IT department! DOC has provided guidance stating that Github and all the tools that we are teaching are permissible under policy. However, it is up to the CIO of each bureau to accept this guidance or not. DOC has a formalized Github policy: https://github.com/CommerceGov/Policiesand-Guidance/blob/master/GithubGuidanceforDepartmentofCommerce.md
Review
What is data science?
“Data science is the practice of transforming raw data into insights, products, and applications to empower data-driven decision making. It combines proven, time-tested methods from fields including statistics, natural sciences, computer science, operations research, and design in ways that are particularly well-suited to the data age. These methods, which range from data mining and visualization to predictive modeling, can scale from small to large datasets and can handle structured data as well as unstructured data like text and images.”
Jeff Chen, Chief Data Scientist, U.S. Department of Commerce
What is the data science pipeline?
Data Ingestion
Data Munging and Wrangling
Computation and Analyses
Reporting and Visualization
Modeling and Application
So then what is machine learning?
Machine Learning Practitioner Domains: Statistician, Expert, Programmer
Machine Learning Practitioner Continuum: from Hacker to Academic
In the context of data mining and statistics...
Learning by Example
Given a bunch of examples (data), extract a meaningful pattern upon which to act.
Problem Domain: Machine Learning Class
● Infer a function from labeled data: Supervised learning
● Find structure of data without feedback: Unsupervised learning
● Interact with environment towards goal: Reinforcement learning
Unstable Data
Randomness is a significant part of data in the real world, but problems with data can significantly affect results:
- outliers
- skew
- missing information
- incorrect data
Unpredictable Future
Machine learning models attempt to predict the future as new inputs come in, but human systems and processes are subject to change.
Thought Process
Process Flow
How do you (as a human) make predictions?
What patterns do you see?
How about now?
Will the next point be to the right or the left of the dotted line?
How about now?
Should we make the un-colored dots red or blue?
How about now?
Types of Algorithms by Output
Input training data to fit a model, which is then used to map incoming inputs into ...
Type of Output: Algorithm Category
● Output is one or more discrete classes: Classification (supervised)
● Output is continuous: Regression (supervised)
● Output is membership in a similar group: Clustering (unsupervised)
● Output is the distribution of inputs: Density Estimation
● Output is simplified from higher dimensions: Dimensionality Reduction
Classification
Given labeled input data (with two or more possible labels), fit a function that can determine the label for any input.
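A minimal sketch of classification, assuming scikit-learn is installed. The bundled iris dataset and the k-nearest-neighbors model are illustrative choices, not taken from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: X holds flower measurements, y holds one of three species labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a function from features to labels, then label inputs it has not seen
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
predicted_labels = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print(predicted_labels[:5], accuracy)
```

The output is always one of the discrete labels seen during training, which is what distinguishes classification from regression.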
Regression
Given input data labeled with continuous values, fit a function that can predict the continuous value for a new input from its other attributes.
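A small regression sketch, assuming scikit-learn and NumPy. The data is synthetic (a made-up line y = 3x + 2 plus noise), so the fitted values can be compared against a known answer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the continuous label follows y = 3x + 2 with some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Fit a function that maps the input feature to a continuous value
model = LinearRegression()
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Predict the continuous value for an input that was not in the training data
prediction = model.predict(np.array([[4.0]]))[0]
print(slope, intercept, prediction)
```

The fitted slope and intercept should land close to 3 and 2, and the prediction for x = 4 close to 14.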
Clustering
Given unlabeled data, determine a pattern of associated data points or clusters via their similarity or distance from one another.
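A minimal clustering sketch, assuming scikit-learn and NumPy. The two well-separated groups of points are synthetic, and k-means is just one possible algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two synthetic groups of 2D points, no labels are given to the model
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# k-means groups points by their distance to the nearest cluster center
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids[:5], cluster_ids[-5:])
print(kmeans.cluster_centers_)
```

The cluster ids are arbitrary (0 or 1); what matters is that points from the same group receive the same id.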
Dimensions and Features
In order to do machine learning you need a data set containing instances (examples) that are composed of features from which you compose dimensions.
Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property
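In code, a data set is commonly stored as a 2D array with one row per instance and one column per feature. A small sketch assuming NumPy, where the feature names and values are made up for illustration:

```python
import numpy as np

# Each row is one instance; each column is one feature (all values are made up)
feature_names = ["age", "fare", "siblings_aboard"]
X = np.array([
    [22.0, 7.25, 1.0],
    [38.0, 71.28, 1.0],
    [26.0, 7.93, 0.0],
    [35.0, 53.10, 1.0],
])

n_instances, n_features = X.shape
first_instance = dict(zip(feature_names, X[0]))
print(n_instances, n_features, first_instance)
```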
A tour of (some) machine learning model families
Hadley Wickham (2015)
“Model” is an overloaded term.
• Model family describes, at the broadest possible level, the connection between the variables of interest.
• Model form specifies exactly how the variables of interest are connected within the framework of the model family.
• A fitted model is a concrete instance of the model form where all parameters have been estimated from data, and the model can be used to generate predictions.
Models: Regression
Model the relationship of independent variables X to a dependent variable Y by iteratively minimizing the error made in predictions.
● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
Models: Regularization Methods
Extend another method (usually regression) by penalizing complexity (to minimize overfitting).
- simple, popular, powerful
- better at generalization
● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
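To see the complexity penalty at work, the sketch below compares ordinary least squares with Ridge and LASSO on synthetic data where only two of ten features matter. It assumes scikit-learn and NumPy, and the alpha values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 10 features, but only the first two influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # penalizes the squared size of the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # penalizes the absolute size of the coefficients

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zeroed = int(np.sum(lasso.coef_ == 0.0))
print(ols_norm, ridge_norm, n_zeroed)
```

Ridge shrinks the coefficients toward zero, while LASSO tends to set the coefficients of irrelevant features to exactly zero, which is why it doubles as a feature selection method.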
Models: Decision Trees
Model of decisions based on data attributes. Predictions are made by following forks in a tree structure until a decision is made. Used for classification & regression.
● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
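The "forks" of a fitted tree can be printed as text. A minimal sketch, assuming scikit-learn 0.21 or later (for `export_text`) and using the bundled iris dataset as an illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a fork on one attribute; leaves show the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Predicting means following the forks from the root down to a leaf
prediction = tree.predict(iris.data[:1])
print(prediction)
```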
Models: Clustering Methods
Organize data into groups whose members share maximum similarity (usually defined by a distance metric). Two main approaches: centroids and hierarchical clustering.
● k-Means
● Affinity Propagation
● OPTICS (Ordering Points To Identify the Clustering Structure)
● Agglomerative Clustering
Models: Ensemble Methods
Models composed of multiple weak models that are trained independently and whose outputs are combined to make an overall prediction.
● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
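As a rough illustration of bagging, one of the methods listed above, the sketch below trains several shallow trees on bootstrap samples and combines them by majority vote. It is written by hand for clarity rather than using scikit-learn's own ensemble classes; the dataset, tree depth, and number of trees are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train each shallow (weak) tree independently on its own bootstrap sample
trees = []
for _ in range(25):
    sample = rng.integers(0, len(X), size=len(X))  # row indices drawn with replacement
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    trees.append(tree.fit(X[sample], y[sample]))

# Combine the outputs: every tree votes and the most common class wins
votes = np.array([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
ensemble_prediction = np.array(
    [np.bincount(column, minlength=3).argmax() for column in votes.T]
)
ensemble_accuracy = float(np.mean(ensemble_prediction == y))
print(ensemble_accuracy)
```

Note that the accuracy here is measured on the training data, so it only shows that the voting mechanism works, not how well the ensemble generalizes.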
Discuss thought process/problem
Break
In Practice with Python
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are called SciKits. SciKit-Learn provides machine learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and Numpy
● Standard Python API interface
● Sits on top of C libraries, LAPACK, LibSVM, and Cython
● Open Source: BSD License (packaged in many Linux distributions)
Probably the best general ML framework out there.
Features of Scikit-Learn
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading
The Scikit-Learn API
Object-oriented interface centered around the concept of an Estimator:
“An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.”
- Scikit-Learn Tutorial
class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response of ``X``."""
        # compute predictions ``pred``
        return pred
Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape (n_samples, n_features).
- y is a 1D array of shape (n_samples,).
- predict(X) returns the class or value.
- predict_proba(X) returns a 2D array of shape (n_samples, n_classes).
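To make the fit/predict contract concrete, here is a toy estimator written from scratch. `MajorityClassifier` is a hypothetical class invented for this illustration, not part of scikit-learn; it only assumes NumPy:

```python
import numpy as np

class MajorityClassifier:
    """Toy estimator (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]  # the learned state of the estimator
        return self

    def predict(self, X):
        # One prediction per row of X
        return np.full(len(X), self.majority_)

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]])  # shape (n_samples, n_features)
y = np.array([1, 1, 1, 0])                                      # shape (n_samples,)

predictions = MajorityClassifier().fit(X, y).predict(X)
print(predictions)
```

Returning `self` from `fit` is what allows the chained call `fit(X, y).predict(X)`.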
Basic Deployment

from sklearn import svm

estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)            # X: training features, y: training labels
estimator.predict(X_new)       # X_new: unseen samples with the same features as X
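A runnable version of the same pattern, assuming scikit-learn is installed. The bundled handwritten digits dataset and the hold-out of the last ten samples are illustrative choices:

```python
from sklearn import datasets, svm

# Load a small bundled dataset of 8x8 digit images flattened to 64 features
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Fit on everything except the last ten samples
estimator = svm.SVC(gamma=0.001)
estimator.fit(X[:-10], y[:-10])

# Predict the held-out samples and compare with their true labels
predictions = estimator.predict(X[-10:])
print(predictions)
print(y[-10:])
```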
Transformers

import numpy as np
from sklearn import preprocessing
from sklearn.impute import SimpleImputer

class Transformer(Estimator):

    def transform(self, X):
        """Transforms the input data."""
        # transform ``X`` to ``X_prime``
        return X_prime

Xt = preprocessing.normalize(X)   # function form of Normalizer
Xt = preprocessing.scale(X)       # function form of StandardScaler

# SimpleImputer replaces the older preprocessing.Imputer (scikit-learn >= 0.20)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Xt = imputer.fit_transform(X)
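A small runnable sketch of two transformers applied in sequence, assuming scikit-learn 0.20 or later and NumPy. The array values are made up so the result is easy to check by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One missing value in the second column
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# Replace the missing value with the column mean (the mean of 10 and 30)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
print(X_filled)
print(X_scaled)
```

`fit_transform` learns the statistics (column means, standard deviations) and applies them in one call; calling `transform` later on new data reuses the statistics learned here.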
Choosing the right estimator (model)
Model Evaluation
Underfitting
The model captures too little information to accurately represent real life. This can be due to high bias, or simply an overly simplistic model.
Overfitting
The model has too many parameters or is too complex. It “memorizes” the data and can’t generalize well to new inputs.
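One way to observe both effects is to fit polynomials of increasing degree to noisy data and compare the error on the training points with the error on fresh points. This is a sketch on synthetic data (a noisy sine curve) assuming only NumPy; the degrees 1, 3, and 12 are arbitrary:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data: a sine curve plus noise, with separate training and test points
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# results[degree] = (training error, test error)
results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(degree, results[degree])
```

The training error can only go down as the degree grows. A straight line (degree 1) is too simple for a sine curve, while a very high degree typically starts fitting the noise, which tends to show up as a test error that stops improving or gets worse.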
Error: Bias vs. Variance
http://scott.fortmann-roe.com/docs/BiasVariance.html
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)
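The two formulas can be checked on a tiny made-up example. The sketch below counts the cases by hand and compares the result with scikit-learn's `precision_score` and `recall_score`:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 is the positive class, 0 the negative class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in pairs)
false_positives = sum(t == 0 and p == 1 for t, p in pairs)
false_negatives = sum(t == 1 and p == 0 for t, p in pairs)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so both precision and recall come out to 0.75.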
MSE & R-Squared
In regression we can determine how well the model fits by computing the mean squared error and the coefficient of determination.
Mean Squared Error (MSE) = E((ŷ-y)^2)
Coefficient of Determination (R2) is a measure of “goodness of fit”. For a least-squares fit with an intercept, evaluated on its own training data, it is a value ∈ [0,1] where 1 is a perfect fit; on new data it can be negative when the model does worse than always predicting the mean. It describes how much of the variance in the dependent variable can be explained by the independent variables.
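Both quantities are short to compute by hand. The sketch below uses made-up values and the standard definition R2 = 1 - SS_res / SS_tot (residual sum of squares over total sum of squares), then compares against scikit-learn's `mean_squared_error` and `r2_score`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: average of the squared prediction errors
mse = np.mean((y_pred - y_true) ** 2)

# R2: one minus the ratio of unexplained variation to total variation
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

For these values the squared errors sum to 1.5, giving an MSE of 0.375 and an R2 of roughly 0.949.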
Other means of evaluation
How to evaluate clusters? Visualization (but only in 2D)
Clustering Demo
Break
Titanic Demo
How do you operationalize machine learning?
Architecture of Machine Learning Operations
Machine learning in the wild
Further Reading
● Wasserman, Larry. All of Statistics: A Concise Course in Statistical Inference.
● Gelman, Andrew. Bayesian Data Analysis.
● Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining Massive Datasets.
● Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
● Fox, John. Applied Regression Analysis and Generalized Linear Models.
● Bishop, Christopher. Pattern Recognition and Machine Learning.
● Segaran, Toby. Programming Collective Intelligence.
● Kirk, Matthew. Thoughtful Machine Learning.
Additional Resources
● MIT OpenCourseware
○ Machine Learning
○ Statistics
○ Probability
○ Linear Algebra
○ Algorithms
○ Optimization
● An Introduction to Machine Learning with Python (Rebecca Bilbro)
● Machine Learning map for scikit learn and in general
● Binge watch machine learning
● Introduction to Statistical Learning in R
Special thanks to my teachers:
Benjamin Bengfort github.com/bbengfort @bbengfort
Allen Leis github.com/looselycoupled @looselycoupled
Faculty at Georgetown School of Continuing Studies
Graduate students at the University of Maryland, College Park
(These are mostly their slides!)
Questions?
[email protected]