Introduction to Machine Learning


Introduction to Machine Learning Rebecca Bilbro and Star Ying 7/11/2016

Star Ying ([email protected])
● Data Scientist, Commerce Data Service
● Researcher and Mentor, International Strategy and Reconciliation Foundation
● Tutor, Horton’s Kids

Dr. Rebecca Bilbro ([email protected])
● Data Scientist, Commerce Data Service
● Board Member, Data Community DC
● Faculty, Georgetown School of Continuing Studies and District Data Labs

Commerce Data Academy
● A data education initiative of the Commerce Data Service.
● Launched by CDS to offer data science, data engineering, and web development training to employees of the US Department of Commerce.
● Course schedule and materials (e.g. slides, code, papers) produced for the Commerce Data Academy are on Github.
● Questions? Feel free to write us at Data Academy ([email protected]).

Goals

Our goals for the class
● Teach you the vocabulary used in machine learning.
● Teach you how to determine if a problem is suited to machine learning.

Your goals for the class
● Become conversant in the terminology of machine learning.
● Begin to be able to identify if a problem is suited to machine learning.

Outline
● Review
● Thought Process
● Process flow
● How it maps to python, pandas, sklearn
● Simple examples
  ○ Unsupervised Learning (NIST News)
  ○ Supervised Learning (Titanic)

Prerequisites


Prerequisites
1. Create your own Github account
2. Download/install Git
3. Download/install Anaconda's Python distribution (If you have Anaconda, you shouldn’t need anything more for this class. If you are using a different Python distro, make sure you also have Jupyter notebook, NumPy, Pandas, Matplotlib, and Sqlite3 installed.)
4. Verify your access to Terminal (Mac) or Powershell (Windows)

Any challenges? Questions?

Open Source Installations
1. We use open source and free software, so the installations should have a minimal impact on your IT department!
2. DOC has provided guidance that states that Github and all the tools that we are teaching are permissible under policy.
3. However, it is up to the CIO of each bureau to accept this guidance policy or not.
4. DOC has a formalized Github policy: https://github.com/CommerceGov/Policiesand-Guidance/blob/master/GithubGuidanceforDepartmentofCommerce.md

Review


What is data science?


“Data science is the practice of transforming raw data into insights, products, and applications to empower data-driven decision making. It combines proven, time-tested methods from fields including statistics, natural sciences, computer science, operations research, and design in ways that are particularly well-suited to the data age. These methods, which range from data mining and visualization to predictive modeling, can scale from small to large datasets and can handle structured data as well as unstructured data like text and images.”
Jeff Chen, Chief Data Scientist, U.S. Department of Commerce

What is the data science pipeline?


Data Ingestion

Data Munging and Wrangling

Computation and Analyses

Reporting and Visualization

Modeling and Application


So then what is machine learning?


[Figure: Machine Learning Practitioner Domains (Statistician, Expert, Programmer) and Machine Learning Practitioner Continuum (Hacker, Academic)]

In the context of data mining and statistics...


Learning by Example
Given a bunch of examples (data), extract a meaningful pattern upon which to act.

Problem Domain | Machine Learning Class
Infer a function from labeled data | Supervised learning
Find structure of data without feedback | Unsupervised learning
Interact with environment towards goal | Reinforcement learning

Unstable Data
Randomness is a significant part of data in the real world, but problems with data can significantly affect results:
- outliers
- skew
- missing information
- incorrect data

Unpredictable Future Machine learning models attempt to predict the future as new inputs come in - but human systems and processes are subject to change.


Thought Process


Process Flow


How do you (as a human) make predictions?


What patterns do you see?

How about now?

Will the next point be to the right or the left of the dotted line?

How about now?

Should we make the un-colored dots red or blue?

How about now?

Types of Algorithms by Output
Input training data to fit a model, which is then used to predict incoming inputs into ...

Type of Output | Algorithm Category
Output is one or more discrete classes | Classification (supervised)
Output is continuous | Regression (supervised)
Output is membership in a similar group | Clustering (unsupervised)
Output is the distribution of inputs | Density Estimation
Output is simplified from higher dimensions | Dimensionality Reduction

Classification

Given labeled input data (with two or more labels), fit a function that can determine the label of any input.
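As a minimal sketch of "fit a function that determines the label", the block below uses a one-nearest-neighbor rule written in NumPy. The rule, the function name `predict_nearest_neighbor`, and the toy points are all assumptions chosen for illustration, not something the slides prescribe.

```python
import numpy as np

def predict_nearest_neighbor(X_train, y_train, X_new):
    """Label each row of X_new with the label of its closest training row."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise squared Euclidean distances, shape (n_new, n_train)
    diffs = X_new[:, None, :] - X_train[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    # For each new point, pick the label of the nearest training point
    return y_train[distances.argmin(axis=1)]

# Hypothetical toy data: two features per instance, two labels
X_train = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]]
y_train = ["blue", "blue", "red", "red"]
labels = predict_nearest_neighbor(X_train, y_train, [[2.0, 1.0], [7.0, 9.0]])
print(labels)
```

The "function" here is simply a lookup against the labeled examples; most classifiers compress the training data into parameters instead.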


Regression

Given input data labeled with continuous values, fit a function that is able to predict the continuous value for a new input.
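A small sketch of the idea, assuming a straight-line model fitted by least squares with NumPy. The hours/score data is made up for illustration, and `predict` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical data: hours studied (feature) and exam score (continuous target)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 54.0, 56.0, 58.0, 60.0])  # exactly 50 + 2 * hours

# Design matrix with a column of ones so the fit includes an intercept
A = np.column_stack([hours, np.ones_like(hours)])
(slope, intercept), *_ = np.linalg.lstsq(A, score, rcond=None)

def predict(x):
    """Predict a continuous value for a new input using the fitted line."""
    return slope * x + intercept

print(round(slope, 3), round(intercept, 3), round(predict(6.0), 3))
```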


Clustering

Given unlabeled data, determine a pattern of associated data points or clusters via their similarity or distance from one another.
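One way to make "clusters via distance" concrete is a bare-bones k-means loop (Lloyd's algorithm). This is a sketch under assumptions: the function name `k_means`, the fixed iteration count, and the six toy points are illustrative choices, not taken from the slides.

```python
import numpy as np

def k_means(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance)
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none)
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignments, centroids

# Hypothetical unlabeled data: two well-separated groups of three points
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = k_means(X, k=2)
print(assignments)
```

No labels are supplied; the grouping comes only from the distances between points. Which group receives index 0 or 1 depends on the random start.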


Dimensions and Features
In order to do machine learning, you need a data set containing instances (examples) that are composed of features from which you compose dimensions.
Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property
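In array terms, instances are commonly stored as rows and features as columns. The sketch below assumes that layout; the passenger values are invented for illustration.

```python
import numpy as np

# Hypothetical data set: each row is one instance (a passenger),
# each column is one feature (age, fare paid, number of siblings aboard)
X = np.array([
    [22.0, 7.25, 1],
    [38.0, 71.28, 1],
    [26.0, 7.93, 0],
    [35.0, 53.10, 1],
])

n_instances, n_features = X.shape
first_instance = X[0]      # one data point: all features of one passenger
age_feature = X[:, 0]      # one feature: the age of every passenger
print(n_instances, n_features)
```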

A tour of (some) machine learning model families


Hadley Wickham (2015): “Model” is an overloaded term.
• Model family describes, at the broadest possible level, the connection between the variables of interest.
• Model form specifies exactly how the variables of interest are connected within the framework of the model family.
• A fitted model is a concrete instance of the model form where all parameters have been estimated from data, and the model can be used to generate predictions.

Models: Regression
Model the relationship of independent variables, X, to a dependent variable, Y, by iteratively minimizing the error made in predictions.
● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)

Models: Regularization Methods
Extend another method (usually regression) by penalizing complexity to limit overfitting.
- simple, popular, powerful
- better at generalization
● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
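To illustrate what "penalizing complexity" does, here is a sketch of ridge regression in its closed form, written in NumPy. The helper `fit_linear`, the synthetic data, and the penalty strength `alpha=10.0` are assumptions for the example; the intercept is left out to keep it short.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Least squares with an L2 penalty of strength alpha (alpha=0 is plain OLS).

    Solves (X^T X + alpha * I) w = X^T y. No intercept, to keep the sketch short.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data: 30 instances, 5 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

w_ols = fit_linear(X, y, alpha=0.0)
w_ridge = fit_linear(X, y, alpha=10.0)
# The penalty shrinks the weight vector toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The penalized weights are smaller in norm than the unpenalized ones, which is the sense in which the model is "less complex".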

Models: Decision Trees
Model of decisions based on data attributes. Predictions are made by following forks in a tree structure until a decision is made. Used for classification & regression.
● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
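The simplest member of this family, the decision stump, is a tree with a single fork. The sketch below assumes one numeric feature, 0/1 labels, and a stump that only predicts 1 above the threshold; `fit_stump`, `predict_stump`, and the fare data are hypothetical.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest points.

    Predicts 1 when value > threshold, else 0. Labels are assumed to be 0 or 1.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in sorted(set(values)):
        predictions = [1 if v > threshold else 0 for v in values]
        errors = sum(p != y for p, y in zip(predictions, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

def predict_stump(threshold, value):
    """Follow the single fork: above the threshold means class 1."""
    return 1 if value > threshold else 0

# Hypothetical data: passenger fare and whether the passenger survived
fares = [5.0, 7.0, 8.0, 30.0, 50.0, 80.0]
survived = [0, 0, 0, 1, 1, 1]
threshold, errors = fit_stump(fares, survived)
print(threshold, errors)  # 8.0 0
```

A full tree repeats this search recursively on each side of the fork, over every feature.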

Models: Clustering Methods
Organize data into groups whose members share maximum similarity (usually defined by a distance metric). Two main approaches: centroids and hierarchical clustering.
● k-Means
● Affinity Propagation
● OPTICS (Ordering Points To Identify the Clustering Structure)
● Agglomerative Clustering

Models: Ensemble Methods
Models composed of multiple weak models that are trained independently and whose outputs are combined to make an overall prediction.
● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest

Discuss thought process/problem


Break


In Practice with Python


What is Scikit-Learn?


What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are called SciKits. SciKit-Learn provides machine learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and Numpy
● Standard Python API interface
● Sits on top of C libraries, LAPACK, LibSVM, and Cython
● Open Source: BSD License (part of Linux)

Probably the best general ML framework out there.


Features of Scikit-Learn
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading

The Scikit-Learn API


Object-oriented interface centered around the concept of an Estimator:
“An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.”
- Scikit-Learn Tutorial

class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response of ``X``."""
        # compute predictions ``pred``
        return pred
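To show the interface filled in, here is a toy estimator that follows the same fit/predict shape. `MeanRegressor` is a hypothetical class invented for this sketch; storing learned state in an attribute with a trailing underscore follows scikit-learn's naming convention.

```python
class MeanRegressor(object):
    """Toy estimator: always predicts the mean of the targets seen in fit."""

    def fit(self, X, y=None):
        # Learned state is stored on the instance, with a trailing underscore
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        # One prediction per row of X
        return [self.mean_ for _ in X]

X = [[1.0], [2.0], [3.0]]   # three instances, one feature each
y = [10.0, 20.0, 30.0]
pred = MeanRegressor().fit(X, y).predict([[4.0], [5.0]])
print(pred)  # [20.0, 20.0]
```

Because `fit` returns `self`, the calls can be chained as shown.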

Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape (n_samples, n_features).
- y is a 1D array with shape (n_samples,)
- predict(X) returns the class or value
- predict_proba() returns a 2D array of shape (n_samples, n_classes)
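The shapes above can be checked on a small example. This sketch assumes scikit-learn is installed and uses `LogisticRegression` purely as a convenient classifier that offers `predict_proba`; the toy data is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples, 2 features, 2 classes
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

estimator = LogisticRegression()
estimator.fit(X, y)                      # sets the state of the estimator

X_new = np.array([[0.5, 0.5], [5.5, 5.5]])
labels = estimator.predict(X_new)        # shape (n_samples,)
probs = estimator.predict_proba(X_new)   # shape (n_samples, n_classes)
print(X.shape, y.shape, labels.shape, probs.shape)
```

Each row of `probs` sums to 1, with one column per class.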

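Basic Deployment

    from sklearn import svm

    # X: 2D array of shape (n_samples, n_features); y: 1D array of labels
    estimator = svm.SVC(gamma=0.001)
    estimator.fit(X, y)
    # X_new: unseen samples with the same features as X
    estimator.predict(X_new)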

Transformers

    import numpy as np
    from sklearn import preprocessing
    from sklearn.impute import SimpleImputer

    class Transformer(Estimator):

        def transform(self, X):
            """Transforms the input data."""
            # transform ``X`` to ``X_prime``
            return X_prime

    Xt = preprocessing.normalize(X)  # function form of Normalizer
    Xt = preprocessing.scale(X)      # function form of StandardScaler

    # SimpleImputer replaces the older preprocessing.Imputer class
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    Xt = imputer.fit_transform(X)
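A transformer learns its parameters in `fit` and applies them in `transform`, so statistics learned on training data can be reused on new data. The sketch below assumes scikit-learn is installed and uses `StandardScaler` with invented numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_new = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # learns per-column mean and standard deviation
Xt_train = scaler.transform(X_train)
Xt_new = scaler.transform(X_new)    # reuses the statistics learned from X_train
print(Xt_train.mean(axis=0), Xt_train.std(axis=0), Xt_new)
```

After scaling, each training column has mean 0 and standard deviation 1; the new row is expressed relative to the training statistics rather than its own.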

Choosing the right estimator (model)


Model Evaluation


Underfitting
The model does not capture enough information to accurately model real life. This can be due to high bias, or just a model that is too simplistic.

Overfitting
The model has too many parameters or is too complex. It amounts to “memorization of the data”, and the model can’t generalize very well.
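One way to see both effects is to fit polynomials of increasing degree to the same small, noisy sample and compare error on the training points with error on fresh points. Everything in this sketch is an assumption for illustration: the sine-shaped data, the noise level, and the degrees 1, 3, and 9.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples from a hypothetical underlying curve."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training sample
x_test, y_test = make_data(200)     # fresh data the model never saw

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Store (training error, test error) for each model complexity
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

Training error can only go down as the degree grows, because each higher-degree fit contains the lower-degree one as a special case. A straight line typically leaves both errors high (underfitting), while a high-degree fit typically shows a widening gap between training and test error (overfitting).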

Error: Bias vs. Variance

http://scott.fortmann-roe.com/docs/BiasVariance.html


precision = true positives / (true positives + false positives)
recall = true positives / (false negatives + true positives)
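The two formulas can be computed directly from paired true and predicted labels. In this sketch the function name `precision_recall`, the choice to return 0.0 when a denominator is zero, and the sample labels are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or nothing is positive
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so both ratios are 3/4.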


MSE & R-Squared
In regressions we can determine how well the model fits by computing the mean squared error and the coefficient of determination.
Mean Squared Error (MSE) = E((ŷ-y)^2)
Coefficient of Determination (R2) is a measure of “goodness of fit”, where 1 is a perfect fit. For a least-squares fit with an intercept, evaluated on the training data, it is a value ∈ [0,1]; on new data it can fall below 0. It describes how much variance in the dependent variable can be explained by the independent variables.
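Both quantities are short to compute by hand. This NumPy sketch uses the common definition of R2 as one minus the ratio of residual to total sum of squares; the helper names and the four sample values are assumptions.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_pred - y_true) ** 2))

def r_squared(y_true, y_pred):
    """Fraction of the variation in y_true accounted for by the predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mean_squared_error(y_true, y_pred), r_squared(y_true, y_pred))
```

For these values the squared errors are 1, 0, 1, 0, giving an MSE of 0.5, and the residual sum of squares is 2 against a total of 20, giving an R2 of 0.9.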

Other means of evaluation
How to evaluate clusters? Visualization (but only in 2D)

Clustering Demo


Break


Titanic Demo


How do you operationalize machine learning?


Architecture of Machine Learning Operations


Machine learning in the wild


Further Reading
● Wasserman, Larry. All of Statistics: A Concise Course in Statistical Inference.
● Gelman, Andrew. Bayesian Data Analysis.
● Leskovec, Jure, and Anand Rajaraman and Jeffrey David Ullman. Mining Massive Datasets.
● Hastie, Trevor and Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
● Fox, John. Applied Regression Analysis and Generalized Linear Models.
● Bishop, Christopher. Pattern Recognition and Machine Learning.
● Segaran, Toby. Programming Collective Intelligence.
● Kirk, Matthew. Thoughtful Machine Learning.

Additional Resources
● MIT OpenCourseware
  ○ Machine Learning
  ○ Statistics
  ○ Probability
  ○ Linear Algebra
  ○ Algorithms
  ○ Optimization
● An Introduction to Machine Learning with Python (Rebecca Bilbro)
● Machine Learning map for scikit learn and in general
● Binge watch machine learning
● Introduction to Statistical Learning in R

Special thanks to my teachers:

Benjamin Bengfort github.com/bbengfort @bbengfort

Allen Leis github.com/looselycoupled @looselycoupled

Faculty at Georgetown School of Continuing Studies

Graduate students and the University of Maryland, College Park

(These are mostly their slides!)

Questions? [email protected]
