Intro to Data Analysis


Jeff Chen, Deputy Chief Data Officer

before class > Review Intro to R:

https://dataacademy.commerce.gov/previous-courses.html

> Install R Studio: https://www.rstudio.com/

> Download the dataset: https://s3.amazonaws.com/cda-class-data/intro-to-data-analysis/data.Rda


before class > In R, install the following packages:

install.packages(c("plyr", "hexbin", "corrplot"))


roadmap
• Data analysis as a workflow
• Practical example


Why data analysis? Data analysis sheds light on patterns that can inform decision making and advanced research.


Why data analysis? Find which services are influenced by weather events

Trace the path of an illness outbreak

Identify correlates of economic activity

Identify factors that influence lawsuit outcomes (win or loss)

Characterize the distribution of income and poverty

Flag anomalies in data records


verbs of data analysis investigate ask design collect clean explore reshape analyze communicate refine recommend

A data analysis workflow


• Investigate a field
• Ask research questions
• Design an analytical strategy to answer the question
• Collect data per the requirements


• Data quality checks: outliers, missingness, sample size, data classes and formats, erroneous records, signal profile
• Standardize records: clean text fields, remove missing records
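A few of these checks can be sketched in base R. This is a minimal illustration on a small made-up data frame; `toy` and its columns are hypothetical and are not part of the course dataset, and the 1.5 × IQR rule is just one common way to flag outliers.

```r
# Hypothetical toy records; the column names are illustrative only
toy <- data.frame(
  agency   = c("DOT", "dot ", "NYPD", NA, "HPD"),
  duration = c(12, 15, 11, 14, 480),
  stringsAsFactors = FALSE
)

# Sample size and data classes
n_records   <- nrow(toy)
col_classes <- sapply(toy, class)

# Missingness: number of NA values per column
n_missing <- colSums(is.na(toy))

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q     <- unname(quantile(toy$duration, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- toy$duration < q[1] - fence | toy$duration > q[2] + fence

# Standardize records: drop rows with missing values,
# then trim whitespace and upper-case the text field
clean <- toy[complete.cases(toy), ]
clean$agency <- toupper(trimws(clean$agency))
```

Flagging an outlier is not the same as removing it: whether the 480 here is an error or a real long-running request is a judgment that needs knowledge of the data.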


• Manipulate data into analyzable form: remove missing records, standardize text fields, convert dates from text to date records
• Assess high-level patterns: roll up data from transactional records to aggregate records, correlation tests, cross-tabs
• Hypothesis testing: t-tests, chi-square tests, Kolmogorov-Smirnov test
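The three hypothesis tests named above are all available in base R's stats package. The sketch below runs each one on simulated data; the variable names and the numbers are invented for illustration and say nothing about the 311 data.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical daily call counts for two groups of days
weekday <- rnorm(50, mean = 100, sd = 10)
weekend <- rnorm(50, mean = 80,  sd = 10)

# t-test: do the two groups differ in mean?
tt <- t.test(weekday, weekend)

# Kolmogorov-Smirnov test: do the two samples share a distribution?
ks <- ks.test(weekday, weekend)

# Chi-square test on a cross-tab of two made-up categorical fields
counts <- matrix(c(30, 20, 15, 35), nrow = 2,
                 dimnames = list(channel  = c("phone", "web"),
                                 resolved = c("yes", "no")))
chi <- chisq.test(counts)

# Each result is a list; the p.value element holds the test's p-value
c(t = tt$p.value, ks = ks$p.value, chi = chi$p.value)
```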


• Estimate relationships
• Identify patterns
• Predict values
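As one possible illustration of these three verbs, the sketch below fits a simple linear model to simulated data. The relationship between temperature and call volume is invented for the example, and a linear model is only one of many ways to estimate a relationship.

```r
set.seed(1)  # make the simulated noise reproducible

# Hypothetical data: call volume rising with temperature, plus noise
temp  <- seq(40, 90, by = 5)
calls <- 200 + 3 * temp + rnorm(length(temp), sd = 5)
toy   <- data.frame(temp, calls)

# Identify patterns: linear correlation between the two variables
r <- cor(toy$temp, toy$calls)

# Estimate relationships: ordinary least squares fit
fit <- lm(calls ~ temp, data = toy)
coef(fit)

# Predict values: expected calls at a temperature not in the data
pred <- predict(fit, newdata = data.frame(temp = 72))
```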


• Convert analytical findings into verbal insight and visual analyses
• Solicit feedback from stakeholders, gauge support
• Refine data analysis in a v2 (almost always happens)
• Provide recommendations based on final product


verbs of data analysis investigate ask design collect clean explore reshape analyze communicate refine recommend

Today’s Focus


roadmap
• Data analysis as a workflow
• Practical example


a scenario You're a data analyst at a city government call center. In a given year, your office needs to be able to staff for 2.5 million calls. The head of the agency asks for help in understanding ways to get ahead of call volumes. After shadowing call takers and speaking with various managers, you found two key issues: shift changes need to be more seamless and anticipating volume could help with staffing.


the data NYC City Government 311 Service Requests

• Transactional records of public requests for help collected from a large call center
• These are ‘Tier 2’ service requests: a fraction of all calls, but the ones that require boots on the ground
• Semi-cleaned dataset


questions
(A) What’s in the dataset? Is the dataset in good shape?
(B) Time data: When are the peaks?
(C) Time data: What are two good times to place the shift switch?
(D) Correlation: Are there agency types that call takers can be cross-trained in?
(E) Spatial data: Where are the service request (SR) hotspots in the city?


Load the data

setwd("[path-to-the-directory-with-data.Rda]")
load("data.Rda")
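One thing worth knowing about `load()`: it restores objects under the names they were saved with rather than returning the data itself. The sketch below shows this with a temporary file and a stand-in data frame, both hypothetical; the course file is not touched. It assumes, as the later `dim(data)` call suggests, that the saved object is named `data`.

```r
# Stand-in for the course data; the real data.Rda is not used here
data <- data.frame(id = 1:3)
tmp  <- tempfile(fileext = ".Rda")
save(data, file = tmp)
rm(data)

# load() puts the saved objects back into the workspace and
# invisibly returns their names as a character vector
restored <- load(tmp)
print(restored)

# The object is available again under its original name
is.data.frame(data)
```

Capturing the return value of `load()` is a quick way to find out what an unfamiliar .Rda file contains.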


(A) What’s in the dataset?

# What are the dimensions of the data?
dim(data)

# What’s in the data?
colnames(data)
head(data)

# What form are the variables in?
str(data)
summary(data)



(B) When are the peaks?

# Notice the format of date-time data:
# what needs to happen to convert raw time records into
# usable information?
head(data$date_time, 20)
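As a rough sketch of what such a conversion can look like, the example below parses timestamps stored as text and pulls out components to roll up on. The strings and their format are assumptions made for illustration; the actual format of `date_time` in the course data may differ, and the column may already be stored as a date-time class.

```r
# Hypothetical raw timestamps stored as text; the real column may differ
raw <- c("2015-01-15 08:30:00", "2015-07-04 17:05:00", "2015-12-31 23:59:00")

# Convert text to a date-time class; tz is fixed so results are reproducible
stamp <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract components that can serve as roll-up keys
month <- format(stamp, "%m")              # two-digit month as text
hour  <- as.integer(format(stamp, "%H"))  # hour of day, 0 to 23
wday  <- as.POSIXlt(stamp)$wday           # 0 = Sunday ... 6 = Saturday
```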


(B) When are the peaks? (Process)

• Find the time variable, check format
• Convert into appropriate time format
• Aggregate or roll up by time value
• Check and adjust erroneous or anomalous reporting
• Roll up cleaned data
• Plot result
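The process above can be sketched end to end on a handful of made-up records. This version uses base R's `aggregate()` rather than plyr so that it runs without extra packages; the timestamps and the `toy` data frame are hypothetical, and the cleaning step is skipped because the toy data has no anomalies.

```r
# Hypothetical timestamps standing in for the date_time column
raw <- c("2015-03-02 09:15:00", "2015-03-02 09:40:00", "2015-03-02 14:05:00",
         "2015-03-03 09:55:00", "2015-03-03 14:30:00", "2015-03-03 09:10:00")
toy <- data.frame(date_time = as.POSIXct(raw, tz = "UTC"))

# Convert to the time unit of interest (hour of day)
toy$hour <- as.integer(format(toy$date_time, "%H"))

# Roll up: give each record a count of 1, then sum the counts per hour
toy$n   <- 1
by_hour <- aggregate(n ~ hour, data = toy, FUN = sum)

# Find the peak and plot the result
peak_hour <- by_hour$hour[which.max(by_hour$n)]
barplot(by_hour$n, names.arg = by_hour$hour,
        xlab = "Hour of day", ylab = "Requests")
```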


(B) When are the peaks? (Process)

Before we start this exercise, we’ll need to load plyr first.

library(plyr)


(B) When are the peaks? (code)

When are the peak service months? data$month = format(data$date_time, "%m") month