Intro to Data Analysis
Jeff Chen, Deputy Chief Data Officer
before class
> Review Intro to R: https://dataacademy.commerce.gov/previous-courses.html
> Install RStudio: https://www.rstudio.com/
> Download the dataset: https://s3.amazonaws.com/cda-class-data/intro-to-data-analysis/data.Rda
> In R, install the following packages:
install.packages(c("plyr", "hexbin", "corrplot"))
roadmap
• Data analysis as a workflow
• Practical example
Why data analysis?
Data analysis sheds light on patterns that can inform decision making and advanced research.
Find how services are influenced by weather events
Trace the path of an illness outbreak
Identify correlates of economic activity
Identify factors that influence lawsuit outcomes (win or loss)
Characterize the distribution of income and poverty
Flag anomalies in data records
verbs of data analysis investigate ask design collect clean explore reshape analyze communicate refine recommend
A data analysis workflow
• investigate a field
• ask research questions
• design analytical strategy to answer the question
• collect data per the requirements
• Data quality checks: outliers, missingness, sample size, data classes and formats, erroneous records, signal profile
• Standardize records: clean text fields, remove missing records
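A few of the checks above can be sketched in base R. This is only an illustration under stated assumptions: the `records` data frame, its column names, and its values are hypothetical stand-ins, and the 1.5 × IQR rule is one common outlier convention rather than the method prescribed by the course.

```r
# Hypothetical toy records standing in for a transactional dataset
records <- data.frame(
  agency   = c("NYPD", "nypd ", "DOT", NA, "DOT", "HPD"),
  duration = c(12, 15, 14, NA, 13, 480),  # minutes; 480 is a planted outlier
  stringsAsFactors = FALSE
)

# Missingness: number of NA values in each column
missing_by_col <- colSums(is.na(records))

# Data classes and formats
str(records)

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q   <- quantile(records$duration, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
is_outlier <- !is.na(records$duration) &
  (records$duration < q[1] - 1.5 * iqr | records$duration > q[2] + 1.5 * iqr)

# Standardize text fields: trim whitespace and unify case
records$agency <- toupper(trimws(records$agency))

# Remove records with missing values
clean <- records[complete.cases(records), ]
```

Whether a flagged outlier should be dropped, corrected, or kept is a judgment call that depends on the question being asked.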
• Manipulate data into analyzable form: remove missing records, standardize text fields, convert dates from text to date records
• Assess high-level patterns: roll-up data from transactional records to aggregate records, correlation tests, cross-tabs
• Hypothesis testing: t-tests, chi-square tests, Kolmogorov-Smirnov test
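The three bullets above might look like this in base R. The `requests` data frame and every value in it are hypothetical, invented purely for illustration. Only a t-test is shown; `chisq.test()` and `ks.test()` are the base R counterparts for the other two tests named above.

```r
# Hypothetical transactional records (one row per request)
requests <- data.frame(
  opened  = c("2015-01-03", "2015-01-03", "2015-01-17", "2015-02-01",
              "2015-02-14", "2015-02-14", "2015-02-20", "2015-03-08"),
  agency  = c("DOT", "HPD", "DOT", "HPD", "HPD", "DOT", "HPD", "DOT"),
  minutes = c(10, 22, 12, 25, 21, 11, 27, 9),
  stringsAsFactors = FALSE
)

# Convert dates from text to Date records, then extract the month
requests$opened <- as.Date(requests$opened, format = "%Y-%m-%d")
requests$month  <- format(requests$opened, "%m")

# Roll up transactional records to aggregate records: requests per month
monthly <- aggregate(minutes ~ month, data = requests, FUN = length)
names(monthly)[2] <- "n_requests"

# Cross-tab: agency by month
xtab <- table(requests$agency, requests$month)

# Hypothesis test: do handling times differ between the two agencies?
tt <- t.test(minutes ~ agency, data = requests)
tt$p.value
```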
• Estimate relationships
• Identify patterns
• Predict values
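As a rough illustration of these three activities, the sketch below fits a simple linear model in base R. The `daily` data frame and its numbers are made up for the example, and a linear model is just one assumed choice among many possible techniques.

```r
# Hypothetical daily data: temperature and call volume (made-up numbers)
daily <- data.frame(
  temp  = c(30, 35, 40, 50, 60, 70, 80, 90),
  calls = c(520, 500, 470, 430, 400, 380, 390, 420)
)

# Identify patterns: linear correlation between the two variables
r <- cor(daily$temp, daily$calls)

# Estimate relationships: ordinary least squares fit
fit <- lm(calls ~ temp, data = daily)
summary(fit)

# Predict values: expected call volume at a new temperature
pred <- predict(fit, newdata = data.frame(temp = 55))
```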
• Convert analytical findings into verbal insights and visual analyses
• Solicit feedback from stakeholders, gauge support
• Refine the data analysis in a v2 (almost always happens)
• Provide recommendations based on the final product
verbs of data analysis investigate ask design collect clean explore reshape analyze communicate refine recommend
Today’s Focus
a scenario
You're a data analyst at a city government call center. In a given year, your office needs to be able to staff for 2.5 million calls. The head of the agency asks for help in understanding ways to get ahead of call volumes. After shadowing call takers and speaking with various managers, you found two key issues: shift changes need to be more seamless and anticipating volume could help with staffing.
the data
NYC Government 311 Service Requests
• Transactional records of public requests for help collected from a large call center
• These are ‘Tier 2’ service requests: a fraction of all calls, but the ones that require boots on the ground
• Semi-cleaned dataset
questions
(A) What’s in the dataset? Is the dataset in good shape?
(B) Time data: When are the peaks?
(C) Time data: What are two good times to place the shift switch?
(D) Correlation: Are there agency types that call takers can be cross trained in?
(E) Spatial data: Where are the SR hotspots in the city?
Load the data
setwd("[path-to-the-directory-with-data.Rda]")
load("data.Rda")
(A) What’s in the dataset?
# What are the dimensions of the data?
dim(data)
# What’s in the data?
colnames(data)
head(data)
# What form are the variables in?
str(data)
summary(data)
(B) When are the peaks?
# Notice the format of date-time data:
# what needs to happen to convert raw time records into
# usable information?
head(data$date_time, 20)
(B) When are the peaks? (Process)
• Find the time variable, check format
• Convert into appropriate time format
• Aggregate or roll up by time value
• Check and adjust erroneous or anomalous reporting
• Roll up cleaned data
• Plot result
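The process above can be sketched end to end on a tiny example. Everything here is an assumption made for illustration: the `calls` data frame, its timestamps, and the "not recorded" entry are invented, and base R's `table()` stands in for whichever roll-up function the course uses.

```r
# Hypothetical raw timestamps stored as text; one entry is deliberately bad
raw <- c("2015-01-05 08:15:00", "2015-01-19 17:40:00", "2015-02-02 09:05:00",
         "2015-02-10 12:30:00", "2015-02-27 23:50:00", "2015-03-14 07:00:00",
         "not recorded")
calls <- data.frame(date_time = raw, stringsAsFactors = FALSE)

# Check the format, then convert text into a date-time class
head(calls$date_time)
calls$date_time <- as.POSIXct(calls$date_time,
                              format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Check and adjust erroneous reporting: drop records that failed to parse
calls <- calls[!is.na(calls$date_time), , drop = FALSE]

# Roll up the cleaned records by month
calls$month <- format(calls$date_time, "%m")
by_month <- as.data.frame(table(month = calls$month), responseName = "freq")

# Plot the result
barplot(by_month$freq, names.arg = by_month$month,
        xlab = "Month", ylab = "Requests")

# The peak month is the one with the highest count
peak <- as.character(by_month$month[which.max(by_month$freq)])
```

The same pattern applies to other time units: swapping `"%m"` for `"%H"` in `format()` would roll the records up by hour of day instead.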
(B) When are the peaks? (Process)
Before we start this exercise, we’ll need to first load in plyr.
library(plyr)
(B) When are the peaks? (code)
When are the peak service months?
data$month = format(data$date_time, "%m")
month