Using R for Data Profiling

Want to follow along with this session using R? Download the script and data from the session scheduler. Also download R and RStudio. It’s easy to follow along!

© 2016 RED PILL Analytics

Michelle Kolbe

medium.com/@datacheesehead

@mekolbe

linkedin.com/in/michellekolbe

www.RedPillAnalytics.com

@RedPillA

Do you have a data quality problem?

Yes!

• Gartner estimated “more than 25 percent of critical data within Fortune 1000 enterprises” to be flawed.
• TDWI stated that “data quality problems cost US businesses more than $600 billion a year.”
• Poor data quality leads to failures and delays in many high-profile IT projects.
• Lack of trust in the data results in reduced or discontinued BI usage.

Source: https://datasourceconsulting.com/data-profiling/

What to Check for?

• Accuracy
• Consistency
• Completeness
• Uniqueness
• Distribution
• Range
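
To make the checklist concrete, here is a minimal base R sketch with one check per item. The `customers` data frame, its column names, and the "age between 0 and 120" rule are all invented for illustration; they are not from the session's script or data.

```r
# Hypothetical sample data, invented for illustration only
customers <- data.frame(
  id    = c(1, 2, 2, 4, 5),
  state = c("WI", "wi", "UT", NA, "UT"),
  age   = c(34, 29, 29, 131, NA),
  stringsAsFactors = FALSE
)

# Completeness: number of missing values in each column
missing_counts <- colSums(is.na(customers))

# Uniqueness: id values that appear more than once
duplicate_ids <- unique(customers$id[duplicated(customers$id)])

# Consistency: distinct spellings of a categorical value ("WI" vs "wi")
state_values <- table(customers$state, useNA = "ifany")

# Distribution: quartiles, mean, and NA count for a numeric column
age_summary <- summary(customers$age)

# Range: smallest and largest observed value, ignoring NAs
age_range <- range(customers$age, na.rm = TRUE)

# Accuracy: rows that break an assumed business rule (age between 0 and 120)
suspect_rows <- customers[which(customers$age < 0 | customers$age > 120), ]
```

Each result is stored in its own object so it can be printed or inspected in RStudio. What counts as "accurate" depends on the business rules for your own data.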

Why Profile Your Data?

Benefits

• Trust in data
• Find problems in advance
• Shorten development time on projects
• Improve understanding of data & business knowledge

Why R?

• Free!
• Easy to use
• Flexible
• Powerful analytics
• Great community!

R is flexible because it’s a full programming language: you can work with varied datasets, manipulate data, and run statistical models.

Getting Started in R

What is R?

• A programming environment
• Fairly simple to use & understand
• Allows a user to manipulate & analyze data
• Open source
• Real power comes from the packages you can install, contributed by a LARGE community
• Easy to learn if you have a programming background
• Con: memory management & speed compared with C++ or Python

Tools for R

• First download R from r-project.org
• Then download RStudio, the best R IDE

R Basics

• Case sensitive
• Install a package
 install.packages("<package name>")

• Once installed, load the package
 library("<package name>")

• Every time you open R, you need to load the packages you’ll be using
• RStudio shows which packages are installed and which are loaded
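
The install-then-load routine can be sketched as a short script that installs a package only if it is missing. The package name `tools` is an arbitrary stand-in that ships with R, so the install step is skipped here; swap in any package you need. The `repos` mirror URL is an assumption added so the call also works outside RStudio.

```r
# Stand-in package name; "tools" ships with R, so nothing is downloaded here.
# Replace it with the package you actually need.
pkg <- "tools"

# Install only when the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Load the package for this session; character.only = TRUE tells library()
# to use the value stored in pkg rather than the literal name "pkg"
library(pkg, character.only = TRUE)

# Confirm the package is attached to the current session
is_attached <- paste0("package:", pkg) %in% search()
```

Installation is a one-time step per machine, while `library()` has to run in every new session.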

Connecting to Data in R

• Data should be read into R and stored in an object
• Easiest with CSV
• Datasets can be read from a URL or from a file on a local drive
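
As a minimal sketch of reading a CSV into an object, the example below writes a tiny file to a temporary location and reads it back with base R's `read.csv()`. The file contents, the `sales` object name, and the commented-out URL are placeholders invented for illustration, not the session's data.

```r
# Write a tiny CSV to a temporary file so the example is self-contained;
# the columns and values are invented for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,city,sales",
    "1,Madison,100",
    "2,Denver,250",
    "3,Atlanta,"),
  csv_path
)

# Read the file and store the result in an object (a data frame)
sales <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.csv() also accepts a URL; this address is a placeholder, not a real dataset
# sales <- read.csv("https://example.com/data.csv")

# First look at what was loaded: column types and dimensions
str(sales)
dim(sales)
```

The blank `sales` value in the last row is read as `NA`, which is the kind of gap a completeness check would surface.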


d