What is Big Data?

Analytics in the Cloud
Peter Sirota, GM, Amazon Elastic MapReduce

Data-Driven Decision Making

Data is the new raw material for any business, on par with capital, people, and labor.

What is Big Data? Terabytes of semi-structured log data in which businesses want to:
• Find correlations / perform pattern matching
• Generate recommendations
• Calculate advanced statistics (e.g., TP99, the 99th-percentile value of a metric such as request latency)
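TP99 is simply the 99th percentile of a metric. A minimal sketch of computing it over a batch of latencies (the nearest-rank method and the sample data are illustrative; at Big Data scale this would be done with streaming or approximate percentile algorithms):

```python
import math

def tp99(values):
    """Return the 99th-percentile (TP99) of a non-empty list of values.

    Nearest-rank method: sort, then take the value at rank
    ceil(0.99 * n), i.e. index ceil(0.99 * n) - 1.
    """
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]

# Example: 100 request latencies in milliseconds (1, 2, ..., 100).
latencies = list(range(1, 101))
print(tp99(latencies))  # nearest-rank TP99 of 1..100 -> 99
```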

Twitter “Firehose”
• 50 million tweets per day
• 1,400% growth per year
• How can advertisers drink from it?

Social graphs
• Value increases with the exponential growth in data connections

Big Data is full of valuable, unanswered questions!

Why is Big Data Hard (and Getting Harder)?

Today’s Data Warehouses
• Need to consolidate data from multiple sources in multiple formats across multiple businesses
• Unconstrained growth of this business-critical information

Today’s Users
• Expect faster response times on fresher data
• Sampling is not good enough, and history is important
• Demand inexpensive experimentation with new data
• Are becoming increasingly sophisticated Data Scientists

Current systems don’t scale (and weren’t meant to)
• Long time to provision more infrastructure
• Specialized DB expertise required
• Expensive and inelastic solutions

We need tools built specifically for Big Data!

What is this thing called Hadoop?
Dealing with Big Data requires two things:
• Distributed, scalable storage
• Inexpensive, flexible analytics

Apache Hadoop is an open source software platform that addresses both of these needs:
• Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
• Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets
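MapReduce splits a computation into a map phase (emit key/value pairs from each input record), a shuffle (group values by key), and a reduce phase (combine each group). A single-process Python sketch of the model, using the classic word-count example; Hadoop runs these same phases in parallel across a cluster:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the grouped values for one key.
    return (key, sum(values))

lines = ["big data is big", "hadoop processes big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["big"])  # "big" appears 3 times across the two lines
```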

Key benefits
• Affordable: cost per TB is a fraction of traditional options
• Proven at scale: numerous petabyte implementations in production; linear scalability
• Flexible: data can be stored with or without a schema

RDBMS vs. MapReduce/Hadoop

RDBMS
• Predefined schema
• Strategic data placement for query tuning
• Exploits indexes for fast retrieval
• SQL only
• Doesn’t scale linearly

MapReduce/Hadoop
• No schema required
• Random data placement
• Fast scan of the entire dataset
• Uniform query performance
• Scales linearly for reads and writes
• Supports many languages, including SQL

Complementary technologies

Why Amazon Elastic MapReduce?
Managed Apache Hadoop web service
• Monitors thousands of clusters per day
• Use cases range from university students to the Fortune 50

Reduces the complexity of Hadoop management
• Handles node provisioning, customization, and shutdown
• Tunes Hadoop to your hardware and network
• Provides tools to debug and monitor your Hadoop clusters

Provides tight integration with AWS services
• Improved performance working with S3
• Automatic re-provisioning on node failure
• Dynamic expanding/shrinking of cluster size
• Spot integration

Elastic MapReduce Key Features

Simplified Cluster Configuration/Management
• Resize running job flows
• Support for EIP/IAM/tagging
• Workload-specific configurations
• Bootstrap Actions

Enhanced Monitoring/Debugging
• Free CloudWatch metrics/alarms
• Hadoop metrics in the Console
• Ganglia support

Improved Performance
• S3 Multipart Upload
• Cluster Compute instances

Analytics Use Cases
• Targeted advertising / clickstream analysis
• Data warehousing applications
• Bioinformatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (e.g., resizing JPEGs)
• Web indexing
• Data mining and BI

Apache Hive: Data Warehouse for Hadoop
• Open source project started at Facebook
• Turns data on Hadoop into a virtually limitless data warehouse
• Provides data summarization, ad hoc querying, and analysis
• Enables SQL-like queries on structured and unstructured data; arbitrary field separators are possible, such as “,” in CSV file formats
• Inherits the linear scalability of Hadoop
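To illustrate the kind of query Hive makes easy, here is a plain-Python equivalent of a hypothetical HiveQL aggregation (SELECT page, COUNT(*) FROM logs GROUP BY page) over comma-separated log lines; Hive would compile the same logic to MapReduce jobs over data in HDFS. The log layout and field order are assumptions for the example, not from the talk:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical clickstream log: timestamp,user,page.
# (In Hive the separator is declared with
#  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.)
raw = """2011-06-01T10:00,alice,/home
2011-06-01T10:01,bob,/products
2011-06-01T10:02,alice,/products
"""

hits = Counter()
for row in csv.reader(StringIO(raw)):
    timestamp, user, page = row
    hits[page] += 1          # GROUP BY page ... COUNT(*)

print(hits["/products"])     # -> 2
```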

AWS Data Warehousing Architecture

Elastic Data Warehouse
• Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
• Reduce costs by increasing server utilization
• Improve performance during high-usage periods

(Diagram: a steady-state Data Warehouse cluster expands to 25 instances for overnight batch processing, then shrinks back to 9 instances.)
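The utilization saving from resizing can be sketched with the 25/9 instance counts above and a hypothetical schedule and price (8 overnight batch hours, 16 steady-state hours, and $0.50/instance-hour are assumptions for illustration, not figures from the talk):

```python
RATE = 0.50                          # assumed $/instance-hour
BATCH_HOURS, BATCH_NODES = 8, 25     # overnight batch window
STEADY_HOURS, STEADY_NODES = 16, 9   # daytime steady state

# Fixed cluster must stay sized for the batch peak all day.
fixed = 24 * BATCH_NODES * RATE

# Elastic cluster pays for each period at its actual size.
elastic = (BATCH_HOURS * BATCH_NODES + STEADY_HOURS * STEADY_NODES) * RATE

print(fixed, elastic)  # 300.0 vs 172.0 per day under these assumptions
```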

Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1 job flow: On-Demand only
• 4 instances * 14 hrs * $0.50 = $28
• Duration: 14 hours

Scenario #2 job flow: On-Demand plus Spot
• 4 On-Demand instances * 7 hrs * $0.50 = $14
• 5 Spot instances * 7 hrs * $0.25 = $8.75
• Total = $22.75
• Duration: 7 hours

Time savings: 50%. Cost savings: ~19%.
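Recomputing the two scenarios with the per-instance-hour prices above:

```python
ON_DEMAND, SPOT = 0.50, 0.25   # $/instance-hour

# Scenario 1: 4 On-Demand instances for 14 hours.
cost1 = 4 * 14 * ON_DEMAND

# Scenario 2: add 5 Spot instances, halving the runtime to 7 hours.
cost2 = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT

print(cost1, cost2)          # 28.0 22.75
print(1 - cost2 / cost1)     # 0.1875 -> ~19% cheaper, and 2x faster
```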

Other EMR + Spot Use Cases
• Run the entire cluster on Spot for the biggest cost savings
• Reduce the cost of application testing

Monitoring Clusters with CloudWatch
Free CloudWatch metrics and alarms
• Track Hadoop job progress
• Alarm on degradations in cluster health
• Monitor aggregate Elastic MapReduce usage

Big Data Ecosystem And Tools We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:

Business Intelligence: MicroStrategy, Pentaho

Analytics: Datameer, Karmasphere, Quest

Open source: Ganglia, SQuirreL SQL

Resources
• Amazon Elastic MapReduce: aws.amazon.com/elasticmapreduce
• aws.amazon.com/articles/Elastic-MapReduce
• forums.aws.amazon.com/forum.jspa?forumID=52