Analytics in the Cloud
Peter Sirota, GM, Amazon Elastic MapReduce
Data-Driven Decision Making
Data is the new raw material for any business, on par with capital, people, and labor.
What is Big Data?
Terabytes of semi-structured log data over which businesses want to:
find correlations / perform pattern matching
generate recommendations
calculate advanced statistics (e.g., TP99)
Twitter “Firehose”
50 million tweets per day
1,400% growth per year
How can advertisers drink from it?
Social graphs Value increases with exponential growth in data connections
Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)? Today’s Data Warehouses
Need to consolidate data from multiple sources in multiple formats across multiple businesses
Unconstrained growth of this business-critical information
Today’s Users
Expect faster response times on fresher data
Sampling is not good enough, and history is important
Demand inexpensive experimentation with new data
Become increasingly sophisticated data scientists
Current systems don’t scale (and weren’t meant to)
Long lead time to provision more infrastructure
Specialized DB expertise required
Expensive and inelastic solutions
We need tools built specifically for Big Data!
What is this thing called Hadoop?
Dealing with Big Data requires two things:
Distributed, scalable storage
Inexpensive, flexible analytics
Apache Hadoop is an open source software platform that addresses both of these needs
Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets
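The MapReduce technique can be sketched in a few lines of plain Python. This is a toy word count over log lines; Hadoop runs the same two phases distributed across HDFS blocks, and the function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Toy input standing in for terabytes of semi-structured logs
logs = ["error disk full", "error network timeout", "info disk ok"]
counts = reduce_phase(map_phase(logs))
```

Because the map phase is stateless per line and the reduce phase only needs all pairs for a given key, Hadoop can run both across many commodity servers in parallel.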
Key benefits
Affordable – cost per TB is a fraction of traditional options
Proven at scale – numerous petabyte implementations in production; linear scalability
Flexible – data can be stored with or without a schema
RDBMS vs. MapReduce/Hadoop
RDBMS:
Predefined schema
Strategic data placement for query tuning
Exploits indexes for fast retrieval
SQL only
Doesn’t scale linearly
MapReduce/Hadoop:
No schema required
Random data placement
Fast scan of the entire dataset
Uniform query performance
Scales linearly for reads and writes
Supports many languages, including SQL
Complementary technologies
Why Amazon Elastic MapReduce?
Managed Apache Hadoop web service
Monitors thousands of clusters per day
Use cases span from university students to the Fortune 50
Reduces the complexity of Hadoop management
Handles node provisioning, customization, and shutdown
Tunes Hadoop to your hardware and network
Provides tools to debug and monitor your Hadoop clusters
Provides tight integration with AWS services
Improved performance working with S3
Automatic re-provisioning on node failure
Dynamic expansion and shrinking of cluster size
Spot Instance integration
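Launching a managed cluster reduces to one API call. A minimal sketch using the modern boto3 SDK's `run_job_flow` call is below; the bucket name, cluster name, and instance counts are placeholders, and real usage needs AWS credentials configured.

```python
def build_job_flow_request(name, master_type="m1.large",
                           core_type="m1.large", core_count=4):
    """Build the parameter dict for emr_client.run_job_flow().

    EMR handles provisioning, tuning, and shutdown of these nodes;
    the caller only declares the desired shape of the cluster.
    """
    return {
        "Name": name,
        "LogUri": "s3://LOG_BUCKET/emr-logs/",  # placeholder bucket
        "Instances": {
            "MasterInstanceType": master_type,
            "SlaveInstanceType": core_type,
            "InstanceCount": 1 + core_count,  # 1 master + core nodes
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down when done
            "TerminationProtected": False,
        },
    }

request = build_job_flow_request("nightly-analytics")
# In a real script: boto3.client("emr").run_job_flow(**request)
```

The request is plain data, so it can be built, inspected, and versioned separately from the call that launches the cluster.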
Analytics Use Cases
Targeted advertising / clickstream analysis
Data warehousing applications
Bioinformatics (genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (e.g., resizing JPEGs)
Web indexing
Data mining and BI
Apache Hive: Data Warehouse for Hadoop
Open source project started at Facebook
Turns data on Hadoop into a virtually limitless data warehouse
Provides data summarization, ad hoc querying, and analysis
Enables SQL-like queries on structured and unstructured data, e.g., arbitrary field separators such as “,” in CSV file formats
Inherits linear scalability of Hadoop
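What Hive does with a query like `SELECT page, COUNT(*) FROM clicks GROUP BY page` can be sketched in plain Python. The column names and data are invented for illustration; Hive would compile the equivalent query into a MapReduce job over files in HDFS, with the field separator (here ",") configurable per table.

```python
import csv
import io
from collections import defaultdict

# Toy comma-delimited clickstream: one "user,page" record per line
raw = "user1,home\nuser2,checkout\nuser3,home\n"

# Stand-in for: SELECT page, COUNT(*) FROM clicks GROUP BY page
counts = defaultdict(int)
for user, page in csv.reader(io.StringIO(raw)):
    counts[page] += 1
```

Because the grouping is expressed declaratively in HiveQL, the same query keeps working as the data grows from one file to petabytes, inheriting Hadoop's linear scalability.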
AWS Data Warehousing Architecture
Elastic Data Warehouse
Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
Reduce costs by increasing server utilization
Improve performance during high-usage periods
Data Warehouse (Batch Processing)
Data Warehouse (Steady State)
Expand to 25 instances for batch processing
Shrink back to 9 instances at steady state
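The expand/shrink cycle maps to a single resize call per transition. A sketch using boto3's `modify_instance_groups` parameters is below; the instance group ID is a placeholder, and real usage also needs the cluster's ID and credentials.

```python
def resize_request(instance_group_id, target_count):
    """Build the parameter dict for emr_client.modify_instance_groups().

    EMR adds or removes core nodes until the group reaches the
    requested count; running jobs keep going during the resize.
    """
    return {
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,  # placeholder ID
                "InstanceCount": target_count,
            }
        ]
    }

expand = resize_request("ig-CORE", 25)  # batch window: grow to 25
shrink = resize_request("ig-CORE", 9)   # steady state: back to 9
# In a real script: boto3.client("emr").modify_instance_groups(**expand)
```

Scheduling these two calls (e.g., from cron) is enough to run a warehouse that is large overnight and cheap during the day.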
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1 Job Flow
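The savings from mixing capacity types is simple arithmetic. The prices below are invented for illustration (Spot prices fluctuate with demand); the point is that keeping an On-Demand core protects the job flow from interruption while the Spot nodes cut the blended rate.

```python
# Assumed hourly prices, for illustration only
ON_DEMAND_PRICE = 0.34  # $/instance-hour (assumed)
SPOT_PRICE = 0.12       # $/instance-hour (assumed)

def hourly_cost(on_demand_nodes, spot_nodes):
    """Blended hourly cost of a mixed cluster."""
    return on_demand_nodes * ON_DEMAND_PRICE + spot_nodes * SPOT_PRICE

# 10 nodes all On-Demand vs. a 4 On-Demand + 6 Spot mix: if the Spot
# nodes are reclaimed, the job slows down but never loses its core.
all_od = hourly_cost(10, 0)
mixed = hourly_cost(4, 6)
savings = 1 - mixed / all_od
```

Under these assumed prices the mixed cluster runs the same 10 nodes for roughly a third less per hour.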