Turn Big Data to Small Data - AWS

Turn Big Data to Small Data
Use Qlik to Utilize Distributed Systems and Document Databases
04 October 2014, Stig Magne Henriksen

From Big Data to Small Data

Image: kdnuggets.com

Agenda
• When do we have a Big Data problem?
• How to use Qlik to analyze Big Data by breaking it into small chunks that can be analyzed
• General strategies for handling Big Data
• Discussion of how to handle distributed systems
• Discussion of how Qlik can read data from a MongoDB database
• Discussion of how to use Qlik to read data from Hortonworks
• Summary

When do we have a Big Data problem?
• What happens when the amount of data is so huge that it is not possible to store it in a database, nor in the memory of a single computer?
– Too many bytes (Volume)
• What happens when new data sources arrive so frequently that a solution created a couple of weeks ago is out of date today?
– Too many sources (Variety)

When do we have a Big Data problem? II
• What happens when data from the Internet of Things and mobile apps increases immensely?
– Too high a rate (Velocity)
• What happens when a company has 200 branches, each with its own variation of a nearly identical Excel spreadsheet?
– Non-scalable analysis

Have to Find the Useful Information

Image: thestoragealchemist.com

Why Qlik with Big Data?
• Flexible deployment models
– In memory, with use of ODBC or OLE DB
– Direct Discovery
– Application (Document) Chaining
• Combine Big Data and traditional data sources
– In Memory
– Direct Discovery
– Hybrid

Qlik In-Memory Approach
• Loads compressed data into memory
• Enables associative search and analysis
• Handles hundreds of millions to billions of rows of data
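The in-memory engine keeps data small by storing each distinct field value only once, so low-cardinality columns compress very well. A minimal sketch of that dictionary-encoding idea (the function names and data are my own illustration, not Qlik's actual internals):

```python
def dictionary_encode(values):
    """Store each distinct value once; rows become small integer indexes."""
    symbols = []      # symbol table: one entry per distinct value
    index_of = {}     # value -> position in the symbol table
    encoded = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(symbols)
            symbols.append(v)
        encoded.append(index_of[v])
    return symbols, encoded

def dictionary_decode(symbols, encoded):
    """Rebuild the original column from the symbol table and indexes."""
    return [symbols[i] for i in encoded]

# A low-cardinality column compresses well: 8 rows, only 3 distinct values.
column = ["Oslo", "Bergen", "Oslo", "Oslo", "Trondheim", "Bergen", "Oslo", "Oslo"]
symbols, encoded = dictionary_encode(column)
assert dictionary_decode(symbols, encoded) == column
print(len(symbols), encoded)  # 3 [0, 1, 0, 0, 2, 1, 0, 0]
```

The more repeated the values, the closer the memory cost gets to one small integer per row plus one copy of each distinct value, which is why hundreds of millions of rows can fit in memory.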

Qlik Direct Discovery Approach
• Combines the associative capabilities of the Qlik in-memory dataset with a query model where:
– The aggregated query result is passed back to a Qlik object without being loaded into the Qlik data model
– The result set is still part of the associative experience
– There is the capability to drill to detail records

Diagram: batch load into the Qlik in-memory data model, alongside Direct Discovery queries issued from the Qlik application.
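The split between aggregate results and drill-to-detail can be pictured as two different queries pushed down to the source. A hedged sketch of what such pushed-down SQL might look like (table and column names are invented for illustration, not Qlik-generated SQL):

```python
def aggregate_query(table, dimension, measure):
    """The kind of query Direct Discovery pushes down: aggregates only,
    so the detail rows never enter the in-memory data model."""
    return (f"SELECT {dimension}, SUM({measure}) AS total "
            f"FROM {table} GROUP BY {dimension}")

def detail_query(table, dimension, selected_value):
    """Drill to detail: fetch the underlying rows for one selection."""
    return f"SELECT * FROM {table} WHERE {dimension} = '{selected_value}'"

print(aggregate_query("sales", "region", "amount"))
print(detail_query("sales", "region", "North"))
```

The aggregate query returns a few rows per dimension value, which is cheap to ship back; the detail query is only issued on demand, for the user's current selection.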

Application (Document) Chaining
• Navigate among Qlik applications
• Maintain selections / context
1) The user makes selections in Application 1
2) The user clicks a button to chain to the next application
3) Application 2 opens; the selections are transferred and applied
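One common way to carry selections from one application to the next is to encode them in the link that opens the second application. A sketch of that idea (the URL structure and parameter names here are illustrative only, not the exact QlikView chaining syntax):

```python
from urllib.parse import urlencode

def chain_url(base_url, target_doc, selections):
    """Build a link to the second application that carries the user's
    current selections (parameter names are illustrative only)."""
    params = {"document": target_doc}
    for field, value in selections.items():
        params[f"select_{field}"] = value
    return f"{base_url}?{urlencode(params)}"

url = chain_url("https://server/qlik/open", "detail_app.qvw",
                {"Region": "North", "Year": "2014"})
print(url)
```

On open, the target application would parse these parameters and apply them as field selections, which is what makes the broad-to-deep navigation in the summary feel seamless.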

Distributed Systems: Why Use Them?
• Advantages of distributed computing platforms
– Parallelize I/O to quickly scan large datasets
• Cost efficiency
– Commodity nodes (cheap but unreliable)
– Commodity network (may have low bandwidth)
– Automatic fault tolerance (fewer admins)
– Easier to use (fewer programmers)
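The "parallelize I/O" point is the core of the scan-heavy workloads these platforms serve: split the data into partitions, scan each partition independently, combine the partial results. A small single-machine sketch of that pattern (a real cluster would run each scan on a different node):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows):
    """Each worker scans only its own slice of the data
    (here: count the even numbers in the slice)."""
    return sum(1 for r in rows if r % 2 == 0)

# Split the dataset into partitions, as a distributed file system would.
data = list(range(1000))
partitions = [data[i:i + 250] for i in range(0, 1000, 250)]

# Scan all partitions in parallel, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(scan_partition, partitions))

print(total)  # 500 even numbers in 0..999
```

Because the partial results are simply summed at the end, a failed partition can be rescanned on another node without redoing the whole job, which is where the automatic fault tolerance comes from.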

Two Different Approaches – Hortonworks

Direct Discovery
• Can access data from external sources into Qlik
• Will not load data until it is requested from the app
• Only metadata is loaded up front; data is loaded in real time

ODBC and in memory
• Access to Hortonworks
• Can read documents from the database
• Can read complex objects from a document
• Can read sub-levels of each instance in the collections

Qlik and Hortonworks
• Hundreds of millions of rows into memory: a broad application to discover new trends (aggregates / detail)
• Billions of rows via Direct Discovery: a deep application to confirm and take action

Results from working with Hortonworks
• Easy to set up
• ODBC connects fine; reading data is straightforward
• Can do qualified calls via ODBC (SQL-based calls)
• Direct Discovery works best when used at an aggregated level
• Hive is by definition not suited for interactive loads with many queries, and hence not suited for Direct Discovery
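A "qualified call" here means pushing the filter and aggregation into Hive instead of pulling the whole table into memory. A hedged sketch of such a call (the table, columns, and DSN name are assumptions; pyodbc and a configured Hortonworks Hive ODBC DSN would be prerequisites for the commented-out part):

```python
def hive_query(table, region):
    """A qualified call: push the filter and the aggregation down to Hive
    so only the small result set crosses the ODBC connection."""
    return (f"SELECT region, SUM(amount) AS total FROM {table} "
            f"WHERE region = '{region}' GROUP BY region")

sql = hive_query("sales_fact", "EMEA")
print(sql)

# Against a live cluster the call would look roughly like this (not run here;
# requires pyodbc and a DSN defined for the Hortonworks Hive ODBC driver):
#   import pyodbc
#   conn = pyodbc.connect("DSN=Hortonworks", autocommit=True)
#   rows = conn.cursor().execute(sql).fetchall()
```

Since each such query is a full Hive batch job, this style suits occasional aggregate loads, which is consistent with the observation above that Hive is a poor fit for many small interactive queries.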

MongoDB
• A new programming model: object-oriented programming
• A document database, simple and fast to implement
• No complicated SQL (NoSQL)
• Can be much faster than traditional SQL databases

MongoDB II
• Can spread the database across multiple machines
• Limited multi-record transactional consistency, hence easier to implement across different machines
• Often used in web applications and as a back end for mobile apps

Two Different Approaches – MongoDB

Direct Discovery
• Can access data from external sources into Qlik
• Will not load data until it is requested from the app
• Only metadata is loaded up front; data is loaded in real time

SIMBA ODBC and in memory
• Access to MongoDB
• Can read documents from the database
• Can read complex objects from a document
• Can read sub-levels of each instance in the collections



Qlik and MongoDB
• Direct reads from MongoDB through the SIMBA ODBC driver feed a broad application to discover new trends
• A deep application to confirm and take action

Results from working with MongoDB
• Easy to set up
• ODBC connects fine; reading data is straightforward
• Can do qualified calls via ODBC (SQL-based calls, even though this is a NoSQL database)
• Can read complex documents and read data at different levels
• Data retrieval is fast
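Reading "sub-levels of each instance in the collections" amounts to flattening a nested document into the flat rows a tool like Qlik expects. A sketch of that step on a sample document (the document shape is invented; in practice the document would come from a pymongo query, omitted here):

```python
def flatten(doc, prefix=""):
    """Turn a nested document into one flat dict of dotted keys,
    the row-like shape a relational tool such as Qlik can load."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

order = {"_id": 7,
         "customer": {"name": "Acme", "address": {"city": "Oslo"}},
         "total": 99.5}
print(flatten(order))
# {'_id': 7, 'customer.name': 'Acme', 'customer.address.city': 'Oslo', 'total': 99.5}
```

The dotted keys mirror MongoDB's own notation for nested fields, so each sub-level of a document becomes an ordinary column in the loaded table.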

Summary
• Qlik is well suited for tapping into a document database such as MongoDB, reading its data, and integrating it into already existing analyses
• It is recommended to use different strategies according to your needs:
– Direct Discovery when reading aggregated data
– ODBC to read data at a more detailed level
– Application chaining to swap between different levels of data
• The ODBC and in-memory approach works best with Hortonworks; Hive is too slow for interactive access
• Big Data needs strong visualization tools, and in this context Qlik is well suited for the task

Small or Big Data - Result

Image: beautifulinsanity.com

Thank You