There and back again

Comment

Report 0 Downloads 188 Views

There and back again. or how to connect Oracle and Big Data.

SESSION ID GLEB OTOCHKIN

February 20-22, 2018 | Westin Westminster Hotel | Westminster, CO

www.rmoug.org

Gleb Otochkin Principal Consultant Started to work with data in 1992 At Pythian since 2008 Area of expertise: ● Data Integration ● Oracle RAC ● Oracle engineered systems ● Virtualization ● Performance tuning ● Big Data [email protected] @sky_vst © The Pythian Group Inc., 2018

2

ABOUT PYTHIAN Pythian’s 400+ IT professionals help companies adopt and manage disruptive technologies to better compete

© The Pythian Group Inc., 2018

3

EXPERIENCED

GLOBAL

EXPERTS

11,800

400

2

Systems currently managed by Pythian

Pythian experts in 35 countries

Millennia of experience gathered and shared over 19 years

© The Pythian Group Inc., 2018

4

Journey of our data begins here.

© The Pythian Group Inc., 2018

5

AGENDA • • • • •

What do we call by name “Big Data”? Business cases. Real time replication from Oracle to Big Data. How to use Big Data in Oracle. QA

© The Pythian Group Inc., 2018

6

Big Data?

© The Pythian Group Inc., 2018

7

CLICK TO ADD TITLE Big Data? Big Data ecosystem

Here in Texas we call it just Data.

© The Pythian Group Inc., 2018

CLICK TO ADD TITLE

Big Data ecosystem SUBTITLE

BIG COMPLEX

VOLUME ANY STRUCTURE

ANALYSIS

GROWS

PROCESSING DISCOVERY

© The Pythian Group Inc., 2018

Some Big Data tools and terms. Some terms and tools used in Big DATA:

• • • • • • • • •

HDFS - Hadoop Distributed File System. Apache Hadoop - framework for distributed storage and processing. HBase - non-relational, distributed database. Kafka - streaming processing data platform. Flume - streaming, aggregation data framework. Cassandra - open-source distributed NoSQL database. Mongodb - open-source cross-platform document-oriented database. Hive - query and analysis data framework on top of Hadoop. Avro - data format widely used in BD (JSON+binary) © The Pythian Group Inc., 2018

10

Business cases

© The Pythian Group Inc., 2018

11

Why do we need replication? Business cases.

Business cases. Operation activity on DB with rest on a Data Lake. OLTP

OLTP

RDBMS

RDBMS

BD platform

Data Preparation Engine

BI & Analysis

© The Pythian Group Inc., 2018

13

Business cases. Operation activity on DB with Data in Big Data and BI on a RDBMS. OLTP

RDBMS

BD platform

OLTP

RDBMS

RDBMS

BI & Analysis

Data Preparation Engine © The Pythian Group Inc., 2018

14

Business cases. Operation and BI activity on DB with main Data body in a Big Data platform. OLTP

OLTP

RDBMS

RDBMS

BI & Analysis

ODI & BD SQL

BD platform

© The Pythian Group Inc., 2018

15

Business cases. Operation and BI activity on DB with main Data body in a Big Data platform. OLTP

OLTP

RDBMS

RDBMS

RDBMS

Kafka

Kafka

BI & Analysis

Kafka

BD platforms

© The Pythian Group Inc., 2018

16

Business cases. Operation and BI activity on DB with main Data body in a Big Data platform. OLTP

OLTP

RDBMS

Stream processing

RDBMS

Kafka

Kafka

BI & Analysis

Kafka

BD platforms

© The Pythian Group Inc., 2018

17

Business cases. Short summary

•

Reasons:

• • • •

Growing Data Volume. Data retention policy. Cost. New options and API.

• How we connect different platforms: •Data replication tools. ▪Oracle GoldenGate. ▪DBVisit. ▪Shareplex. ▪ELT tools

•Batch load tools. ▪Sqoop. ▪Oracle loader for Hadoop

•Data Integration tools. ▪Oracle Data Integrator.

•Presenting Big Data to RDBMS: ▪Oracle Big Data SQL. ▪Gluent. © The Pythian Group Inc., 2018

18

Oracle to big data. Real time replication OLTP data to a Big Data platform.

Replication to Big Data. Oracle Goldengate:

Why Goldengate?

• Proc. •Real time streaming. •No or minimal impact to the source. •Supports most of the data types. •Different formats and conversion. •Enterprise support

• Cons. •Licensing fee. •Closed source code.

© The Pythian Group Inc., 2018

20

Replication to Big Data. Replication to HDFS by Oracle GoldenGate. Oracle Goldengate

App

Oracle Goldengate

HDFS

App

HDFS

App

HDFS

App App

Database

Tran Log

OGG Trail

HDFS HDFS

© The Pythian Group Inc., 2018

21

Replication to Big Data. Source side.

• Oracle 11.2.0.4 and up • Archivelog mode. • Supplemental logging:

App App App App App

Database

Tran Log

•Minimal on DB level. •On schema or table level.

• OGG user in database.

© The Pythian Group Inc., 2018

22

Replication to Big Data. Oracle GoldenGate.

Manager Extract

Tran Log

Data Pump

OGG Trail

Manager Replicat

OGG Trail

• With the latest OGG for BD: •JDBC , Elasticsearch, Kinesis, Kafka Connect ...

• OGG 12.2.0.1 and up. • BD adapters with replicat. • Supported BD targets: •HDFS •Kafka •Flume •HBase •Mongodb (from 12.3) •Cassandra (from 12.3)

• Different formats. © The Pythian Group Inc., 2018

23

Replication to Big Data. To HDFS.

Manager

HDFS Replicat

HDFS Client OGG Trail

HDFS HDFS

• OGG 12.2.0.1 and up. • BD adapters with replicat. • Different formats. • DML and DDL.

HDFS HDFS

© The Pythian Group Inc., 2018

24

Replication to Big Data. Test two. orcl> select * from ggtest.test_tab_2; ID RND_STR USE_DATE 1 BGBXRKJL 02/13/16 08:34:19 2 FNMCEPWE 08/17/15 04:50:18

• New columns: •Operation type. •Table name •Local and UTC timestamp

hive> select * from BDTEST.TEST_TAB_2; OK I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 2016-02-13:08:34:19 NULL

2016-10-24T21:09:21.186000

00000000120000004759

1

I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 FNMCEPWE 2015-08-17:04:50:18 NULL

2016-10-24T21:09:22.827000

00000000120000004921

2

BGBXRKJL

© The Pythian Group Inc., 2018

25

Replication to Big Data. Journal of changes instead of a state

HDFS

Database Insert row 1 Row 1 Insert row 2

I | Row 2

Insert row 3

I | Row 3

Update row 3

U | Row 3

Row 2 Row 3 Row 3

I | Row 1

D | Row 2 Delete row 2

© The Pythian Group Inc., 2018

26

Replication to Big Data. To Kafka. Zookeeper

Manager

Broker

Stream Kafka Consumer

Replicat Kafka producer OGG Trail

Kafka topic

Kafka Consumer

Kafka topic

Kafka Consumer

© The Pythian Group Inc., 2018

27

Replication to Big Data. Some notes about Kafka handler:

• •

Version 0.9+. Schema definition can go to a different topic(Avro).

and others:

• Different topologies for Kafka. • Topic partitioning to table. • Topics are created automatically. • Operation vs transactional mode. • Blocking vs Non-Blocking.

© The Pythian Group Inc., 2018

28

Replication to Big Data. Some notes about source:

• • • • • •

We are replicating a log of changes. Only committed transactions. “Passive” commit. Different levels of supplemental logging may lead to different captured data (compressed DML). DDL only with following DML. CTAS is not supported.

and BD destination:

• JSON,Text,Avro(different types) and XML formats.

• Some strange behaviour with Avro. • Hive support through text format. • Trimming out leading or trailing whitespaces.

• Truncates only as DML. • Proper path to Java classes.

© The Pythian Group Inc., 2018

29

Kafka connect and Kafka stream. Confluent team and their additions.

•

Kafka connect.

• • •

•

Kafka Streams and KSQL(beta).

Frame to build connector.

•

https://www.confluent.io/product /connectors/

Incorporated to Kafka and works together.

•

Main purpose is copy data to and from Kafka.

https://www.confluent.io/product/kaf ka-streams/

•

Input data are transformed to output data.

© The Pythian Group Inc., 2018

30

Replication to Big Data. To Flume.

Manager

Flume agent Replicat Flume source

Flume channel

Flume sink

HDFS

OGG Trail

© The Pythian Group Inc., 2018

31

Replication to Big Data. Some notes about Flume handler:

• • •

Version 1.4+. (1.6.x better) Avro and Thrift formats. Schema changes go to a different Flume event.

and others:

• Configuration file path. • Failover and load balancing (from 1.6.x for Avro).

• Kerberos security (from 1.6.x for Thrift).

© The Pythian Group Inc., 2018

32

Replication to Big Data. To HBase. MongoDB. Cassandra.

Manager

HBASE Replicat

Hbase client

HDFS

OGG Trail

© The Pythian Group Inc., 2018

33

Replication to Big Data. Primary key for a source table in HBase.

• Having supplemental logging for all columns: •Row id as concatenation of values for all columns.

• Adding a primary key and supplemental logging for keys: •Row id is primary key.

orcl> alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);

Table altered.

orcl> insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);

© The Pythian Group Inc., 2018

34

Replication to Big Data. Two cases. All columns supplemental logging and PK supplemental logging. hbase(main):012:0> scan 'BDTEST:TEST_TAB_2' ROW

COLUMN+CELL

7|IJWQRO7T|2013-07-07:08:13:52

column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52

7|IJWQRO7T|2013-07-07:08:13:52

column=cf:PK_ID, timestamp=1459275116849, value=7

7|IJWQRO7T|2013-07-07:08:13:52

column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T

8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:PK_ID, timestamp=1459278884047, value=8 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1 8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER 9

column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19

9

column=cf:PK_ID, timestamp=1462473865704, value=9

9

column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST

9

column=cf:TEST_COL, timestamp=1462473865704, value=NULL

© The Pythian Group Inc., 2018

35

Replication to Big Data. Some notes about HBase handler:

•

Version 1.1+. (0.98 with compat parameter)

• •

HBase row key mapping. Only single column family.

and others:

• Hbase-site.xml in the classpath. • GROUPTRANSOPS batching for performance.

• Kerberos security (from 1.6.x for Thrift).

© The Pythian Group Inc., 2018

36

Replication to Big Data. Some notes about Cassandra handler:

and Mongodb:

Does not automatically create keyspaces (but creates tables)

• Does not allow the _id column to be

• •

Primary keys are immutable.

• Undo for bulk transactions (requires

•

Primary key updates are deletes and inserts (watch compressed updates in OGG).

•

Inserts and Updates are different than in traditional databases.

modified. full before image for deletes in OGG).

• Primary key updates are deletes and inserts (watch compressed updates in OGG).

© The Pythian Group Inc., 2018

37

Other tools. What can be used as an alternative.

•

DBvisit.

• • •

Replicate connector to Kafka.

•

Has similar functionality to Oracle GoldenGate.

Confluent version of Kafka. http://www.dbvisit.com/replicate _connector_for_kafka

•

Dell Shareplex.

• • •

Replication to Kafka.

•

https://documents.software.dell.com /shareplex/8.6.4/administration-guid e/configure-replication-to-open-targ et-targets/configure-replication-to-akafka-target

Supports most of datatypes. http://documents.software.dell.com/ SharePlex/8.6.5/release-notes/syst em-requirements

© The Pythian Group©Inc., The2017 Pythian Group Inc., 2018

38

And what is wrong with Sqoop?

© The Pythian Group Inc., 2018

Batch offloading from Oracle to Big Data. To HDFS.

SQOOP

HDFS HDFS HDFS

Database

• Sqoop one of the oldest. • Works in both ways. • Gluent is more than a just batch tool.

HDFS HDFS

© The Pythian Group Inc., 2018

40

Replication to Big Data. Sqoop import. Generate MapReduce Jobs. Connects through JDBC. Good for batch processing. App

HDFS

Sqoop

App

HDFS

App App App

Hive Database

JDBC

MapReduce

HDFS HDFS HDFS

© The Pythian Group Inc., 2018

41

Back again Using data from Big Data in Oracle

From Big Data to Oracle. Tool

Filtering

Data Conversion

Query

Sqoop

Oracle

Oracle

ODBC gateway

Hadoop

Oracle

Oracle loader for HDFS

Oracle

Hadoop

Oracle SQL Connector for HDFS

Oracle

Oracle

Yes

BD SQL

Hadoop

Hadoop

Yes

ODI

KM

KM

Yes

Gluent

Hadoop

Hadoop

Yes

Offload query

Parallel Yes

Yes

Yes Yes

Yes

Yes

Yes Yes

Yes

Yes © The Pythian Group Inc., 2018

43

Oracle Data Integrator. ODI vs traditional ETL.

Extract Source I

Transform

Load

Intermediate staging and transformation

Target

Source II

Extract

Load

Transform Transform

Source I Load engine Source II

Target

© The Pythian Group Inc., 2018

44

Oracle Data Integrator. How it looks.

Oracle DB I

Weblogic

Oracle DB II ODI Agent Staging schema HDFS

Work repository

Target schema

CSV text

Master repository

ODI Studio © The Pythian Group Inc., 2018

45

Oracle Data Integrator. ODI workflow. Designer

Logical view

Context

Project I

PROD

Project II

PROD Logical schema I

Data model Data model

Physical

Logical schema II

Prod Batch Prod Stream

PROD Stream

Test

PROD Stream

Data model

Test env

Data model

Test env

© The Pythian Group Inc., 2018

46

Oracle Data Integrator. ODI logical mapping

• Logical mapping is separated from physical implementation.

© The Pythian Group Inc., 2018

47

Oracle Data Integrator. ODI physical mapping.

• Different physical mapping for the same logical map

• Different knowledge modules.

© The Pythian Group Inc., 2018

48

ODI Oracle Data Integrator.

•

• • •

• •

• Knowledge modules for most

ELT: Extract. Load. Transform.

Logical and physical models separation. Knowledge modules:

• • •

platforms.

• Uses filters, mappings, joins and constraints.

• Batch or event(stream) oriented integration.

RKM - reverse engineering. LKM - load. IKM - integration.

© The Pythian Group Inc., 2018

49

Oracle Big Data SQL Architecture. BD SQL

CDH or HDP management server BD SQL agent/service

Oracle Database

Hadoop

Hadoop

Exadata technology for Big Data © The Pythian Group Inc., 2018

50

Oracle Big Data SQL. What it can do for us.

•

Works with:

• • • •

• •

Apache Hive. HDFS Oracle NoSQL Database.

• Smart scan: • Storage indexes. • Bloom filters.

Apache HBase

Fully supports SQL syntax. Predicate push down.

© The Pythian Group Inc., 2018

51

Oracle Big Data SQL. Views and packages to use. orcl> select cluster_id,database_name, owner, table_name from all_hive_tables where database_name='bdtest'; CLUSTER_ID DATABASE_NAME OWNER TABLE_NAME bigdatalite bdtest

oracle test_tab_1

bigdatalite bdtest

oracle test_tab_2

• Smart scan: • Storage indexes.

• Bloom filters.

PROCEDURE CREATE_EXTDDL_FOR_HIVE Argument Name

Type

In/Out Default?

------------------------------ ----------------------- ------ -------CLUSTER_ID DB_NAME HIVE_TABLE_NAME HIVE_PARTITION TABLE_NAME PERFORM_DDL TEXT_OF_DDL

VARCHAR2 VARCHAR2

IN IN

VARCHAR2

IN

PL/SQL BOOLEAN VARCHAR2

IN IN

PL/SQL BOOLEAN CLOB

IN

DEFAULT

OUT

© The Pythian Group Inc., 2018

52

Oracle Big Data SQL. Building the external table. CREATE TABLE test_tab_2 ( tran_flag VARCHAR2(4000), tab_name VARCHAR2(4000), ………..) ORGANIZATION EXTERNAL

(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS

( com.oracle.bigdata.cluster=bigdatalite com.oracle.bigdata.tablename=bdtest.test_tab_2) ) PARALLEL 2 REJECT LIMIT UNLIMITED

TRAN_FLAG ID

RND_STR

USE_DATE

---------- ---------- ---------- -------------------I 1

BGBXRKJL

2016-02-13:08:34:19

I 2

FNMCEPWE

2015-08-17:04:50:18

I 1

BGBXRKJL

2016-02-13:08:34:19

• Hive table can be queried in Oracle now.

© The Pythian Group Inc., 2018

53

Oracle Big Data SQL. What it can do for us.

• External table FULL access. © The Pythian Group Inc., 2018

54

Oracle Big Data SQL. Options for BD SQL.

•

ACCESS PARAMETERS :

• • • •

•

com.oracle.bigdata.overflow com.oracle.bigdata.tablename com.oracle.bigdata.colmap com.oracle.bigdata.erroropt

Directly to HDFS :

•

TYPE oracle_hdfs.

• Oracle Copy to Hadoop utility. •Using Data Pump format. •ORACLE_DATAPUMP access driver. •CTAS to external table. •hadoop fs -put … •Hive “INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'” •Oracle Shell for Hadoop Loaders

© The Pythian Group Inc., 2018

55

Other tools. What can be used as an alternative.

•

Gluent Data Platform.

• •

Can offload data to Hadoop.

• •

Offload engine .

Allows query data of any data source in Hadoop. Supports Cloudera, Hortonworks or any Hadoop with Impala or or Hive SQL engine installed.

https://gluent.com/

© The Pythian Group Inc., 2018

56

And here we have returned back.

© The Pythian Group Inc., 2018

57

QA

© The Pythian Group Inc., 2018

58

THANK YOU Email: [email protected] https://pythian.com/blog @sky_vst

© The Pythian Group Inc., 2018

59

Recommend Documents

Justin Marinko âThere and Back Again a Proteins Tale

From Independence to Expansion and Back Again