Performance Tuning Data Transfer Between RDB and Hadoop
Terry Koch, Sr. Engineer, Collier-IT
tk@collier-it.com / @tkoch_a
Agenda
• What's Sqoop?
• How does it work?
• How do I make it faster?
• Other Best Practices
• Where can I get more info?
• Q & A
$ yum install sqoop

• Sqoop (SQL to Hadoop) is a CLI to move data between RDBs and Hadoop
• Sqoop 2 adds a web interface
  – Sqoop 2 is not functionally equivalent yet
  – USE SQOOP 2 WITH CAUTION
• Generic JDBC + dedicated connectors for specific databases
  – --direct invokes a DB-specific driver if available. Try these first.
  – Disclaimer: not all direct drivers support all features of the generic connector
• Batch process; allows for incremental loads
• Two-way data movement:
  – RDB >> Hadoop (I refer to this as import)
  – Hadoop >> RDB (I refer to this as export)
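For example, an incremental import can pick up only rows added since the last run. A minimal sketch, assuming a monotonically increasing ID column (the connection details, table, and values are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --incremental append \
  --check-column ID \
  --last-value 1000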
How Sqoop works
• Map-only jobs each query for and deliver a part of the data
• Parallelization is mostly* determined by Hadoop (the number of mappers)
• Data types are determined by the RDB metadata
• Data can be loaded into Hive or HBase
• HDFS, Hive, and HBase data can be exported to an RDB

* each DB will handle parallelism a bit differently
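Under the hood, Sqoop computes MIN/MAX boundaries on a split column and hands each mapper a bounded slice. For instance, an import can be parallelized and landed directly in Hive; a minimal sketch (connection details and table name are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --hive-import \
  --num-mappers 4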
Basics and a quick note on Security

Params vs. Properties
• Params are (mostly) specified with a double dash: --parameter-name
• Properties are specified with -Dfull.property.name=myValue
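Both can be mixed on one command line; -D properties go first, before the tool-specific params. A minimal sketch (the queue-name property is only an illustration):

$ sqoop import \
  -Dmapreduce.job.queuename=etl \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable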
[terry@localhost]$ /usr/bin/sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable
Enter password:
• The -P prompt is secure, but for command-line use only
  – The --password parameter is completely insecure (visible in the process list and shell history)
  – --password-file is not much better
Java Key Store – as of Sqoop 1.4.5

[terry@localhost]$ hadoop credential create mydb.password.alias \
  -provider jceks://hdfs/user/terry/mydb.password.jceks
Enter password:
Enter password again:
mydb.password.alias has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

[terry@localhost]$ /usr/bin/sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/terry/mydb.password.jceks \
  --connect jdbc:oracle:thin:@localhost:1521:xe \
  --username terry \
  --password-alias mydb.password.alias \
  --table system.MyTable
Other new and cool things in 1.4.5
• OraOop baked into Sqoop!
  – Developed by Quest (Dell)
  – Enabled with --direct
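A minimal sketch of enabling the direct (OraOop) path on an import; connection details and the table name are hypothetical:

$ sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable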
OraOop Design Goals
• Partition data to the mappers based on physical storage characteristics, so that each mapper is allocated a completely distinct set of physical blocks and no physical block is read more than once.
• Make sure each mapper receives an equal amount of work.
• Bypass Oracle Database parallelism and the Oracle buffer cache.
• Neither require nor use indexes.
• Use Oracle "direct path" I/O.
About those Mappers
• Sqoop delegates processing to Hadoop
• Each mapper transfers a slice of the data
• Number of mappers is specified with the --num-mappers parameter
  – Default is 4
• How many mappers, then? It depends:
  – What database?
  – DB server and cluster specs
  – Other workloads running
About those Mappers
• More mappers == faster jobs, up to a point, depending on:
  – Table size
  – Sqoop parameters
  – Server specs
  – Other load on the DB and cluster
• More mappers == more DB activity == more load on the DB and slower performance for other DB users. Yes, they will notice.
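A minimal sketch of raising the mapper count for a large table (8 is only an illustrative value; increase gradually and measure):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.BigTable \
  --num-mappers 8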
Exporting Data
• Try batching it
  – The --batch parameter enables JDBC statement batching
    • Supported by all JDBC drivers
    • Depending on the DB, it may worsen performance
  – Set -Dsqoop.export.records.per.statement=[n] (default 100)
    • Translates to: INSERT INTO myTable VALUES (...),(...),...;
    • Some DBs limit query size, so this can fail if [n] is too large
  – Set -Dsqoop.export.statements.per.transaction=[n] (default 100)
    • Doesn't affect query size, only commit frequency
  – All three of these can be used together or separately
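A minimal sketch combining all three on a single export (table, directory, and values are hypothetical):

$ sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --export-dir /user/terry/mytable \
  --batch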
Best Practices
• Use an options file for common params, as shown below
  – --options-file=/users/terry/import.txt
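A minimal sketch of an options file and its use; the path and contents are hypothetical. Each option and each value goes on its own line, and lines starting with # are comments:

# /users/terry/import.txt -- shared import options
import
--connect
jdbc:oracle:thin:@localhost:1521:myDb
--username
terry

[terry@localhost]$ sqoop --options-file /users/terry/import.txt --table mySchema.MyTable -P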
• Use a staging table for export to RDB
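A minimal sketch, assuming a staging table that already exists with the same schema as the target (names and paths are hypothetical):

$ sqoop export \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --staging-table mySchema.MyTable_STG \
  --clear-staging-table \
  --export-dir /user/terry/mytable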
Best Practices
• With imports, use the output line formatting options wherever possible, for accuracy of data transfer:
  – --enclosed-by
  – --fields-terminated-by
  – --escaped-by
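A minimal sketch combining the three (the delimiter choices are only an example):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --fields-terminated-by ',' \
  --enclosed-by '"' \
  --escaped-by '\\'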
Best Practices
• With imports, try using boundary queries for better performance
• By default, the primary key is used to split the data
• If there is no primary key, or the data is skewed, look at using:
  – --split-by
  – --boundary-query
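A minimal sketch (the column and query are hypothetical); the boundary query must return the min and max of the split column:

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --split-by CUSTOMER_ID \
  --boundary-query 'SELECT MIN(CUSTOMER_ID), MAX(CUSTOMER_ID) FROM mySchema.MyTable'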
Best Practices
• File formats
  – Text (default)
    • Non-binary types
    • Platform independent
    • Easy to read / debug
  – Binary (Avro, SequenceFile)
    • Precise and efficient storage
    • Best for columns containing separators
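A minimal sketch of importing as Avro instead of text (connection details and the table name are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --as-avrodatafile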