Performance Tuning Data Transfer Between RDB and Hadoop
Terry Koch, Sr. Engineer, Collier-IT
tk@collier-it.com / @tkoch_a
Agenda
• What's Sqoop?
• How does it work?
• How do I make it faster?
• Other Best Practices
• Where can I get more info?
• Q & A
$ yum install sqoop

• Sqoop (SQL to Hadoop) is a CLI to move data between RDBs and Hadoop
• Sqoop 2 adds a web interface
  – Sqoop 2 is not functionally equivalent yet
  – USE SQOOP 2 WITH CAUTION
• Generic JDBC + dedicated connectors for specific databases
  – --direct invokes a DB-specific driver if available. Try these first.
  – Disclaimer: not all direct drivers support all features of the generic connector
• Batch process; allows for incremental loads
• Two-way data movement:
  – RDB >> Hadoop (I refer to this as import)
  – Hadoop >> RDB (I refer to this as export)
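For example, an incremental import can pick up only rows added since the last run. A minimal sketch, assuming a monotonically increasing ID column (the connection details, table, and values are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --incremental append \
  --check-column ID \
  --last-value 1000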
How Sqoop works
• Map-only jobs each query for and deliver a part of the data
• Parallelization is mostly* determined by Hadoop (the number of mappers)
• Data types are determined by the RDB metadata
• Data can be loaded into Hive or HBase
• HDFS, Hive, and HBase data can be exported to an RDB

* each DB will handle parallelism a bit differently
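Under the hood, Sqoop computes MIN/MAX boundaries on a split column and hands each mapper a bounded slice. For instance, an import can be parallelized and landed directly in Hive; a minimal sketch (connection details and table name are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --hive-import \
  --num-mappers 4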
Basics and a quick note on Security

Params vs. Properties
• Params are (mostly) specified with a double dash: --parameter-name
• Properties are specified with -Dfull.property.name=myValue
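Both can be mixed on one command line; -D properties go first, before the tool-specific params. A minimal sketch (the queue-name property is only an illustration):

$ sqoop import \
  -Dmapreduce.job.queuename=etl \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable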
[terry@localhost]$ /usr/bin/sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable
Enter password:
• The -P prompt is secure, but for command-line use only
  – The --password parameter is completely insecure (visible in the process list and shell history)
  – --password-file is not much better
Java Key Store – as of Sqoop 1.4.5

[terry@localhost]$ hadoop credential create mydb.password.alias \
  -provider jceks://hdfs/user/terry/mydb.password.jceks
Enter password:
Enter password again:
mydb.password.alias has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

[terry@localhost]$ /usr/bin/sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/terry/mydb.password.jceks \
  --connect jdbc:oracle:thin:@localhost:1521:xe \
  --username terry \
  --password-alias mydb.password.alias \
  --table system.MyTable
Other new and cool things in 1.4.5
• OraOop baked into Sqoop!
  – Developed by Quest (Dell)
  – Enabled with --direct
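A minimal sketch of enabling the direct (OraOop) path on an import; connection details and the table name are hypothetical:

$ sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable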
OraOop Design Goals
• Partition data to the mappers based on physical storage characteristics, so that each mapper is allocated a completely distinct set of physical blocks and no physical block is read more than once.
• Make sure each mapper receives an equal amount of work.
• Bypass Oracle Database parallelism and the Oracle buffer cache.
• Neither require nor use indexes.
• Use Oracle "direct path" I/O.
About those Mappers
• Sqoop delegates processing to Hadoop
• Each mapper transfers a slice of the data
• Number of mappers is specified with the --num-mappers parameter
  – Default is 4
• How many mappers, then? It depends:
  – What database?
  – DB server and cluster specs
  – Other workloads running
About those Mappers
• More mappers == faster jobs, up to a point, depending on:
  – Table size
  – Sqoop parameters
  – Server specs
  – Other load on the DB and cluster
• More mappers == more DB activity == more load on the DB and slower performance for other DB users. Yes, they will notice.
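A minimal sketch of raising the mapper count for a large table (8 is only an illustrative value; increase gradually and measure):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.BigTable \
  --num-mappers 8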
Exporting Data
• Try batching it
  – The --batch parameter enables JDBC statement batching
    • Supported by all JDBC drivers
    • Depending on the DB, it may worsen performance
  – Set -Dsqoop.export.records.per.statement=[n] (default 100)
    • Translates to: INSERT INTO myTable VALUES (...),(...),...;
    • Some DBs limit query size, so this can fail if [n] is too large
  – Set -Dsqoop.export.statements.per.transaction=[n] (default 100)
    • Doesn't affect query size, only commit frequency
  – All three of these can be used together or separately
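A minimal sketch combining all three on a single export (table, directory, and values are hypothetical):

$ sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --export-dir /user/terry/mytable \
  --batch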
Best Practices
• Use an options file for common params, as shown below
  – --options-file=/users/terry/import.txt
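A minimal sketch of an options file and its use; the path and contents are hypothetical. Each option and each value goes on its own line, and lines starting with # are comments:

# /users/terry/import.txt -- shared import options
import
--connect
jdbc:oracle:thin:@localhost:1521:myDb
--username
terry

[terry@localhost]$ sqoop --options-file /users/terry/import.txt --table mySchema.MyTable -P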
• Use a staging table for export to RDB
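A minimal sketch, assuming a staging table that already exists with the same schema as the target (names and paths are hypothetical):

$ sqoop export \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --staging-table mySchema.MyTable_STG \
  --clear-staging-table \
  --export-dir /user/terry/mytable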
Best Practices
• With imports, use the output line formatting options wherever possible, for accuracy of data transfer:
  – --enclosed-by
  – --fields-terminated-by
  – --escaped-by
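A minimal sketch combining the three (the delimiter choices are only an example):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --fields-terminated-by ',' \
  --enclosed-by '"' \
  --escaped-by '\\'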
Best Practices
• With imports, try using boundary queries for better performance
• By default, the primary key is used to split the data
• If there is no primary key, or the data is skewed, look at using:
  – --split-by
  – --boundary-query
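A minimal sketch (the column and query are hypothetical); the boundary query must return the min and max of the split column:

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --split-by CUSTOMER_ID \
  --boundary-query 'SELECT MIN(CUSTOMER_ID), MAX(CUSTOMER_ID) FROM mySchema.MyTable'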
Best Practices
• File formats
  – Text (default)
    • Non-binary types
    • Platform independent
    • Easy to read / debug
  – Binary (Avro, SequenceFile)
    • Precise and efficient storage
    • Best for columns containing separators
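A minimal sketch of importing as Avro instead of text (connection details and the table name are hypothetical):

$ sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable \
  --as-avrodatafile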