Terry Koch

Performance Tuning Data Transfer Between RDB and Hadoop

Terry Koch, Sr. Engineer, Collier-IT | tk@collier-it.com | @tkoch_a

Agenda

• What's Sqoop?
• How does it work?
• How do I make it faster?
• Other Best Practices
• Where can I get more info?
• Q & A

WWW.COLLIER-IT.COM

$ yum install sqoop

• Sqoop (SQL to Hadoop) is a CLI to move data between RDBs and Hadoop.
• Sqoop 2 adds a web interface.
  – Sqoop 2 is not functionally equivalent yet.
  – USE SQOOP 2 WITH CAUTION
• Generic JDBC + dedicated connectors for specific databases
  – --direct invokes a DB-specific driver if available. Try these first.
  – Disclaimer: not all direct drivers support all features of the generic connector.
• Batch process; allows for incremental loads.
• Two-way data movement:
  – RDB >> Hadoop (I refer to this as import)
  – Hadoop >> RDB (I refer to this as export)


How Sqoop Works
• Map-only jobs each query for and deliver a part of the data.
• Parallelization is *mostly determined by Hadoop (the number of mappers).
• Data types are determined by the RDB metadata.
• Data can be loaded into Hive or HBase.
• HDFS, Hive, and HBase data can be exported to an RDB.

* Each DB will handle parallelism a bit differently.
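As a sketch of the flow above, a minimal import might look like the following; the host, SID, schema, table, and target directory are illustrative placeholders, not values from the deck:

```shell
# Minimal Sqoop import: Sqoop generates a map-only job and splits the
# table across mappers (4 by default), writing one output file per mapper.
# Connection details and table name below are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username terry -P \
  --table MYSCHEMA.MYTABLE \
  --target-dir /user/terry/mytable
```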


Basics and a Quick Note on Security

Params vs. Properties
• Params are (mostly) specified with a double dash.
• Properties are specified with -Dfull.property.name=myValue

[terry@localhost]$ /usr/bin/sqoop import \
  --connect jdbc:oracle:thin:@localhost:1521:myDb \
  --username terry -P \
  --table mySchema.MyTable
Enter password:

• Secure for command line only.
  – The --password parameter is completely insecure.
  – --password-file is not much better.


Java Key Store (as of Sqoop 1.4.5)

[terry@localhost]$ hadoop credential create \
  mydb.password.alias -provider \
  jceks://hdfs/user/terry/mydb.password.jceks
Enter password:
Enter password again:
mydb.password.alias has been successfully created.
org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

[terry@localhost]$ /usr/bin/sqoop import \
  -Dhadoop.security.credential.provider.path=\
jceks://hdfs/user/terry/mydb.password.jceks \
  --connect jdbc:oracle:thin:@localhost:1521:xe \
  --username terry \
  --password-alias mydb.password.alias \
  --table system.MyTable


Other new and cool things in 1.4.5
• OraOop baked in to Sqoop!
  – Developed by Quest (Dell)
  – Enabled with --direct


OraOop Design  Goals

• Partition data to the mappers based on physical storage characteristics, so that each mapper is allocated a completely distinct set of physical blocks and no physical block is read more than once.
• Make sure each mapper receives an equal amount of work.
• Bypass Oracle Database parallelism and the Oracle buffer cache.
• Neither require nor use indexes.
• Use Oracle "direct path" I/O.



About Those Mappers
• Sqoop delegates processing to Hadoop.
• Each mapper transfers a slice of data.
• Number of mappers is specified with the --num-mappers parameter.
  – Default is 4.
• How many mappers, then? It depends:
  – What database?
  – DB server and cluster specs
  – Other workloads running
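Since the right mapper count depends on all of the factors above, it is usually found by experiment. A hypothetical timing loop (connection details and table names are placeholders) shows where the knob goes:

```shell
# Time the same import at a few mapper counts and compare; 4 is the
# default. Watch DB load while each run executes.
# All connection details below are placeholders.
for m in 4 8 16; do
  time sqoop import \
    --connect jdbc:oracle:thin:@dbhost:1521:orcl \
    --username terry --password-alias mydb.password.alias \
    --table MYSCHEMA.MYTABLE \
    --target-dir /user/terry/mytable_${m}m \
    --num-mappers ${m}
done
```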


About Those Mappers
• More mappers == faster jobs, up to a point, depending on:
  – Table size
  – Sqoop parameters
  – Server specs
  – Other load on the DB and cluster
• More mappers == more DB activity == more load on the DB and slower performance for other DB users. Yes, they will notice.



Exporting Data
• Try batching it.
  – The --batch parameter sets 2 properties:
    • -Dsqoop.export.records.per.statement=100
    • -Dsqoop.export.records.per.transaction=100
  – Supported by all JDBC drivers.
    • Depending on the DB, batching may worsen performance.
  – Set -Dsqoop.export.records.per.statement=[n]
    • Translates to: INSERT INTO myTable VALUES (…),(…),…;
    • Some DBs limit query size, so this can fail if [n] is too large.
  – Set -Dsqoop.export.records.per.transaction=[n]
    • Doesn't affect query size, only commit frequency.
  – All 3 of these can be used together or separately.
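Putting the properties above together, a hedged sketch of a tuned export (the values 100/1000 and all connection details are illustrative, not recommendations from the deck):

```shell
# Export with statement and transaction batching set explicitly.
# -D properties must appear before the other Sqoop arguments.
# Connection string, table, and directory are placeholders.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.records.per.transaction=1000 \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username terry --password-alias mydb.password.alias \
  --table MYSCHEMA.MYTABLE \
  --export-dir /user/terry/mytable
```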


Best Practices
• Use an options file for common params:
  Ø --options-file=/users/terry/import.txt
• Use a staging table for export to an RDB.
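A hypothetical options file, assuming the usual Sqoop convention of one option or value per line with `#` comments (the filename and contents here are illustrative):

```shell
# Create a hypothetical options file holding the shared arguments.
cat > ./import.txt <<'EOF'
# Shared connection settings
import
--connect
jdbc:oracle:thin:@dbhost:1521:orcl
--username
terry
EOF

# Job-specific options are then appended on the command line, e.g.:
#   sqoop --options-file ./import.txt --table MYSCHEMA.MYTABLE -P
```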

Best Practices
• With imports, use the output line formatting options wherever possible, for accuracy of data transfer:
  Ø --enclosed-by
  Ø --fields-terminated-by
  Ø --escaped-by
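A sketch of those formatting flags in use; explicit delimiters make the output unambiguous when column values contain commas or quotes (all names below are placeholders):

```shell
# Tab-separated fields, values enclosed in double quotes, backslash
# as the escape character. Connection details are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username terry -P \
  --table MYSCHEMA.MYTABLE \
  --fields-terminated-by '\t' \
  --enclosed-by '"' \
  --escaped-by '\\'
```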

Best Practices
• With imports, try using boundary queries for better performance.
• By default, the primary key is used to split data.
• If there is no primary key, or the data is skewed, look at using:
  Ø --split-by
  Ø --boundary-query
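As an illustration of the two flags, a hypothetical import over a skewed table (the ORDERS table, ORDER_ID column, and boundary SQL are all invented for the example):

```shell
# Split on a well-distributed column instead of the PK, and supply the
# min/max boundaries directly so Sqoop skips its own boundary scan.
# Table, column, and connection details are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username terry -P \
  --table MYSCHEMA.ORDERS \
  --split-by ORDER_ID \
  --boundary-query "SELECT MIN(order_id), MAX(order_id) FROM orders"
```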

Best Practices
• File formats
  – Text (default)
    • Non-binary types
    • Platform independent
    • Easy to read / debug
  – Binary (Avro, SequenceFile)
    • Precise and efficient storage
    • Best for columns containing separators
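Choosing a binary format is a one-flag change; a sketch using Avro (connection details are placeholders):

```shell
# Avro preserves types precisely and is safe when data contains the
# separator characters a text import would need to escape.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username terry -P \
  --table MYSCHEMA.MYTABLE \
  --as-avrodatafile
```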

$ bin/sqoop help


All things Sqoop


Q & A
