Waterline Data Inventory Sandbox for HDP 2.1 and VirtualBox
Product Version 1.1.0
Document Version 1.1

© 2014-2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property of their respective owners.


Table of Contents

Overview
Related Documents
System requirements
Setting up Waterline Data Inventory VM sandbox for VirtualBox
    Running Waterline Data Inventory
    Opening Waterline Data Inventory in a browser
    Exploring the sample cluster
Accessing the Hadoop cluster using SSH
Loading data into HDFS
    Using Hue to load files into HDFS
    Loading files into HDFS from a command line
Running Waterline Data Inventory jobs
Monitoring Waterline Data Inventory jobs
    Monitoring Hadoop jobs
    Monitoring local jobs
    Debugging information
Configuring additional Waterline Data Inventory functionality
    Profiling functionality
    Hive functionality
    Discovery functionality
Accessing Hive tables
    Viewing Hive tables in Hue
    Connecting to the Hive datastore

 


Overview

Waterline Data Inventory reveals information about the metadata and data quality of files in an Apache™ Hadoop® cluster so that users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the collected metadata and Data Inventory's discovered relationships.

This document describes setting up a virtual machine image that is pre-configured with the Waterline Data Inventory application and sample cluster data. The image is built from the Hortonworks™ HDP 2.1 sandbox on Oracle® VirtualBox™.

Related Documents

• Waterline Data Inventory User Guide (also available from the menu in the browser application)

For the most recent documentation and product tutorials, see downloads.waterlinedata.com/documents.

System requirements

The Waterline Data Inventory sandbox is built on the Hortonworks HDP 2.1 sandbox. The system requirements and installation instructions are the same as Hortonworks describes:

   hortonworks.com/products/hortonworks-sandbox/#install

The Waterline Data Inventory sandbox is configured with 8 GB of physical RAM rather than the default of 4 GB.

The basic requirements are as follows.

For your host computer:

• 64-bit computer that supports virtualization.
  VirtualBox describes the unlikely cases where your hardware may not be compatible with 64-bit virtualization:
  www.virtualbox.org/manual/ch10.html#hwvirt

• Operating system supported by VirtualBox, including Microsoft® Windows® (XP and later), many Linux distributions, Apple® Mac® OS X, Oracle Solaris®, and OpenSolaris™.
  www.virtualbox.org/wiki/End-user_documentation


• At least 10 GB of RAM

• VirtualBox virtualization application for your operating system. Download the latest version from:
  www.virtualbox.org

• Waterline Data Inventory VM image built on the Hortonworks HDP 2.1 sandbox, VirtualBox version:
  downloads.waterlinedata.com

Browser compatibility:

• Microsoft Internet Explorer 10 and later (not supported on Mac OS)
• Chrome 36 or later
• Safari 6 or later
• Firefox 31 or later

Setting up Waterline Data Inventory VM sandbox for VirtualBox

1. Install VirtualBox.
2. Download the Waterline Data Inventory VM (.ova file).
3. Open the .ova file with VirtualBox (double-click the file).
4. Click Import to accept the default settings for the VM.
   It takes a few minutes to expand the archive and create the guest environment.
5. (Optional) Configure a way to easily move files between the host and guest. Some options are:
   • Configure a shared directory between the host and guest (Settings > Shared Folders; specify auto-mount). A scripted alternative is sketched after this list.
   • Set up a bidirectional clipboard.
6. Start the VM.
   It takes a few minutes for Hadoop and its components to start up.
7. Note the IP address used for SSH access, such as 127.0.0.1.
8. Log in as waterlinedata/waterlinedata.
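If you prefer to script the shared directory in step 5 rather than use the Settings dialog, here is a minimal sketch using the VBoxManage CLI that ships with VirtualBox. The VM name and host path below are assumptions; check the actual name with the first command, and run the sharedfolder command while the VM is powered off:

   VBoxManage list vms
   VBoxManage sharedfolder add "Waterline Data Inventory Sandbox" --name wdi-share --hostpath ~/wdi-share --automount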

Running Waterline Data Inventory

1. Open a terminal or command prompt on the host and connect to the guest:

      ssh waterlinedata@127.0.0.1 -p2222

   Enter the password when prompted ("waterlinedata").


2. Start the embedded metadata repository database, Derby:

      cd waterlinedata
      bin/derbyStart

   You'll see a response that ends with "...started and ready to accept connections on port 4444". Press Enter to return to the shell prompt.

3. Start the embedded web server, Jetty:

      bin/jettyStart

   The console fills with status messages from Jetty. Only messages identified by "ERROR" or "exception" indicate problems.

You are now ready to use the application and its sample data.
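If you would rather not keep a console open for each service, here is a minimal sketch that starts both in the background from a single SSH session (the log paths are arbitrary choices, not product defaults):

   cd /home/waterlinedata/waterlinedata
   nohup bin/derbyStart > /tmp/derby.out 2>&1 &
   sleep 10                                      # give Derby time to start listening on port 4444
   nohup bin/jettyStart > /tmp/jetty.out 2>&1 &
   tail -f /tmp/jetty.out                        # watch Jetty's status messages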

Opening Waterline Data Inventory in a browser

The sandbox includes pre-profiled data so you can see the functionality of Waterline Data Inventory before you load your own data.

1. Open a browser to the Waterline Data Inventory application:

      http://localhost:8082

   or

      http://<guest IP address>:8082

2. Sign in to Waterline Data Inventory using any of the Linux users configured for your system, including "waterlinedata".

3. The VM image is configured with the following additional ports that allow access to the guest:

Port     Application Component
8082     Waterline Data Inventory browser application
10000    Hive
19888    Hadoop job history
4444     Derby
8000     Hue
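To confirm from the host that the forwarded ports respond, a quick sketch using curl (assumed to be available on the host; any 2xx or 3xx status code means the service is up):

   curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8082   # Waterline Data Inventory
   curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000   # Hue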

Exploring the sample cluster

The Waterline Data Inventory sandbox is pre-populated with public data to simulate a set of users analyzing and manipulating the data. As you might expect among a group of users, there are multiple copies of the same data, standards for file and field names are not consistent, and data is not always wrangled into forms that are immediately useful for analysis. In other words, the data is intended to reflect reality.


Here are some entry points to help you use this sample data to explore the capabilities of Waterline Data Inventory:

Tags

Tags help you identify data that you may want to use for analysis. When you place tags on fields, Waterline Data Inventory looks for similar data across the profiled files in the cluster and suggests your tags for other fields. Use the tags you enter, and the automatically suggested tags, in searches and in search filtering with facets.

In the sample data, look for tags for "Food Service" data.

Lineage relationships, landings, and origins

Waterline Data Inventory uses file metadata and data to identify cluster files that are related to each other. It finds copies of the same data, joins between files, and horizontal and vertical subsets of files. If you mark the places where data comes into the cluster with "Landing" labels, Waterline Data Inventory propagates this information through the lineage relationships to show the origin of the data.

In the sample data, look for origins for "data.gov," "Twitter," and "Restaurant Inspections."


Searching with facets

Use the Global Search text box at the top of the page to do keyword searches across your cluster metadata, including file and field names, tags and tag descriptions, and 50 examples of the most frequent data in each field. Waterline Data Inventory also provides search facets on common file and field properties, such as file size and data density. Some of the most powerful facets are those for tags and origins. Use the facet lists on the Advanced Search page to identify what kind of data you want to find. Then use facets in the left pane to refine the search results further.

In the sample data, use "Food Service" tags on the Advanced Search page, then filter the results by origin, such as "Restaurant Inspections".

 


Accessing the Hadoop cluster using SSH

To run Waterline Data Inventory jobs and to upload files in bulk to HDFS, you will want to access the guest machine through a Secure Shell (SSH) connection from a command prompt or terminal on your host computer. Alternatively, you can use the terminal in the guest VirtualBox window, but that can be awkward.

1. In a terminal window (Mac) or command prompt (Windows), start an SSH session using the IP address provided for the guest instance and the username waterlinedata, all lower case:

      ssh waterlinedata@<guest IP address> -p2222

   or

      ssh waterlinedata@localhost -p2222

2. You may be prompted to continue connecting even though the authenticity of the host cannot be established. Enter yes.

3. Enter the waterlinedata user password "waterlinedata".
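If you connect often, you can add a host alias to ~/.ssh/config on the host so the port and username are filled in for you (the alias name is an arbitrary choice):

   Host wdi-sandbox
       HostName localhost
       Port 2222
       User waterlinedata

Then connect with: ssh wdi-sandbox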

Loading data into HDFS

Loading data into HDFS is a two-stage process: first you move data from its source—such as your local computer or a public website—to the guest file system. Then you copy the data from the guest file system into HDFS. For a small number of files, the Hadoop utility Hue makes this process very easy by allowing you to select files from the host computer and copy them directly into HDFS. For larger files or large numbers of files, you may decide to use a combination of an SSH client (to move files to the guest machine) and a command-line operation (to move files from the guest file system to HDFS). If you have a shared directory configured between the host and guest, you can access the files directly from the guest.

Using Hue to load files into HDFS

To access Hue from a browser on the host computer:

   http://<guest IP address>:8000

New > Directory
   Create a new directory inside the current directory. Feel free to create additional /user directories.
   Note: Avoid adding directories above /user because it complicates accessing these locations from the Linux command line.

Upload > Files
   Hue allows you to use your local file system to select and upload files.
   Note: Avoid uploading zip files unless you are familiar with uncompressing these files from inside HDFS (one workaround is sketched after this list).

Move to Trash > Delete Forever
   "Trash" is just another directory in HDFS, so moving files to trash does not remove them from HDFS.
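If you do need to load a zip archive, a minimal sketch of one workaround (the file and directory names are hypothetical): pull the archive down to the guest file system, expand it there, and copy the expanded files back into HDFS.

   hadoop fs -copyToLocal /user/waterlinedata/archive.zip /tmp    # if the zip is already in HDFS
   unzip /tmp/archive.zip -d /tmp/archive                         # expand on the guest file system
   hadoop fs -copyFromLocal /tmp/archive /user/waterlinedata/     # copy the expanded files into HDFS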

Loading files into HDFS from a command line

Copying files to HDFS is a two-step process requiring an SSH connection:

1. Make the data accessible from the guest machine. There are several ways to do this:
   • Use a GUI file-transfer client such as FileZilla or Cyberduck.
   • Use secure copy (scp).
   • Configure a shared directory in the VirtualBox settings for the VM.
2. From inside an SSH connection, use the Hadoop file system command copyFromLocal to move files from the guest file system into HDFS.

The following steps describe using scp to copy files into the guest. Skip to step 7 if you chose to use a GUI client to copy the files. These instructions have you use separate terminal windows or command prompts to access the guest machine in two ways:

• (Guest) indicates the terminal window or command prompt with an open SSH connection.
• (Host) indicates the terminal window or command prompt that runs scp directly.

To copy files from the host computer to HDFS on the guest:

3. (Guest) Open an SSH connection to the guest.
   See Accessing the Hadoop cluster using SSH.


4. (Guest) Create a staging location for your data on the guest file system.
   The SSH connection's working directory is /home/waterlinedata. From here, create a directory for your staged data:

      mkdir data

5. (Guest) If needed, create HDFS directories into which you will copy the files.
   Create the directories using Hue or using the following command inside an SSH connection:

      hadoop fs -mkdir <HDFS directory path>

   For example:

      hadoop fs -mkdir /user/waterlinedata/NewStagingArea

6. (Host) In a separate terminal window or command prompt, copy directories or files from host to guest.
   Navigate to the location of the data on the host and run the scp command:

      scp -r ./<directory> waterlinedata@<guest IP address>:<guest directory>

   For example (all on one line; note that scp takes the port with an uppercase -P):

      scp -P 2222 -r ./NewData waterlinedata@localhost:/home/waterlinedata/data

   or

      scp -r ./NewData waterlinedata@192.168.56.101:/home/waterlinedata/data

7. (Guest) Back in the SSH terminal window or command prompt, copy the files from the guest file system to the cluster using the HDFS command copyFromLocal.
   Navigate to the location of the data files you copied in step 6 and copy the files into HDFS using the following command:

      hadoop fs -copyFromLocal <local files> <HDFS directory path>

   For example (all on one line):

      hadoop fs -copyFromLocal /home/waterlinedata/data/ /user/waterlinedata/NewStagingArea/
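To confirm that the files arrived, list the target directory (the path matches the example above):

   hadoop fs -ls /user/waterlinedata/NewStagingArea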


Running Waterline Data Inventory jobs

Waterline Data Inventory format discovery and profiling jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive.

Tag propagation, lineage discovery, collection discovery, and origin propagation jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files, to suggest additional tag associations, and to propagate origin information.

Waterline Data Inventory jobs are run from a command line on the computer on which Waterline Data Inventory is installed. The jobs are started using scripts located in the bin subdirectory of the installation location. For the VM, the installation location is /home/waterlinedata/waterlinedata.

If you are running Waterline Data Inventory jobs in a development environment, consider opening two separate command windows: one for the Jetty console output and a second to run Waterline Data Inventory jobs.


Full profiling and tag propagation
   bin/waterline profile

   Performs the initial profile of your cluster; run it on a regular interval to profile new and updated files. This command triggers profiling as well as the discovery processes that use profiling data. Consider running the lineage discovery command after this command completes. You can choose the directory to profile if you want to limit the scope of the profiling job.

Profiling
   bin/waterline profileOnly

   Profiles cluster content. Use this command after you've added files to the cluster but aren't ready to have Data Inventory suggest tags for the data.
   Example:
      bin/waterline profileOnly /user/Landing

Tag propagation
   bin/waterline tag

   Propagates tags across the cluster. Use this command when you know that you haven't added new files but you have tags and tag associations that you want Data Inventory to consider for propagation.

Lineage discovery
   bin/waterline runLineage

   Discovers lineage relationships and propagates origin information. Use this command when you have marked folders or files with origin labels and want that information propagated through the cluster. Include this command after the full profile for regular cluster profiling.
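The descriptions above amount to a regular cycle of full profiling followed by lineage discovery. A minimal sketch of that sequence from the installation directory:

   cd /home/waterlinedata/waterlinedata
   bin/waterline profile && bin/waterline runLineage   # run lineage discovery only if profiling succeeds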


Monitoring Waterline Data Inventory jobs

Waterline Data Inventory provides a record of job history in the Dashboard of the browser application. In addition, you can follow detailed progress of each job on the console where you run the command.

Monitoring Hadoop jobs

When you run the "profile" command, you'll see an initial job for format discovery followed by one or more profiling jobs. There will be at least one profiling job running in parallel for each file type Data Inventory identifies in the format discovery pass.

The console output includes a link to the job log for the running job. For example:

   2014-09-20 18:17:27,048 INFO [WaterlineData Format Discovery Workflow V2] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1913847052944_0004/
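You can also list the applications YARN is running from the SSH session (a sketch; the yarn CLI is part of the HDP stack):

   yarn application -list   # shows running applications and their tracking URLs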


While the job is running, you can follow this link to see the progress of the MapReduce activity.

Alternatively, you can monitor the progress of these jobs using Hue in a browser. For Hortonworks distributions:

   http://<guest IP address>:8000/jobbrowser

You'll need to specify the waterlinedata user.

Monitoring local jobs

After the Hadoop jobs complete, Waterline Data Inventory runs local jobs to process the data collected in the repository. You can follow the progress of these jobs by watching console output in the command window in which you started the job.

Debugging information

There are multiple sources of debugging information available for Data Inventory. If you encounter a problem, collect the following information for Waterline Data support. (A quick way to pull problem lines from the log files is sketched after this list.)

• Job messages on the console
  Waterline Data Inventory generates console output for jobs run at the command prompt, such as:

      /home/waterlinedata/waterlinedata/bin/waterline profile

  If a job encounters problems, review the console output for clues. To report errors to Waterline Data support, copy this output into a text file or email so we can follow what occurred. These messages appear on the console but are also collected in a log file at debug logging level:

      /var/log/waterline/wds-inventory.log

• Web server console output
  The embedded web server, Jetty, produces output corresponding to user interactions with the browser application. These messages appear on the console but are also collected in a log file at debug logging level:

      /var/log/waterline/waterlinedata.log

  Use tail to see the most recent entries in the log:

      tail -f /var/log/waterline/waterlinedata.log

• Lucene search indexes
  In some cases, it may be useful to examine the search indexes produced by the product. These indexes are found in the following directory:

      /opt/waterlinedata/var/index

• Waterline Data Inventory repository
  In some cases it may be useful to examine the actual repository files produced by the product. The repository datastore is found in the following directory:

      /var/lib/waterline/db/waterlinedatastore
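Both log files flag problems with "ERROR" or "exception". A quick sketch for pulling the most recent problem lines:

   grep -iE "error|exception" /var/log/waterline/wds-inventory.log | tail -20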

Configuring additional Waterline Data Inventory functionality

Waterline Data Inventory provides a number of configuration settings and integration interfaces to enable extended functionality. The following sections describe a subset of those properties that you may find useful for evaluating the product in the VM environment.

Profiling functionality

The following properties control how Waterline Data Inventory collects data from HDFS files.

Using samples to calculate data metrics

By default, Waterline Data Inventory uses all data in files to calculate field-level metrics such as the minimum and maximum values, the cardinality and density of the values, and the most frequent values. You can achieve better profiling performance by sampling the file data for these operations. When sampling is enabled, Waterline Data Inventory reads the first and last blocks in the file and enough other blocks to reach the sample fraction you specify. For example, with a sample fraction of 10%, Waterline Data Inventory will read 6 blocks of a 250 MB file, including the first block, the last block, and 4 additional blocks chosen at random (assuming a 4096 KB block size).

[profiler.properties file]
waterlinedata.profile.sampled=false (by default)
waterlinedata.profile.sampled.fraction=0.1 (by default)
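To enable sampling, set the property to true in profiler.properties (the fraction shown keeps the 10% example above):

   waterlinedata.profile.sampled=true
   waterlinedata.profile.sampled.fraction=0.1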

Re-profiling existing files

By default, Waterline Data Inventory only profiles new files or files that have changed since the last profiling job. Change the following property to false to re-profile all files in the target directory. You might choose to do this if you add data formats or change other parameters that affect the profiling data collected.

[profiler.properties file]
waterlinedata.incremental=true (by default)


Configuring additional date formats

When Waterline Data Inventory profiles string data, such as in delimited files where no type information is available, it examines the data to infer likely data types. It uses the format conventions described by the International Components for Unicode (ICU) for dates and numeric values. You can add your own date formats using the conventions described here:

   http://icu-project.org/apiref/icu4j/com/ibm/icu/text/SimpleDateFormat.html

The pre-defined formats are listed in the profiler properties file.

[profiler.properties file]
waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm, EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd, yy-MM-dd HH:mm:ss, yy/M/dd, M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx
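For example, to also recognize European-style dates such as 31.12.2014, you could append a dd.MM.yyyy pattern to the existing list (an illustrative addition, not a pre-defined format):

   waterlinedata.profile.datetime.formats=<existing formats>, dd.MM.yyyy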

Controlling most-frequent data values

Waterline Data Inventory collects 750 of the most frequent values in each field in each file. You can change the number of values collected, control how many characters are included in each sample, and control how many of these values are used in search indexes and to propagate tags.

Number of most-frequent values collected
[profiler.properties file]
waterlinedata.profile.top_k_capacity=2000 (by default)

Size limit of strings
[profiler.properties file]
waterlinedata.max.top_k_length=128 (by default)

Number of most-frequent values used in search indexes
[profiler.properties file]
waterlinedata.profile.top_k=50 (by default)

Number of most-frequent values used to determine tag association matches
[profiler.properties file]
waterlinedata.profile.top_k_tokens=100 (by default)

Number of most-frequent values shown in the user interface for a given field
[profiler.properties file]
waterlinedata.profile.top_k_capacity_tokens=750


Hive functionality

The following properties control interaction with Hive. For Hive connection information, see Communication between Waterline Data Inventory and Hive.

Hive table profiling

By default, Waterline Data Inventory does not profile Hive tables: from the Hive root in the browser application, users will see Hive tables, but schema-level details for the tables are not available. To include Hive tables in Waterline Data Inventory profiling jobs, set the following option to 'true'.

[profiler.properties file]
waterlinedata.profilehive=false (by default)
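For example, to have profiling jobs include Hive tables, set the property to true in profiler.properties:

   waterlinedata.profilehive=true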

Hive table creation

Waterline Data Inventory lets users designate a file or directory as the source for creating a Hive table in the Hive datastore associated with the cluster. In addition, you can enable Waterline Data Inventory to create Hive tables for each file in HDFS at the time Waterline Data Inventory first profiles the file. Turning on this option has a large performance impact on profiling.

[profiler.properties file]
waterlinedata.createhivetables=false (by default)

Discovery functionality

The following properties control how Waterline Data Inventory makes suggestions for lineage relationships among files and for tag associations.

Threshold for which suggestions are exposed

Waterline Data Inventory assigns a weight to its suggestions for matching tag associations. You can expose more or fewer of these suggestions by configuring the cutoff weight. Tag associations whose calculated weight is below this value are not exposed to users.

[discovery.properties file]
waterlinedata.discovery.tolerance.weight=40.0 (by default)

Limit on the number of pre-defined tags that will be suggested for a given field:

   waterlinedata.discovery.tags.max_suggested_ref_tables=3

Limit on the number of tags of any kind that will be suggested for a given field:

   waterlinedata.discovery.tags.max_suggested=3


Eliminating weak associations

If more than one tag is suggested for a field, the tag with the highest weight is suggested; other tags must be within this value of the top tag's weight to be suggested in addition to the top tag.

   waterlinedata.discovery.tags.value_hit_diff=20.0

Controlling collections discovery

By default, Waterline Data Inventory only considers folders with 3 or more files (in any one folder of a recursive tree) to be candidates for a collection. You can adjust this value to better reflect the organization of your cluster. Note that there are other qualifications that must be met before the files in the folder are marked as a collection.

[discovery.properties file]
waterlinedata.discovery.smallest.collection.size=3 (by default)

Controlling lineage relationship discovery

When reviewing files for lineage relationships, Waterline Data Inventory can tolerate a number of changes to file schemas and data and still find a connection among files. These properties control the parameters used to determine a lineage relationship. (A combined discovery.properties sketch follows this section.)

The amount of overlapping data between fields required to consider the files matching:

   waterlinedata.discovery.lineage.ovelap=0.9 (by default)

If multiple fields from one resource match the fields from another resource, Waterline Data Inventory uses field names to determine which fields match. This mechanism is used only if field names are similar within the percentage indicated by this property, 0.8 (80%) by default:

   waterlinedata.discovery.lineage.field_name_match=0.8

Use the HDFS last-access date to limit lineage relationship candidates. The HDFS property dfs.namenode.accesstime.precision in hdfs-site.xml must be enabled:

   waterlinedata.discovery.lineage.use_access_time_filter=true

Limit on the time between access of a parent file and creation of a child. This criterion is ignored (no time checking) if set to 0:

   waterlinedata.discovery.lineage.batch_window_hours=24
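Pulling these settings together, a sketch of a discovery.properties fragment; the values are illustrative only, not recommendations:

   waterlinedata.discovery.tolerance.weight=30.0          # expose more tag suggestions
   waterlinedata.discovery.tags.max_suggested=5           # allow up to 5 suggested tags per field
   waterlinedata.discovery.lineage.ovelap=0.95            # require more data overlap for lineage
   waterlinedata.discovery.lineage.batch_window_hours=48  # widen the parent/child time window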

Accessing Hive tables

Waterline Data Inventory makes it easy to create Hive tables from files in your cluster. You can access the Hive instance on the guest through Hue or by connecting to Hive from other third-party query or analysis tools.


Viewing Hive tables in Hue

You can access the Hive tables in your cluster through Hue using the Beeswax query tool:

   http://<guest IP address>:8000/beeswax

Connecting to the Hive datastore

To access Hive tables from Tableau or another analysis tool, you'll need to configure a connection to the Hive datastore on the cluster. For a Waterline Data-supplied cluster, use the following connection information:

Parameter        Value
Server           Use the same server IP address as you use for Waterline Data Inventory
Port             10000
Server Type      HiveServer2
Authentication   Username and Password
Username         hive
Password         hive
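Before configuring a BI tool, you can sanity-check these parameters from the guest with the beeline client that ships with HDP (a sketch using the values above):

   beeline -u jdbc:hive2://localhost:10000 -n hive -p hive -e "show tables;"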

 
