Searching across the cluster


Have you ever seen the user manual for Amazon's shopping interface? Neither have we. The ease of use of the consumer shopping site inspired us when we created the search interface for Waterline Data Inventory: provide search options, or "facets," that make sense for the kinds of items you are looking for and dynamically update the facets as you refine your search. Waterline Data Inventory provides keyword searching and pre-defined facets for file and field properties. In addition, you can make facets from tags: the most frequent values from the fields with those tags become search options.

This tutorial steps you through the search capabilities of the product so you can make your tagging efforts even more powerful. It includes the following end-user tasks:

• Search using keywords
• Search using facets
• Search from the Advanced Search page
• Search using tags
• Search using origins
• Create your own facets

This tutorial refers to sample data pre-loaded in the Waterline Data Inventory virtual machine images and cloud sandboxes. If you don't already have access to one of these evaluation tools, contact [email protected].

Search using keywords

Yes, it's that easy: enter a word, partial word, or phrase in the search box at the top of the Waterline Data Inventory screen.


The application matches your text against text in the profiling data that Waterline Data Inventory collects. This data includes:

• File names and folder names (but not path components)
• Field names
• Data values from the top 50 most frequent values in each field
• Tag names and descriptions
• Origin names and descriptions

© 2014-2015 Waterline Data, Inc. All rights reserved.

If your system is configured to profile Hive tables, the search includes the same kind of data from Hive tables, including the table name.

To illustrate how keyword searches work:

1. At the top of the Waterline Data Inventory screen, in the Global Search box, enter "industry".
   The file-level results show 4 files.
   A quick review indicates that none of these files match by file name, none have tags, and none of the sample origins include the word "industry". So why are these files here? They are here because they have fields that match the search term.
2. Click Fields to shift to the field-level search results.
   Ah, now it's more obvious: four items clearly match by field name. The other two? Not by field name or tag name. "Origin" applies only at the file level, so the match must be on tag descriptions or data values. As it happens, the fields have only one tag, and it doesn't have the word "industry" in the tag description. The search term must match on field data.


3. Show sample values for the field.
   Select in the row of one of the two user.description fields: not the field name or the containing file name, but elsewhere in the row. The right pane displays information about that field.
   The right pane shows three tabs of field-level information. In the Values tab you can quickly see that the text in this field is not predictable or consistent and could easily match the search text.
   Select the other of the two user.description fields. Notice that the Values tab shows the same data. Also, the field names in the file are the same; the two files are most likely copies of each other.
4. Search again, this time on "entertainment".
   When you enter new text in the search box, it starts a new search.
   The results in this case show one file that matches on file name and three additional files that don't have an obvious match.


   However, when we look at the field-level results, we see all of the fields from times_square_entertainment_venues.csv.
   When a file matches the search criteria, all of the fields in the file show in the search results, just as the file shows if any of its fields match the search criteria.
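The relationship between file-level and field-level results in this walkthrough can be pictured with a small sketch. It is not Waterline Data Inventory's implementation: the `FileProfile` and `FieldProfile` classes, the `search` function, and the case-insensitive substring matching are assumptions made for illustration, and `tweets.json`, `venue_name`, `address`, and `user.id` are hypothetical sample names.

```python
from dataclasses import dataclass, field


@dataclass
class FieldProfile:
    name: str
    top_values: list[str] = field(default_factory=list)  # most frequent values only
    tags: list[str] = field(default_factory=list)        # tag names and descriptions


@dataclass
class FileProfile:
    name: str
    origins: list[str] = field(default_factory=list)     # origin names and descriptions
    fields: list[FieldProfile] = field(default_factory=list)


def contains(term: str, texts: list[str]) -> bool:
    """Partial-word match of term against any text (case-insensitivity is assumed)."""
    needle = term.lower()
    return any(needle in text.lower() for text in texts)


def search(term: str, files: list[FileProfile]):
    """Return (file-level hits, field-level hits) for a keyword."""
    file_hits, field_hits = [], []
    for f in files:
        direct = contains(term, [f.name, *f.origins])
        matching = [fp for fp in f.fields
                    if contains(term, [fp.name, *fp.top_values, *fp.tags])]
        # A file shows if it matches itself or if any of its fields match.
        if direct or matching:
            file_hits.append(f.name)
        # A directly matched file brings along all of its fields.
        shown = f.fields if direct else matching
        field_hits.extend((f.name, fp.name) for fp in shown)
    return file_hits, field_hits


files = [
    FileProfile("times_square_entertainment_venues.csv",
                fields=[FieldProfile("venue_name"), FieldProfile("address")]),
    FileProfile("tweets.json",
                fields=[FieldProfile("user.description",
                                     top_values=["Entertainment news junkie"]),
                        FieldProfile("user.id")]),
]
file_hits, field_hits = search("entertainment", files)
```

With this sample, `file_hits` lists both files, while `field_hits` holds every field of the venues file (matched by file name) but only `user.description` from the second file (matched by field data).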

   


Search using facets

The left pane of the search results page shows you facets to refine your search results. Each facet provides a list of items or ranges that include the values represented in the search results. For example, the Content Type facet can include any type of file Waterline Data Inventory supports, but only the values that apply to the current search results show in the list.

Facet values in the left pane describe the content of the search results

The same facets appear in Advanced Search so you can start your search with facet values selected.

To refine search results using facets:

1. Search for "nyc".
   This search returns lots of results in the sample data set, almost half the files in the cluster. Where to start?
2. In the left pane, look for facet counts that can help you refine your results.
   Notice that each facet value is followed by a number in parentheses: that's the number of items in the search results that match this facet value. As you make choices among the other facets, the number changes to reflect the content in the middle pane.


   Here are some examples of facet choices you might make depending on what you are looking for:

   • In the facet "US State", select NEW YORK.
     This choice limits the search results to fields (and their containing files) tagged with the US State tag and whose values include "NEW YORK".
     Selecting this facet isn't the same as a keyword search on the text. Instead of a general search across the cluster, selecting the facet value:
       • Considers only the contents of the current search results, not the whole cluster (when selected in the left pane rather than in Advanced Search).
       • Considers only fields or files that are tagged with the US State tag.
       • Returns only fields (and their containing files) that include the specific value "NEW YORK".
     You might search this way when you know you want New York state-specific results, not New York city results or restaurants named "New York Pizza."

   • In the facet "Origin", select "NYC Open".
     This choice limits the search results to files that can trace their lineage back to the "NYC Open" landing folder in the cluster. The field-level results include only the fields from these files that were already in the original search.
     You might search this way to begin limiting an overly large search result to help understand the results: toggle between origin values so you see which files came from where and, potentially, which files include data from more than one origin.


   • In the facet "Data Field Data Range", select January 1, 2008 to December 31, 2008.
     This choice identifies data from the search results that include dates that could potentially overlap with this date range. The quality of the results depends on the data: if you have a missing date that has been replaced by "01/01/1900" and the rest of the dates are later than 2010, the file qualifies for any date range between 1900 and the most recent date in the file.

3. Try selecting multiple values in the same facet.
   You'll notice that the results include items that match any of the choices (an OR relationship among search criteria).
4. Try selecting values in multiple facets.
   You'll notice that the results include only items that match all of the choices (an AND relationship among search criteria).
5. Include a keyword filter.
   The top of the left pane includes a text box where you can filter the search results with keywords. This box applies to the existing search results; the search box at the top of the screen starts a new search.
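The facet rules in the steps above (OR within a facet, AND across facets, counts per value, and date ranges that "could potentially overlap") can be sketched in a few lines. This is an illustration under stated assumptions, not the product's code: the `Item` class and the `matches`, `refine`, `facet_counts`, and `may_overlap` functions are hypothetical, the file names and the "Twitter" origin are made up, and the date test assumes that only the earliest and latest date of an item are compared with the selected range.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    name: str
    facets: dict[str, set[str]]  # facet name -> values present in the item
    first_date: date             # earliest date found in the item
    last_date: date              # latest date found in the item


def matches(item: Item, selection: dict[str, set[str]]) -> bool:
    """OR among values chosen in one facet, AND across different facets."""
    return all(
        item.facets.get(facet, set()) & chosen  # non-empty intersection: any value matches
        for facet, chosen in selection.items()
        if chosen                               # a facet with nothing selected is ignored
    )


def refine(items: list[Item], selection: dict[str, set[str]]) -> list[str]:
    return [item.name for item in items if matches(item, selection)]


def facet_counts(items: list[Item], facet: str) -> Counter:
    """Number of items in the current results that carry each facet value."""
    return Counter(value for item in items for value in item.facets.get(facet, ()))


def may_overlap(item: Item, start: date, end: date) -> bool:
    """True when the item's earliest-to-latest span intersects [start, end]."""
    return item.first_date <= end and item.last_date >= start


items = [
    Item("a.csv", {"Origin": {"NYC Open"}, "US State": {"NEW YORK"}},
         date(2008, 2, 1), date(2008, 11, 30)),
    # One placeholder date of 01/01/1900 stretches this item's span.
    Item("b.csv", {"Origin": {"Twitter"}, "US State": {"NEW YORK", "NEW JERSEY"}},
         date(1900, 1, 1), date(2014, 5, 1)),
    Item("c.csv", {"Origin": {"NYC Open"}, "US State": {"NEW JERSEY"}},
         date(2011, 1, 1), date(2014, 5, 1)),
]

any_state = refine(items, {"US State": {"NEW YORK", "NEW JERSEY"}})
ny_and_open = refine(items, {"US State": {"NEW YORK"}, "Origin": {"NYC Open"}})
state_counts = facet_counts(items, "US State")
in_2008 = [i.name for i in items if may_overlap(i, date(2008, 1, 1), date(2008, 12, 31))]
```

Here `any_state` keeps all three items, `ny_and_open` keeps only `a.csv`, and `in_2008` includes `b.csv` even though its only pre-2010 date is the 1900 placeholder, which is the data-quality effect described for the date range facet.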

Search from the Advanced Search page

All of the detailed facets included in the left pane are also available in the Advanced Search page: click Advanced Search in the top toolbar.

From the Advanced Search page, you see all of the facets available across the cluster rather than only the facets that apply to the current search results. This gives you freedom to identify exactly the data you are looking for; it also means you can pick combinations that don't return any results. As with any search, if you aren't finding what you expect, make your criteria more general. Then use the facets in the left pane to refine the results. Use the facet counts as clues to your data!

Search using tags

After you've tagged fields in your data and run Waterline Data Inventory profile jobs, searching on tags can be a very powerful way to explore your data. For details on tagging, see "Leveraging tags on familiar data."

To search using tags:

1. Open the Advanced Search page.
2. In the Tags section of Tags and Origins, type "cuisine" in the filter and choose the "Food Service.Cuisine" tag.


3. Click Search.
4. Switch to the Field results.
   In the sample data, this tag was manually applied to a single field; it will be propagated to other fields, but not to files.

You aren't limited to using tags for searching from the Advanced Search page: tags can be useful in refining search results as well. Look for the Tags facet in the left pane and for the tag counts provided to help you understand your search results.

   

Search using origins

Like tags, origins can be very powerful for identifying important files in search results. When you search on an origin, you are limiting the search results to files that have a known relationship with files found in a specific landing folder in the cluster. Typically, the landing folder represents the location where files from outside Hadoop arrived in the cluster. This source location can be identified and controlled; when you limit your search to these files and files derived from them, you have a tool to help guarantee, or at least trace, the integrity of the data.

For more information about origins and landing folders, see the tutorial on that topic.
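One way to picture an origin search is as a walk over lineage relationships that starts from the files in a landing folder. The sketch below assumes the lineage is already available as a mapping from each derived file to its source files; how Waterline Data Inventory actually discovers and stores lineage is not described in this tutorial, and the `files_from_origin` function and all paths are hypothetical.

```python
from collections import deque


def files_from_origin(landing_folder: str, files: list[str],
                      derived_from: dict[str, list[str]]) -> set[str]:
    """Files in the landing folder plus every file derived from them."""
    prefix = landing_folder.rstrip("/") + "/"
    start = [f for f in files if f.startswith(prefix)]

    # Invert the assumed child -> parents mapping into parent -> children.
    children: dict[str, list[str]] = {}
    for child, parents in derived_from.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first walk from the landing-folder files through derived files.
    seen = set(start)
    queue = deque(start)
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


files = [
    "/landing/nyc_open/restaurants.csv",
    "/landing/twitter/tweets.json",
    "/work/restaurants_clean.csv",
    "/work/joined.csv",
    "/work/unrelated.csv",
]
derived_from = {
    "/work/restaurants_clean.csv": ["/landing/nyc_open/restaurants.csv"],
    "/work/joined.csv": ["/work/restaurants_clean.csv", "/landing/twitter/tweets.json"],
}
nyc_open_files = files_from_origin("/landing/nyc_open", files, derived_from)
twitter_files = files_from_origin("/landing/twitter", files, derived_from)
```

In this sample, `/work/joined.csv` appears under both origins, which corresponds to a file that includes data from more than one origin.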


Create your own facets

You can create your own facets to help search in your data. Waterline Data Inventory lets you identify a tag to be used as a facet. The tag becomes the facet and the data values from the fields associated with the tag become the selection values. For example, if we make a facet out of the tag "Food Service.Cuisine", the facet would be "Food Service.Cuisine" and the values would be "American", "Bakery", "Chinese", "Diner", and so on.

When you find that searching on specific data values would improve your searching, think about what tags and tag associations would be valuable.

Some things to consider when you are selecting tags for facets:

• Representative data, not exhaustive lists.
  The data used to create facet values includes the 50 most frequent values for each tagged field. If you have very large files, this set may not be complete. Will it confuse users to not show every possible value?

• Unexpected or unrepresentative values.
  The opposite problem is too many values: if you have "bad" data in files, these values may appear in the list of facet values. For example, if you have a small file with 49 appropriate values and one garbage value, such as a footer in the file that didn't parse correctly, that value appears in your facet. Often the garbage value shows up at the top of the list because it starts with punctuation or spaces. Similarly, you probably wouldn't want to use a tag associated with free-text fields such as Twitter text: the facet values don't show all the values in the file and potentially show inappropriate content.

• Case-sensitive values.
  Facet values are taken directly from the data in the files; if the same value appears in field data as both lower and upper case, it will appear twice in the facet list, once lower case and once upper case. Because of how ASCII text is sorted, these two values probably won't appear next to each other in the list.

• Approved tags.
  Because all field data is used to create the facet values, you'll want to review the fields associated with the tag to make sure that all the suggested associations are accurate.
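The considerations above are easier to see with a small example. The following sketch assumes that a facet's value list is the union of each tagged field's most frequent values, sorted as plain strings; the `facet_values` function and the sample columns are hypothetical, and only the limit of 50 comes from this tutorial.

```python
from collections import Counter

TOP_N = 50  # the tutorial states that the 50 most frequent values per field are used


def facet_values(tagged_fields: list[list[str]], top_n: int = TOP_N) -> list[str]:
    """Union of each tagged field's most frequent values, in code-point order."""
    values: set[str] = set()
    for column in tagged_fields:
        values.update(value for value, _ in Counter(column).most_common(top_n))
    # Plain string sort: a leading space sorts before letters,
    # and A-Z sort before a-z.
    return sorted(values)


# Hypothetical field data from two fields tagged "Food Service.Cuisine".
cuisine_a = ["American", "Bakery", "Chinese", "American", "bakery", "Diner"]
cuisine_b = ["Tex-Mex", "Chinese", "Chinese", " -- end of report --"]

values = facet_values([cuisine_a, cuisine_b])
most_frequent_only = facet_values([cuisine_a, cuisine_b], top_n=1)
```

In `values`, the unparsed footer line lands at the top of the list, and "Bakery" and "bakery" are separate entries that do not sit next to each other. With `top_n=1`, only "American" and "Chinese" survive, which shows how a frequency cutoff yields representative rather than exhaustive values.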

Before you turn all your tags into facets, however, be aware that additional facets in the search index have a performance impact. For example, additional facets require more time to generate facet values for the Advanced Search page. You may also see a performance change when generating facets for browse views of large directories. Generally speaking, adding a facet or three won't be noticeable; adding many facets, however, can distract your users from the value of the facets.


To create a custom facet:

1. Click Manage in the top toolbar.
2. From the left tab list, open the Data Facets pane.
3. Search for the Food Service.Cuisine tag.
   For example, begin typing "cui" in the search box. In this view, the tags appear in their hierarchy, so "Cuisine" shows up as one of the tags in the category "Food Service".
4. Click Add as Data Facets.
5. To test the results, open the Advanced Search page and select "Tex-Mex" from the Food Service.Cuisine facet list.
   You may need to search or scroll down in the list to find "Tex-Mex".
6. Click Search.

If you are like the staff at Waterline Data Science, we expect that you'll be finding data you didn't know you had and values in data that you don't necessarily want. We'd like to hear your ideas on how to make search as useful as possible without focusing users on problem data instead of the valuable data.


Searching with keywords and facets gives you swift access to incredible detail, without coding and without waiting to load entire files. Now you can experiment and refine results without consequence. Search effectively!
