Searching across the cluster

Have you ever seen the user manual for Amazon's shopping interface? Neither have we. The ease of use of the consumer shopping site inspired us when we created the search interface for Waterline Data Inventory: provide search options, or "facets," that make sense for the kinds of items you are looking for and dynamically update the facets as you refine your search.

Waterline Data Inventory provides keyword searching and pre-defined facets for file and field properties. In addition, you can make facets from tags: the most frequent values from the fields with those tags become search options.

This tutorial steps you through the search capabilities of the product so you can make your tagging efforts even more powerful. It includes the following end-user tasks:
• Search using keywords
• Search using facets
• Search from the Advanced Search page
• Search using tags
• Search using origins
• Create your own facets
This tutorial refers to sample data pre-loaded in the Waterline Data Inventory virtual machine images and cloud sandboxes. If you don't already have access to one of these evaluation tools, contact [email protected].
Search using keywords

Yes, it's that easy: enter a word, partial word, or phrase in the search box at the top of the Waterline Data Inventory screen.

[Figure: the search box, with the callout "Type text here."]
The application matches your text against the profiling data that Waterline Data Inventory collects when it profiles files. This data includes:
• File names and folder names (but not path components)
• Field names
• Data values from the top 50 most frequent values in each field
• Tag names and descriptions
• Origin names and descriptions

© 2014-2015 Waterline Data, Inc. All rights reserved.
If your system is configured to profile Hive tables, the search includes the same kind of data from Hive tables, including the table name.

To illustrate how keyword searches work:

1. At the top of the Waterline Data Inventory screen, in the Global Search box, enter "industry". The file-level results show four files.

A quick review indicates that none of these files match by file name, none have tags, and none of the sample origins include the word "industry". So why are these files here? They are here because they have fields that do match the search term.

2. Click Fields to shift to the field-level search results.

Ah, now it's more obvious: four items clearly match by field name. The other two? Not by field name or tag name. "Origin" applies only at the file level, so the match must be on tag descriptions or data values. As it happens, the fields have only one tag and its description doesn't include the word "industry". It must be that the search term matches on field data.
3. Show sample values for the field. Select in the row of one of the two user.description fields: not the field name or the containing file name, but elsewhere in the row. The right pane displays information about that field.

The right pane shows three tabs of field-level information. In the Values tab you can quickly see that the text in this field is not predictable or consistent and could easily match the search text. Select the other of the two user.description fields. Notice that the Values tab shows the same data. Also, the field names in the two files are the same; the two files are most likely copies of each other.

4. Search again, this time on "entertainment". When you enter new text in the search box, it starts a new search. The results in this case show one file that matches on file name and three additional files that don't have an obvious match.
However, when we look at the field-level results, we see all of the fields from times_square_entertainment_venues.csv.

When a file matches the search criteria, all of the fields in the file appear in the field-level search results. Likewise, a file appears in the file-level results if any of its fields match the search criteria.
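The relationship between file-level and field-level results can be sketched in a few lines of Python. This is only an illustrative model, not the product's implementation: the File and Field classes, the search function, the case-insensitive substring matching, and the sample field names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    # Stand-in for the most frequent values collected by profiling.
    top_values: list[str] = field(default_factory=list)


@dataclass
class File:
    name: str
    fields: list[Field] = field(default_factory=list)


def field_matches(fld: Field, term: str) -> bool:
    """A field matches on its name or on any of its profiled values."""
    term = term.lower()
    if term in fld.name.lower():
        return True
    return any(term in value.lower() for value in fld.top_values)


def search(files: list[File], term: str):
    """Return (file_results, field_results) for a keyword.

    A file that matches by name contributes all of its fields;
    otherwise the file appears only if at least one field matches,
    and only the matching fields are listed.
    """
    file_results, field_results = [], []
    for f in files:
        matching = [fld for fld in f.fields if field_matches(fld, term)]
        if term.lower() in f.name.lower():
            file_results.append(f.name)
            field_results.extend((f.name, fld.name) for fld in f.fields)
        elif matching:
            file_results.append(f.name)
            field_results.extend((f.name, fld.name) for fld in matching)
    return file_results, field_results


# Hypothetical sample data: the field names and values are invented.
sample = [
    File("times_square_entertainment_venues.csv",
         [Field("venue_name"), Field("address")]),
    File("tweets.json",
         [Field("user.description", ["Entertainment news daily"]),
          Field("user.id", ["12345"])]),
]
file_hits, field_hits = search(sample, "entertainment")
```

In this sketch, the first file matches by name, so both of its fields are listed; the second file appears only because one of its fields matches on data values.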
Search using facets

The left pane of the search results page shows you facets to refine your search results. Each facet provides a list of items or ranges that include the values represented in the search results. For example, the Content Type facet can include any type of file Waterline Data Inventory supports, but only the values that apply to the current search results show in the list.

[Figure: Facet values in the left pane describe the content of the search results]

The same facets appear in Advanced Search so you can start your search with facet values selected.

To refine search results using facets:

1. Search for "nyc". This search returns lots of results in the sample data set, almost half the files in the cluster. Where to start!

2. In the left pane, look for facet counts that can help you refine your results. Notice that each facet value is followed by a number in parentheses: that's the number of items in the search results that match this facet value. As you make choices among the other facets, the count changes to reflect the content in the middle pane.
Here are some examples of facet choices you might make depending on what you are looking for:

• In the facet "US State", select NEW YORK. This choice limits the search results to fields (and their containing files) tagged with the US State tag and whose values include "NEW YORK". Selecting this facet isn't the same as a keyword search on the text. Instead of a general search across the cluster, selecting the facet value:
    • Considers only the contents of the current search results, not the whole cluster (left pane vs. Advanced Search).
    • Considers only fields or files that are tagged with the US State tag.
    • Returns only fields (and their containing files) that include the specific value "NEW YORK".
  You might search this way when you know you want New York state-specific results, not New York city results or restaurants named "New York Pizza."

• In the facet "Origin", select "NYC Open". This choice limits the search results to files that can trace their lineage back to the "NYC Open" landing folder in the cluster. The field-level results include only the fields from these files that were already in the original search. You might search this way to begin limiting an overly large search result to help understand the results: toggle between origin values so you see which files came from where and, potentially, which files include data from more than one origin.

• In the facet "Data Field Data Range", select January 1, 2008 to December 31, 2008. This choice identifies data from the search results that include dates that could potentially overlap with this date range. The quality of the results depends on the data: if you have a missing date that has been replaced by "01/01/1900" and the rest of the dates are later than 2010, the file qualifies for any date range between 1900 and the most recent date in the file.
3. Try selecting multiple values in the same facet. You'll notice that the results include items that match any of the choices (an OR relationship among search criteria).

4. Try selecting values in multiple facets. You'll notice that the results include only items that match all of the choices (an AND relationship among search criteria).

5. Include a keyword filter. The top of the left pane includes a text box where you can filter the search results with keywords. This box applies to the existing search results; the search box at the top of the screen starts a new search.
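The OR-within-a-facet and AND-across-facets behavior from steps 3 and 4, along with the facet counts from step 2, can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not how Waterline Data Inventory implements search: the function names, the dictionary layout, the file names, and the "Twitter" origin are hypothetical.

```python
from collections import Counter


def apply_facets(items, selections):
    """Keep items that satisfy every selected facet.

    items:      list of {"name": str, "facets": {facet: set of values}}
    selections: {facet: set of chosen values}
    Within one facet, any chosen value is enough (OR).
    Across facets, every facet must be satisfied (AND).
    """
    results = []
    for item in items:
        facets = item["facets"]
        if all(facets.get(facet, set()) & chosen
               for facet, chosen in selections.items() if chosen):
            results.append(item)
    return results


def facet_counts(items, facet):
    """Count how many of the given items carry each value of a facet."""
    counts = Counter()
    for item in items:
        counts.update(item["facets"].get(facet, set()))
    return counts


# Hypothetical search results.
results = [
    {"name": "a.csv", "facets": {"Content Type": {"csv"}, "Origin": {"NYC Open"}}},
    {"name": "b.json", "facets": {"Content Type": {"json"}, "Origin": {"NYC Open"}}},
    {"name": "c.csv", "facets": {"Content Type": {"csv"}, "Origin": {"Twitter"}}},
]

# Two values in one facet: OR, so all three items remain.
either_type = apply_facets(results, {"Content Type": {"csv", "json"}})

# Values in two facets: AND, so only a.csv remains.
csv_from_nyc = apply_facets(
    results, {"Content Type": {"csv"}, "Origin": {"NYC Open"}})

# Counts are recomputed from whatever remains after refining.
remaining_types = facet_counts(
    apply_facets(results, {"Origin": {"NYC Open"}}), "Content Type")
```

Recomputing the counts from the filtered list is what makes the numbers in parentheses change as you select values in other facets.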
Search from the Advanced Search page

All of the detailed facets included in the left pane are also available in the Advanced Search page: click Advanced Search in the top toolbar.

From the Advanced Search page, you see all of the facets available across the cluster rather than only the facets that apply to the current search results. This gives you freedom to identify exactly the data you are looking for; it also means you can pick combinations that don't return any results. Like any search, if you aren't finding what you expect, make your criteria more general. Then use the facets in the left pane to refine the results. Use the facet counts as clues to your data!
Search using tags

After you've tagged fields in your data and run Waterline Data Inventory profile jobs, searching on tags can be a very powerful way to explore your data. For details on tagging, see "Leveraging tags on familiar data."

To search using tags:

1. Open the Advanced Search page.

2. In Tags and Origins, Tags section, type "cuisine" in the filter and choose the "Food Service.Cuisine" tag.

3. Click Search.

4. Switch to the Field results. In the sample data, this tag was manually applied to a single field; it will be propagated to other fields, but not to files.

You aren't limited to using tags for searching from the Advanced Search page: tags can be useful in refining search results as well. Look for the Tags facet in the left pane and for the tag counts provided to help you understand your search results.
Search using origins

Like tags, origins can be very powerful for identifying important files in search results. When you search on an origin, you are limiting the search results to files that have a known relationship with files found in a specific landing folder in the cluster. Typically, the landing folder represents the location where files from outside Hadoop arrived in the cluster. This source location can be identified and controlled; when you limit your search to these files and files derived from them, you have a tool to help guarantee, or at least trace, the integrity of the data.

For more information about origins and landing folders, see the tutorial "Error! Reference source not found."
Create your own facets

You can create your own facets to help search in your data. Waterline Data Inventory lets you identify a tag to be used as a facet. The tag becomes the facet and the data values from the fields associated with the tag become the selection values. For example, if we make a facet out of the tag "Food Service.Cuisine", the facet would be "Food Service.Cuisine" and the values would be "American", "Bakery", "Chinese", "Diner", and so on.

When you find that searching on specific data values would improve your searching, think about what tags and tag associations would be valuable. Some things to consider when you are selecting tags for facets:

• Representative data, not exhaustive lists. The data used to create facet values include the most frequent 50 values for each tagged field. If you have very large files, this set may not be complete. Will it confuse users to not show every possible value?

• Unexpected or unrepresentative values. The opposite problem is too many values: if you have "bad" data in files, these values may appear in the list of facet values. For example, if you have a small file with 49 appropriate values and one garbage value, such as a footer in the file that didn't parse correctly, that value appears in your facet. Often the garbage value shows up at the top of the list because it starts with punctuation or spaces. Similarly, you probably wouldn't want to use a tag associated with free-text fields such as Twitter text: the values don't show all the values in the file and potentially show inappropriate content.

• Case-sensitive values. Facet values are taken directly from the data in the files; if the same value appears in field data as both lower and upper case, it will appear twice in the facet list, once lower case and once upper case. Because of how ASCII text is sorted, these two values probably won't appear next to each other in the list.

• Approved tags. Because all field data is used to create the facet values, you'll want to review the fields associated with the tag to make sure that all the suggested associations are accurate.
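To make these considerations concrete, here is a small Python sketch of how a facet value list might be assembled from tagged fields. It assumes, for illustration only, that the list is the union of each field's 50 most frequent values sorted in plain character-code order; the function names and sample data are hypothetical and do not describe the product's internals.

```python
from collections import Counter

TOP_N = 50  # the tutorial states that the 50 most frequent values are used


def top_values(values, n=TOP_N):
    """Return up to n distinct values, most frequent first."""
    return [value for value, _count in Counter(values).most_common(n)]


def facet_values(tagged_fields, n=TOP_N):
    """Merge the top values of every tagged field into one sorted list.

    Values are compared exactly as stored, so "American" and
    "american" are treated as two different facet values.
    """
    merged = set()
    for values in tagged_fields:
        merged.update(top_values(values, n))
    # Python sorts strings by code point: space and punctuation come
    # first, then upper-case letters, then lower-case letters.
    return sorted(merged)


# Hypothetical data for two fields tagged "Food Service.Cuisine".
field_a = ["American", "Bakery", "american", "American"]
field_b = [" -- end of report --", "Chinese", "Diner"]

values = facet_values([field_a, field_b])
# values == [" -- end of report --", "American", "Bakery",
#            "Chinese", "Diner", "american"]
```

The garbage footer sorts to the top because it starts with a space, and "american" lands at the end of the list, far from "American", which matches the behavior described in the bullets above.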
Before you turn all your tags into facets, however, keep in mind that additional facets in the search index affect performance. For example, additional facets require more time to generate facet values for the Advanced Search page. You may also see a performance change when generating facets for browse views of large directories. Generally speaking, adding a facet or three won't be noticeable; adding many facets, however, can distract your users from the value of the facets.
To create a custom facet:

1. Click Manage in the top toolbar.

2. From the left tab list, open the Data Facets pane.

3. Search for the Food Service.Cuisine tag. For example, begin typing "cui" in the search box. In this view, the tags appear in their hierarchy, so "Cuisine" shows up as one of the tags in the category "Food Service".

4. Click Add as Data Facets.

5. To test the results, open the Advanced Search page and select "Tex-Mex" from the Food Service.Cuisine facet list. You may need to search or scroll down in the list to find "Tex-Mex".

6. Click Search.

If you are like the staff at Waterline Data Science, we expect that you'll be finding data you didn't know you had and values in data that you don't necessarily want. We'd like to hear your ideas on how to make search as useful as possible without focusing users on problem data instead of the valuable data.
Searching with keywords and facets gives you swift access to incredible detail, without coding and without waiting to load entire files. Now you can experiment and refine results without consequence. Search effectively!