How to Hadoop effortlessly with Waterline Data Inventory

Waterline Data Inventory gives users of Hadoop data a wealth of information about files, tables, and fields to help them identify just the right data. It also provides tools to help describe and easily return to identified data.

How do you get the Waterline advantage? It's a combination of administrative setup and user actions: administrators run jobs on the cluster, and users view and annotate files through a web browser application.

To help you understand the basics of how Waterline Data Inventory works, we've installed our product in some of the popular Hadoop evaluation sandboxes. We've included sample data, including copies and extracts of files like those you might find in a working cluster. This document describes what you would do to set up and run Waterline Data Inventory on your cluster, using the evaluation sandbox as an example.

This tutorial refers to sample data pre-loaded in the Waterline Data Inventory virtual machine images, available from go.waterlinedata.com/download-sandbox.
The tasks involved in running and managing Waterline Data Inventory for your cluster are described in these sections, divided into tasks that are run in the Waterline Data Inventory user interface and tasks that are run on the Waterline Data Inventory server:

• (Server) First things first: profile your data
• (UI and Server) Mark landings and run lineage discovery
• (UI and Server) Tag the data you know
• (UI) Leverage discovery results in searches
• (UI) Bookmark files you want to follow
• (Server) Run jobs to keep up with new data and users' tags
First things first: profile your data

For Waterline Data Inventory to give users the rich data-browsing experience they'll want, an administrator runs "profiling" jobs on the server where Waterline Data Inventory is installed. Typically, installation happens on an edge node in your Hadoop cluster. We've done that work for you in the evaluation images; to see how it's done for an existing cluster, see the Waterline Data Inventory Installation and Administration Guide.

© 2014-2015 Waterline Data, Inc. All rights reserved.
Profiling jobs include several separate processes:

• Crawling HDFS files to determine each file's format.
• Reading each HDFS file to extract field metadata and data-quality metrics.
• Inserting the metadata and data into Waterline Data Inventory's repository.
Profiling jobs also include some discovery tasks, which you can also run independently of the profiling jobs:

• Using the repository data to suggest tags on field data, based both on predetermined reference data (such as country names or zip codes) and on data previously tagged by users (such as product codes or sales regions).
• Using the repository data to find collections of data to treat as "partitions" or "snapshots". (More about collections later.)
• Using the repository data to find files that contain the same data.
When you first install Waterline Data Inventory, you'll need to run profiling on every file in the cluster. The initial profile run will take some time, so we recommend that you break up the profiling into sections. The profiling job takes one or more directories as its argument and profiles their contents recursively, so you can start with a small amount of data and then profile the rest of the cluster section by section. For the sandbox sample data, an administrator profiles one section of data at a time by running these profiling jobs from the installation directory, waterlinedata:
Profile Sherlock directory:
bin/waterline profile /user/waterlinedata/Sherlock
Profile only data.gov directory in Landing:
bin/waterline profile /user/waterlinedata/Landing/data.gov
Profile only nyc_open directory in Landing:
bin/waterline profile /user/waterlinedata/Landing/nyc_open
Profile the rest of Landing:
bin/waterline profile /user/waterlinedata/Landing
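The section-by-section approach above could be wrapped in a small shell function. This is only a sketch, assuming it is run from the installation directory; the function name profile_sections and the WATERLINE variable are illustrative and are not part of Waterline Data Inventory. The only product command it uses is the profile command shown above.

```shell
# Sketch: profile the sandbox one section at a time.
# Assumption: this is run from the installation directory (waterlinedata).
# WATERLINE is a hypothetical variable, not a product setting; pointing it at
# "echo bin/waterline" prints each command instead of running it.
WATERLINE="${WATERLINE:-bin/waterline}"

profile_sections() {
  for dir in "$@"; do
    echo "Profiling ${dir}"
    # Stop at the first failure so that later sections do not run
    # unnoticed after an earlier section failed.
    if ! $WATERLINE profile "$dir"; then
      echo "Profiling failed for ${dir}" >&2
      return 1
    fi
  done
}
```

For the sandbox, the call might be: profile_sections /user/waterlinedata/Sherlock /user/waterlinedata/Landing/data.gov /user/waterlinedata/Landing/nyc_open /user/waterlinedata/Landing. Stopping at the first failure is a design choice for this sketch; an administrator might prefer to log the failure and continue with the next directory.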
You can also choose to run the discovery tasks separately from the profiling tasks. After profiling, Waterline Data Inventory shows the rich metadata and sample data for each field in each file in the cluster.
Profiled files show data quality metrics and sample data for each field
Mark landings and run lineage discovery

A key feature of Waterline Data Inventory is its ability to discover and display relationships among files, such as files that are duplicates of each other or files that contain copies of data from other files. When it discovers such a relationship, Waterline Data Inventory shows the lineage of the files with the older file as the parent of the newer file.

These lineage chains are very powerful when combined with the idea of a "landing," or the original location at which the file arrived in the cluster. The landing label is propagated through the lineage chain so each file derived from the original file shows the same origin. You'll know where your data came from, even if you are working on your third iteration of the original file.

To discover lineage relationships and to propagate the landing labels or "origins" along the lineage chain, an administrator runs a lineage job. The lineage discovery process uses data from the Waterline Data Inventory repository to determine relationships among files and the chronology of the file evolution.
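To make the idea of propagating an origin along a lineage chain concrete, here is a toy shell illustration. It is only a sketch of the concept: the file paths, the hand-written parent map, and the function names are all hypothetical, and it does not reflect how Waterline Data Inventory stores or discovers lineage.

```shell
# Illustration only: hypothetical paths and a hand-written parent map.
# This is not how Waterline Data Inventory stores or discovers lineage.

# parent_of FILE prints the older file that FILE was derived from, if any.
parent_of() {
  case "$1" in
    /demo/clean/sales.csv)      echo /demo/Landing/sales_raw.csv ;;
    /demo/reports/sales_q1.csv) echo /demo/clean/sales.csv ;;
  esac
}

# is_landing FILE succeeds when FILE is marked as a landing.
is_landing() {
  [ "$1" = /demo/Landing/sales_raw.csv ]
}

# origin_of FILE walks up the lineage chain and prints the landing it reaches.
origin_of() {
  file="$1"
  while [ -n "$file" ]; do
    if is_landing "$file"; then
      echo "$file"
      return 0
    fi
    file="$(parent_of "$file")"
  done
  return 1  # no landing found anywhere in the chain
}
```

In this toy map, the report file is two steps removed from the landing file, yet origin_of still reports the landing file as its origin, which mirrors the "third iteration of the original file" case described above.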
For the sandbox sample data (or any cluster), an administrator runs the lineage job once for all files already profiled. From the installation directory, waterlinedata:
bin/waterline runLineage
The lineage and origins appear for each file in the cluster. If Waterline Data Inventory suggests a lineage relationship that isn't accurate or doesn't capture the path you want to describe, you can reject the suggested relationships and the same lineage won't be suggested for that file in future lineage runs. In addition, you can manually build relationships by adding a parent to a given file's lineage picture.
Lineage relationships and origins show in the Lineage tab for a file
The tutorial "Tracing file sources" walks through examples of lineage display, searching using origins, and running lineage discovery jobs.
Tag the data you know

Now that users can see the wealth of file and field information, they can begin to annotate the data using "tags." Tags give users a place to record knowledge about files and fields so other users have the benefit of that knowledge. In addition, Waterline Data Inventory uses the data in tagged fields to identify other data in the cluster that may be similar.

For example, if you add the tag "product ID" to a field called PROD_ID, the next time the Waterline Data Inventory tag propagation job runs, it will find other fields in the cluster with a field name and data pattern similar to those of the PROD_ID field. When it finds similar fields, it suggests the product ID tag for those fields. Now a search returns all the similar fields, not just the originally tagged field.

In Waterline Data Inventory, you can tag folders, files, and fields. Click Add Tag or click the tag count button to open the tagging dialog box; there you can enter a new tag or quickly find an existing tag.
Click the tag count to manage tags for this field
If you have data in a field that has a specific pattern or can otherwise be described using a minimum value, a maximum value, and a regular expression, Waterline Data Inventory lets you specify that pattern as a tagging rule that will be applied to profiled data in place of the built-in tag discovery process.

After you have added tags, an administrator can run the tag propagation job on the server to have Waterline Data Inventory identify additional locations in the cluster with data that matches the tagged data. For the sandbox sample data (or any cluster), an administrator runs the tag propagation job once for all files already profiled. From the installation directory, waterlinedata:
bin/waterline tag
The suggested tag associations appear for each field in the cluster. One particularly strong method of using the suggested tags is in searches, which the next section describes. The tutorial "Leveraging tagging of familiar data" walks through examples of tagging and running tag propagation jobs.
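As an aside on the tagging rules mentioned earlier in this section, the following shell function illustrates what "a minimum value, a maximum value, and a regular expression" can express together. It is an assumption-laden sketch: the function name matches_rule and its argument order are hypothetical, it handles only integer values, and it is not Waterline Data Inventory's rule syntax or matching logic.

```shell
# Illustration only: a hypothetical check that combines a regular expression
# with minimum and maximum values. This is not Waterline Data Inventory's
# rule syntax or matching logic, and it assumes integer values.
# Usage: matches_rule VALUE MIN MAX REGEX
matches_rule() {
  # The value must match the pattern (a POSIX extended regular expression)...
  printf '%s\n' "$1" | grep -Eq "$4" || return 1
  # ...and fall within the inclusive numeric range MIN..MAX.
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}
```

For example, with made-up bounds, matches_rule 12345 10000 89999 '^[0-9]{5}$' succeeds, while the same call with 95000 fails the range check and the same call with 1234 fails the pattern check.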
Leverage discovery results in searches

Even before users tag files and fields and before tag propagation identifies related data, you have access to search parameters that let you see into the details of files and fields, including field names, the most frequent values in the fields, and data quality metrics such as field value density and cardinality. Add Waterline Data Inventory's tagging and lineage features and you have tremendous power to identify data across the cluster.

For example, in the sandbox sample data, go to Advanced Search and find the tag "Cuisine". Typing a few letters in the Tags filter box brings up that tag, which is nested under "Food Service." Select the tag and click Search.
In Advanced Search, Tags section, type a few letters of the tag name to filter the list of tags
The search results show the files that include fields tagged with the "Food Service.Cuisine" tag. Switch to the Fields view and you'll see the individual fields tagged with this tag. Add the "Tags" field to the list of columns and you can see the weight Waterline Data Inventory gave each association.
Field search results configured to show the tags with suggested tags' weights
If Waterline Data Inventory suggests a tag association that isn't accurate, you can reject the suggested tag (click the tag count to open the tagging dialog box) and the same tag won't be suggested again for that field. The tutorial "Searching across the cluster" walks through examples of searching using keywords and facets.
Bookmark files you want to follow

Find a file you expect to return to? Want to know if this file changes or if a coworker has added tags to it? Bookmarking a file or folder allows you to jump right to the file from the Bookmark menu at the top of the Waterline Data Inventory screen.
After you bookmark a file, it appears in the menu on the toolbar
In addition, Waterline Data Inventory collects notifications on files, folders, and tags you've bookmarked. All items in your bookmark list are tracked in your notifications. Notifications are displayed when a new tag is added to a folder, file, or field in a file; when the file is updated; when a file or folder is marked as a landing point; and when users generate a Hive table for the file. Click the notifications icon in the toolbar to open a short list of notifications; click See All in that menu to open a full list of notifications.
Notifications show events that happen on items in your Bookmark list
Run jobs to keep up with new data and users' tags

As new data comes into your cluster, you'll want to run Waterline Data Inventory profiling jobs to make the rich metadata for the new data available to users. In addition, you'll want to run tagging jobs to make sure that tags added to fields are propagated to new and to existing data that matches the tagged data.

Determine how often to run profiling jobs based on the amount of new data that comes into the cluster. Because you can run the jobs on a section of the cluster (by specifying one directory or one place in the hierarchy of directories), you can balance how much time is devoted to profiling for a given job.

To keep up to date with incoming data, run both profiling and lineage jobs. For example, consider running one or more profiling jobs with tagging, then lineage for the cluster. In the sample cluster, you might run profiling on a heavily used landing directory independently, then on the remaining landing directories:
bin/waterline profile /user/waterlinedata/Landing/data.gov
bin/waterline profile /user/waterlinedata/Landing
bin/waterline runLineage
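The refresh routine described in this section could be collected into one wrapper function. This is a minimal sketch, assuming it is run from the installation directory; the function name refresh_inventory and the WATERLINE variable are hypothetical. The explicit tag step is also an assumption: profiling jobs already include discovery tasks, so a separate tag run may be unnecessary in some setups.

```shell
# Sketch of a periodic refresh: profile, propagate tags, then run lineage.
# Assumptions: run from the installation directory (waterlinedata);
# WATERLINE is a hypothetical override, useful for dry runs with
# WATERLINE="echo bin/waterline".
WATERLINE="${WATERLINE:-bin/waterline}"

refresh_inventory() {
  # 1. Profile each directory given as an argument, in the order given.
  for dir in "$@"; do
    $WATERLINE profile "$dir" || return 1
  done
  # 2. Propagate users' tags to matching data (assumed step; profiling
  #    may already cover this through its discovery tasks).
  $WATERLINE tag || return 1
  # 3. Discover lineage and propagate origins for the profiled files.
  $WATERLINE runLineage || return 1
}
```

For the sample cluster, the call might be: refresh_inventory /user/waterlinedata/Landing/data.gov /user/waterlinedata/Landing. How often to run it, and with what scheduler, depends on how quickly new data arrives in your cluster.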
The Waterline Data Inventory sandbox is pre-populated with data for you to explore. The HDFS files have been profiled, and the data has some field and file tags applied and propagated throughout the cluster. The following tutorials walk you through specific exercises that further augment the sample data so you can experience the value that Waterline Data Inventory provides.

• Leveraging tagging of familiar data
• Searching across the cluster
• Tracing file sources
These tutorials are available with the sandbox images at go.waterlinedata.com/download-sandbox.