Waterline Data Inventory Sandbox for CDH 5.3 and VirtualBox
Product Version 1.2.0, Document Version 1.2.0
© 2014–2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property of their respective owners.
Table of Contents

Overview
Related Documents
System requirements
Setting up Waterline Data Inventory VM sandbox for VirtualBox
Running Waterline Data Inventory
    Opening Waterline Data Inventory in a browser
    Exploring the sample cluster
    Shutting down the cluster
Loading data into HDFS
    Using Hue to load files into HDFS
    Loading files into HDFS from a command line
Running Waterline Data Inventory jobs
Monitoring Waterline Data Inventory jobs
    Monitoring Hadoop jobs
    Monitoring local jobs
    Debugging information
Configuring additional Waterline Data Inventory functionality
    Profiling functionality
    Hive functionality
    Discovery functionality
Accessing Hive tables
    Viewing Hive tables in Hue
    Connecting to the Hive datastore
Overview

Waterline Data Inventory reveals information about the metadata and data quality of files in an Apache™ Hadoop® cluster so that users can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the collected metadata and Data Inventory's discovered relationships.

This document describes setting up a virtual machine image that is pre-configured with the Waterline Data Inventory application and sample cluster data. The image is built from the Cloudera CDH 5.3 sandbox on Oracle® VirtualBox™.
Related Documents

• Waterline Data Inventory User Guide (also available from a menu in the browser application)

For the most recent documentation and product tutorials, see www.waterlinedata.com/downloads.
System requirements

The Waterline Data Inventory sandbox is distributed inside the Cloudera CDH 5.3 sandbox. See the Cloudera system requirements here:

http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html

The Waterline Data Inventory sandbox is configured with 10 GB of physical RAM rather than the default of 4 GB. The basic requirements are as follows.

For your host computer:

• 64-bit computer that supports virtualization. VirtualBox describes the unlikely cases where your hardware may not be compatible with 64-bit virtualization: www.virtualbox.org/manual/ch10.html#hwvirt
• Operating system supported by VirtualBox, including Microsoft® Windows® (XP and later), many Linux distributions, Apple® Mac® OS X, Oracle Solaris®, and OpenSolaris™. See www.virtualbox.org/wiki/End-user_documentation
• At least 10 GB of RAM
• VirtualBox virtualization application for your operating system. Download the latest version from www.virtualbox.org
• Waterline Data Inventory VM image built on the Cloudera CDH 5.3 sandbox, VirtualBox version. www.waterlinedata.com/download

Browser compatibility:

• Microsoft Internet Explorer 10 and later (not supported on Mac OS)
• Chrome 36 or later
• Safari 6 or later
• Firefox 31 or later
Setting up Waterline Data Inventory VM sandbox for VirtualBox

1. Install VirtualBox.
2. Download the Waterline Data Inventory VM (.ova file).
3. Open the .ova file with VirtualBox (double-click the file).
4. Click Import to accept the default settings for the VM. It takes a few minutes to expand the archive and create the guest environment.
5. (Optional) Configure a way to move files between the host and guest. Because the Cloudera sandbox provides a graphical user interface, you may be able to do this with the file navigation tools it provides. Some additional options are:
   • Configure a shared directory between the host and guest (Settings > Shared Folders; specify auto-mount). From the guest, you can access the shared folder at /media/sf_<shared folder name>.
   • Set up a bidirectional clipboard (Devices > Drag-n-Drop > Bidirectional).
   • Configure an SSH connection.
6. Start the VM. It takes a few minutes for Hadoop and its components to start up.
7. When the VM comes up, you are already logged in as cloudera/cloudera.
Running Waterline Data Inventory

1. Open a terminal and switch to the dedicated Waterline Data Inventory user (waterlinedata/waterlinedata):

   su waterlinedata

   Enter the password when prompted (waterlinedata).

2. Start the embedded metadata repository database, Derby:

   cd /opt/waterlinedata
   bin/derbyStart

   Press Enter to return to the shell prompt. You'll see a response that ends with "...started and ready to accept connections on port 4444". If you see "...Address already in use," try starting Derby again. (The QuickStart image has a non-running process that is configured to use port 4444; eventually the system will allow the running process to listen on that port.)

3. Start the embedded web server, Jetty:

   bin/jettyStart

   The console fills with status messages from Jetty. Only messages identified by "ERROR" or "exception" indicate problems, even though some of the INFO and WARN messages may seem alarming.

You are now ready to use the application and its sample data.
Opening Waterline Data Inventory in a browser

The sandbox includes pre-profiled data so you can see the functionality of Waterline Data Inventory before you load your own data.

1. Open a browser (from the host or guest computer) to the Waterline Data Inventory application: http://localhost:8082
2. Sign in to Waterline Data Inventory as waterlinedata/waterlinedata.

The VM image is configured with the following additional ports that allow access to the guest:
Port    Application Component
8082    Waterline Data Inventory browser application
10000   Hive
19888   Hadoop job history
4444    Derby
8888    Hue
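From the host, you can quickly check which of these forwarded ports are reachable. A minimal sketch (the port list comes from the table above; `probe` is a hypothetical helper, not part of the product):

```python
import socket

# Forwarded guest ports from the table above
PORTS = {
    8082: "Waterline Data Inventory browser application",
    10000: "Hive",
    19888: "Hadoop job history",
    4444: "Derby",
    8888: "Hue",
}

def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for port, component in sorted(PORTS.items()):
        state = "open" if probe("localhost", port) else "closed"
        print(f"{port:5d}  {component}: {state}")
```

A "closed" result for any port usually means the corresponding service has not finished starting yet.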
Exploring the sample cluster

The Waterline Data Inventory sandbox is pre-populated with public data to simulate a set of users analyzing and manipulating the data. As you might expect among a group of users, there are multiple copies of the same data, standards for file and field names are not consistent, and data is not always wrangled into forms that are immediately useful for analysis. In other words, the data is intended to reflect reality.

Here are some entry points to help you use this sample data to explore the capabilities of Waterline Data Inventory:

Tags

Tags help you identify data that you may want to use for analysis. When you place tags on fields, Waterline Data Inventory looks for similar data across the profiled files in the cluster and suggests your tags for other fields. Use the tags you enter and the automatically suggested tags in searches and in search filtering with facets. In the sample data, look for tags for "Food Service" data.
Lineage relationships, landings, and origins

Waterline Data Inventory uses file metadata and data to identify cluster files that are related to each other. It finds copies of the same data, joins between files, and horizontal and vertical subsets of files. If you mark the places where data comes into the cluster with "Landing" labels, Waterline Data Inventory propagates this information through the lineage relationships to show the origin of the data. In the sample data, look for origins for "data.gov," "Twitter," and "Restaurant Inspections."
Note that lineage icons may not display properly in the version of Firefox provided in the Cloudera QuickStart VM image. Use a supported browser on your host computer if you encounter this problem.

Searching with facets

Use the Global Search text box at the top of the page to do keyword searches across your cluster metadata, including file and field names, tags and tag descriptions, and the 50 most frequent values in each field. Waterline Data Inventory also provides search facets on common file and field properties, such as file size and data density. Some of the most powerful facets are those for tags and origins. Use the facet lists on the Advanced Search page to identify what kind of data you want to find, then use the facets in the left pane to refine the search results further. In the sample data, use "Food Service" tags on the Advanced Search page, then filter the results by origin, such as "Restaurant Inspections".
Shutting down the cluster

To make sure you can restart the cluster cleanly, follow these steps to shut it down:

1. In a terminal window on the guest (or over your SSH connection), stop the Jetty web server and the Derby repository:

   /opt/waterlinedata/bin/jettyStop
   /opt/waterlinedata/bin/derbyStop

2. Shut down the VM: choose Virtual Machine > Shut Down. If you don't see this option, press the Option key while opening the menu.
Loading data into HDFS

Loading data into HDFS is a two-stage process: first you load data from its source, such as your local computer or a public website, to the guest file system; then you copy the data from the guest file system into HDFS.

For a small number of files, the Hadoop utility Hue makes this process easy by allowing you to select files from the host computer and copy them directly into HDFS. For larger files or large numbers of files, you may prefer a combination of a shared directory (to give the guest machine access to the files) and a command-line operation (to move files from the guest file system to HDFS).
Using Hue to load files into HDFS

To access Hue, open a browser on the host computer to http://localhost:8888 and use the File Browser. Useful operations include:

New > Directory
    Creates a new directory inside the current directory. Feel free to create additional /user directories. Note: Avoid adding directories above /user because it complicates accessing these locations from the Linux command line.

Upload > Files
    Hue allows you to use your local file system to select and upload files. Note: Avoid uploading zip files unless you are familiar with uncompressing these files from inside HDFS.

Move to Trash > Delete Forever
    "Trash" is just another directory in HDFS, so moving files to trash does not remove them from HDFS.
Loading files into HDFS from a command line

Copying files to HDFS is a two-step process requiring an SSH connection:

1. Make the data accessible from the guest machine. There are several ways to do this:
   • Configure a shared directory in the VirtualBox settings for the VM.
   • Use an SFTP client such as FileZilla or Cyberduck to copy the files to the guest.
   • Use secure copy (scp) to copy the files to the guest.
2. From a terminal, use the Hadoop file system command copyFromLocal to move files from the guest file system into HDFS.

To copy files from the host computer to HDFS on the guest:

1. If needed, create the HDFS directories into which you will copy the files, using Hue or the following command in the terminal:

   hadoop fs -mkdir <directory path>
For example, to create a new directory in the Landing directory:

   hadoop fs -mkdir /user/waterlinedata/NewStagingArea
2. Copy the files from the guest file system to the cluster using the HDFS command copyFromLocal:

   hadoop fs -copyFromLocal <source> <destination>

Examples (each command should be all on one line):

   hadoop fs -copyFromLocal /home/waterlinedata/data/ /user/waterlinedata/NewStagingArea/
   hadoop fs -copyFromLocal /media/sf_Shared/data/ /user/waterlinedata/NewStagingArea/
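If you have many files to load, you can generate the copyFromLocal commands in a loop rather than typing them individually. A minimal sketch (the paths are examples from this guide, and `copy_commands` is a hypothetical helper; it only builds the command strings so you can review them before running):

```python
import os

def copy_commands(local_dir, hdfs_dir, names):
    """Build one 'hadoop fs -copyFromLocal' command per file name."""
    return [
        f"hadoop fs -copyFromLocal {os.path.join(local_dir, n)} {hdfs_dir}/"
        for n in names
    ]

# Print the commands for review; paste them into the guest terminal to run.
for cmd in copy_commands("/media/sf_Shared/data",
                         "/user/waterlinedata/NewStagingArea",
                         ["a.csv", "b.csv"]):
    print(cmd)
```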
Running Waterline Data Inventory jobs Waterline Data Inventory format discovery and profiling jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive.
Tag propagation, lineage discovery, collection discovery, and origin propagation jobs are jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files, to suggest additional tag associations, and to propagate origin information.
Waterline Data Inventory jobs are run on a command line on the computer on which Waterline Data Inventory is installed. The jobs are started using scripts located in the bin subdirectory in the installation location. For the VM, the installation location is /opt/waterlinedata. If you are running Waterline Data Inventory jobs in a development environment, consider opening two separate command windows: one for the Jetty console output and a second to run Waterline Data Inventory jobs.
Full profiling and tag propagation: bin/waterline profile
    Performs the initial profile of your cluster; run it on a regular interval to profile new and updated files. This command triggers profiling as well as the discovery processes that use profiling data. Consider running the lineage discovery command after this command completes. You can specify a directory to profile if you want to limit the scope of the profiling job.

Profiling only: bin/waterline profileOnly
    Profiles cluster content. Use this command after you've added files to the cluster but you aren't ready to have Data Inventory suggest tags for the data.
    Example: bin/waterline profileOnly /user/Landing

Tag propagation: bin/waterline tag
    Propagates tags across the cluster. Use this command when you know that you haven't added new files but you have tags and tag associations that you want Data Inventory to consider for propagation.

Lineage discovery: bin/waterline runLineage
    Discovers lineage relationships and propagates origin information. Use this command when you have marked folders or files with origin labels and want that information propagated through the cluster. Include this command after the full profile for regular cluster profiling.
Monitoring Waterline Data Inventory jobs Waterline Data Inventory provides a record of job history in the Dashboard of the browser application.
In addition, you can follow detailed progress of each job on the console where you run the command.
Monitoring Hadoop jobs

When you run the profile command, you'll see an initial job for format discovery followed by one or more profiling jobs. At least one profiling job runs in parallel for each file type Data Inventory identifies in the format discovery pass. The console output includes a link to the job log for the running job. For example:

14/11/01 21:15:40 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1414900584622_0001/
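If you want to open the tracking link automatically, the URL can be pulled out of a console line with a short regular-expression match. A sketch (using the sample log line above as input; `tracking_url` is a hypothetical helper):

```python
import re

def tracking_url(line):
    """Extract the MapReduce tracking URL from a console log line, or None."""
    m = re.search(r"The url to track the job:\s*(\S+)", line)
    return m.group(1) if m else None

line = ("14/11/01 21:15:40 INFO mapreduce.Job: The url to track the job: "
        "http://quickstart.cloudera:8088/proxy/application_1414900584622_0001/")
print(tracking_url(line))
```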
While the job is running, you can follow this link to see the progress of the MapReduce activity. Alternatively, you can monitor the progress of these jobs using Hue in a browser. For Cloudera distributions:

http://localhost:8888/jobbrowser
You’ll need to specify the waterlinedata user.
Monitoring local jobs After the Hadoop jobs complete, Waterline Data Inventory runs local jobs to process the data collected in the repository. You can follow the progress of these jobs by watching console output in the command window in which you started the job.
Debugging information

There are multiple sources of debugging information available for Data Inventory. If you encounter a problem, collect the following information for Waterline Data support.

• Job messages on the console. Waterline Data Inventory generates console output for jobs run at the command prompt, such as:

  /opt/waterlinedata/bin/waterline profile

  If a job encounters problems, review the console output for clues. To report errors to Waterline Data support, copy this output into a text file or email so we can follow what occurred. These messages appear on the console but are also collected in a log file with debug logging level: /var/log/waterline/wds-inventory.log

• Web server console output. The embedded web server, Jetty, produces output corresponding to user interactions with the browser application. These messages appear on the console but are also collected in a log file with debug logging level: /var/log/waterline/wds-ui.log

  Use tail to see the most recent entries in the log:

  tail -f /var/log/waterline/wds-ui.log

• Lucene search indexes. In some cases, it may be useful to examine the search indexes produced by the product. These indexes are found in the following directory: /var/lib/waterline/index
• Waterline Data Inventory repository. In some cases it may be useful to examine the actual repository files produced by the product. The repository datastore is found in the following directory: /var/lib/waterline/db/waterlinedatastore
Configuring additional Waterline Data Inventory functionality Waterline Data Inventory provides a number of configuration settings and integration interfaces to enable extended functionality. The following sections describe a subset of those properties that you may find interesting for evaluating the product in the VM environment.
Profiling functionality

The following properties control how Waterline Data Inventory collects data from HDFS files.

Using samples to calculate data metrics

By default, Waterline Data Inventory uses all data in files to calculate field-level metrics such as the minimum and maximum values, the cardinality and density of the values, and the most frequent values. You can achieve better profiling performance by sampling the file data for these operations. When sampling is enabled, Waterline Data Inventory reads the first and last blocks in the file and enough other blocks to reach the sample fraction you specify. For example, with a sample fraction of 10%, Waterline Data Inventory will read 6 blocks of a 250 MB file: the first block, the last block, and 4 additional blocks chosen at random (assuming a 4096 KB block size).

[profiler.properties file]
waterlinedata.profile.sampled=false (by default)
waterlinedata.profile.sampled.fraction=0.1 (by default)
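The block arithmetic in the sampling example above can be reproduced in a few lines. This is a sketch of the calculation as described, not the product's internal code; in particular, how the sampler rounds the target block count is an assumption here:

```python
import math

def sampled_blocks(file_size_kb, block_size_kb, fraction):
    """Return (total blocks, blocks read, random blocks read) when sampling.

    The first and last blocks are always read; enough additional random
    blocks are read to reach (approximately) the requested fraction.
    """
    total = math.ceil(file_size_kb / block_size_kb)
    target = max(2, round(fraction * total))  # at least first and last
    return total, target, target - 2

# 250 MB file, 4096 KB blocks, 10% sample fraction
total, target, extra = sampled_blocks(250 * 1024, 4096, 0.1)
print(total, target, extra)  # 63 6 4
```

This matches the example in the text: 63 blocks in the file, 6 blocks read, 4 of them chosen at random.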
Re-profiling existing files

By default, Waterline Data Inventory only profiles new files or files that have changed since the last profiling job. Change the following property to false to re-profile all files in the target directory. You might choose to do this if you add data formats or change other parameters that affect the profiling data collected.

[profiler.properties file]
waterlinedata.incremental=true (by default)
Configuring additional date formats

When Waterline Data Inventory profiles string data, such as in delimited files where no type information is available, it examines the data to infer likely data types. It uses the format conventions described by the International Components for Unicode (ICU) for dates and numeric values. You can add your own date formats using the conventions described here:

http://icu-project.org/apiref/icu4j/com/ibm/icu/text/SimpleDateFormat.html

The pre-defined formats are listed in the profiler properties file:

[profiler.properties file]
waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm, EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd,yy-MM-dd HH:mm:ss,yy/M/dd,M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx
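As an illustration, the ICU pattern M/d/yy HH:mm from the list above matches a value such as 3/5/15 14:30. In Python's strptime notation that is roughly %m/%d/%y %H:%M (an approximate mapping; ICU and strptime conventions differ in details such as lenient field widths):

```python
from datetime import datetime

# ICU pattern "M/d/yy HH:mm" ~ strptime "%m/%d/%y %H:%M" (approximate)
value = "3/5/15 14:30"
parsed = datetime.strptime(value, "%m/%d/%y %H:%M")
print(parsed.isoformat())  # 2015-03-05T14:30:00
```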
Controlling most-frequent data values

Waterline Data Inventory collects 750 of the most frequent values in each field in each file. You can change the number of values collected, control how many characters are included in each sample, and set how many of these values are used in search indexes and to propagate tags.

Number of most-frequent values collected:
[profiler.properties file]
waterlinedata.profile.top_k_capacity=2000 (by default)

Size limit of strings:
[profiler.properties file]
waterlinedata.max.top_k_length=128 (by default)

Number of most-frequent values used in search indexes:
[profiler.properties file]
waterlinedata.profile.top_k=50 (by default)

Number of most-frequent values used to determine tag association matches:
[profiler.properties file]
waterlinedata.profile.top_k_tokens=100 (by default)

Number of most-frequent values shown in the user interface for a given field:
[profiler.properties file]
waterlinedata.profile.top_k_capacity_tokens=750 (by default)
Hive functionality

The following properties control interaction with Hive. For Hive connection information, see "Communication between Waterline Data Inventory and Hive."

Hive table profiling

By default, Waterline Data Inventory does not profile Hive tables: from the Hive root in the browser application, users will see Hive tables, but schema-level details for the tables are not available. To include Hive tables in Waterline Data Inventory profiling jobs, set the following option to true.

[profiler.properties file]
waterlinedata.profilehive=false (by default)
Hive table creation

Waterline Data Inventory allows users to designate a file or directory as the source for creating a Hive table in the Hive datastore associated with the cluster. In addition, you can enable Waterline Data Inventory to create a Hive table for each file in HDFS at the time Waterline Data Inventory first profiles the file. Turning on this option has a large performance impact on profiling.

[profiler.properties file]
waterlinedata.createhivetables=false (by default)
Discovery functionality

The following properties control how Waterline Data Inventory makes suggestions for lineage relationships among files and for tag associations.

Threshold for which suggestions are exposed

Waterline Data Inventory assigns a weight to its suggestions for matching tag associations. You can choose to expose more or fewer of these suggestions by configuring the cutoff weight: tag associations whose calculated weight is below this value are not exposed to users.

[discovery.properties file]
waterlinedata.discovery.tolerance.weight=40.0 (by default)
Limit on the number of pre-defined tags suggested for a given field:
waterlinedata.discovery.tags.max_suggested_ref_tables=3

Limit on the number of tags of any kind suggested for a given field:
waterlinedata.discovery.tags.max_suggested=3
© 2014 -‐ 2015 Waterline Data, Inc. All rights reserved.
17
Waterline Data Inventory
Sandbox for CDH 5.3 and VirtualBox
Eliminating weak associations

If more than one tag is suggested for a field, the tag with the highest weight is suggested; other tags must be within this value of the top tag's weight to be suggested as well.

waterlinedata.discovery.tags.value_hit_diff=20.0
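Taken together, the tolerance cutoff, the weight-difference rule, and the suggestion cap work roughly like this. This is a sketch of the filtering rules as described in this section, not the product's actual implementation; the candidate tags and weights are invented examples:

```python
def suggested_tags(candidates, tolerance=40.0, hit_diff=20.0, max_suggested=3):
    """Filter candidate (tag, weight) pairs the way the properties describe:
    drop weights below the tolerance cutoff, keep only tags within hit_diff
    of the top remaining weight, and cap the number of suggestions."""
    viable = [(t, w) for t, w in candidates if w >= tolerance]
    if not viable:
        return []
    viable.sort(key=lambda tw: tw[1], reverse=True)
    top = viable[0][1]
    return [t for t, w in viable if top - w <= hit_diff][:max_suggested]

# "phone" falls below the 40.0 cutoff; "zip" is within 20.0 of the top tag.
print(suggested_tags([("ssn", 90.0), ("zip", 75.0), ("phone", 35.0)]))  # ['ssn', 'zip']
```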
Controlling collections discovery

By default, Waterline Data Inventory only considers folders with 3 or more files (in any one folder of a recursive tree) to be candidates for a collection. You can adjust this value to better reflect the organization of your cluster. Note that there are other qualifications that must be met before the files in a folder are marked as a collection.

[discovery.properties file]
waterlinedata.discovery.smallest.collection.size=3 (by default)
Controlling lineage relationship discovery

When reviewing files for lineage relationships, Waterline Data Inventory can tolerate a number of changes to file schemas and data and still find a connection among files. These properties control the parameters used to determine a lineage relationship.

The fraction of overlapping data between fields required to consider the files matching:

waterlinedata.discovery.lineage.ovelap=0.9 (by default)
If multiple fields from one resource match the fields of another resource, Waterline Data Inventory uses field names to determine whether the fields match. This mechanism is used only if the field names are similar within the percentage indicated by this property, 0.8 (80%) by default:

waterlinedata.discovery.lineage.field_name_match=0.8
Use the HDFS last-access date to limit lineage relationship candidates. The HDFS property dfs.namenode.accesstime.precision in hdfs-site.xml must be enabled:

waterlinedata.discovery.lineage.use_access_time_filter=true
Limit the time between access of a parent file and creation of a child. This criterion is ignored (no time checking) if set to 0:

waterlinedata.discovery.lineage.batch_window_hours=24
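The batch-window rule above can be sketched as a simple predicate. This illustrates the rule with the 24-hour default and invented timestamps; it is not the product's code:

```python
from datetime import datetime, timedelta

def within_batch_window(parent_accessed, child_created, window_hours=24):
    """True if the child was created within window_hours after the parent
    was last accessed; a window of 0 disables the check entirely."""
    if window_hours == 0:
        return True
    delta = child_created - parent_accessed
    return timedelta(0) <= delta <= timedelta(hours=window_hours)

accessed = datetime(2015, 1, 10, 8, 0)
print(within_batch_window(accessed, datetime(2015, 1, 10, 20, 0)))  # True (12 hours)
print(within_batch_window(accessed, datetime(2015, 1, 12, 8, 0)))   # False (48 hours)
```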
Accessing Hive tables

Waterline Data Inventory makes it easy to create Hive tables from files in your cluster. You can access the Hive instance on the guest through Hue or by connecting to Hive from third-party query or analysis tools.
Viewing Hive tables in Hue

You can access the Hive tables in your cluster through Hue using the Beeswax query tool:

http://localhost:8888/beeswax
Connecting to the Hive datastore

To access Hive tables from Tableau, Qlik Sense, or another analysis tool, you'll need to configure a connection to the Hive datastore on the cluster. For a Waterline Data-supplied cluster, use the following connection information:
Parameter        Value
Server           Same server IP address as you use for Waterline Data Inventory
Port             10000
Server Type      HiveServer2
Authentication   Username and Password
Username         waterlinedata
Password         waterlinedata
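For tools that take a JDBC connection string rather than individual fields, these parameters combine into a standard HiveServer2 URL. A sketch (the IP address is an example; substitute your VM's address, and supply the waterlinedata username and password separately in your client tool):

```python
def hive_jdbc_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL from the connection parameters above."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# Example: a VM reachable at 192.168.56.101 (hypothetical address)
print(hive_jdbc_url("192.168.56.101"))  # jdbc:hive2://192.168.56.101:10000/default
```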