Waterline Data Inventory
Sandbox Setup Guide for HDP 2.4 and VirtualBox
Product Version 2.5
Document Version 6.15.2015
© 2014 - 2016 Waterline Data, Inc. All rights reserved. All other trademarks are the property of their respective owners.
Table of Contents

Overview
Related Documents
System requirements
Setting up the sandbox
Opening Waterline Data Inventory in a browser
Running Waterline Data Inventory
Exploring the sample cluster
Shutting down the cluster
Accessing the Hadoop cluster using SSH
Loading data into HDFS
Overview

Waterline Data Inventory reveals information about the metadata and data quality of files in an Apache™ Hadoop® cluster so that users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the collected metadata and Waterline Data Inventory's discovered relationships.

This document describes running Waterline Data Inventory in a virtual machine image that is pre-configured with the Waterline Data Inventory application and sample cluster data. The image is built from the Hortonworks™ HDP 2.4 sandbox on Oracle® VirtualBox™.
Related Documents

• Waterline Data Inventory User Guide (also available from the menu in the browser application)
For the most recent documentation and product tutorials, sign in to Waterline Data Inventory support (support.waterlinedata.com) and go to "Product Downloads, Documentation, and Tutorials".
System requirements

The Waterline Data Inventory sandbox is delivered inside the Hortonworks HDP 2.4 sandbox. The system requirements and installation instructions are the same as Hortonworks describes: hortonworks.com/products/sandbox/#install

The Waterline Data Inventory sandbox is configured with 8 GB of physical RAM rather than the default of 4 GB. The basic requirements are as follows.

For your host computer:

• At least 10 GB of RAM
• A 64-bit computer that supports virtualization (a quick check is sketched after this list). VirtualBox describes the unlikely cases where your hardware may not be compatible with 64-bit virtualization: www.virtualbox.org/manual/ch10.html#hwvirt
• An operating system supported by VirtualBox, including Microsoft® Windows® (XP and later), many Linux distributions, Apple® Mac® OS X, Oracle Solaris®, and OpenSolaris™: www.virtualbox.org/wiki/End-user_documentation
• The VirtualBox virtualization application for your operating system. Download the latest version from www.virtualbox.org
• The Waterline Data Inventory VM image built on the Hortonworks HDP 2.4 sandbox, VirtualBox version: www.waterlinedata.com/downloads
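If you want to verify virtualization support before installing anything, one quick heuristic on a Linux host is to count the relevant CPU flags (Intel VT-x reports vmx, AMD-V reports svm); on Windows or Mac hosts, use the tools your CPU vendor provides instead:

$ egrep -c '(vmx|svm)' /proc/cpuinfo
# A result greater than 0 means the CPU advertises hardware virtualization.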
Browser compatibility:

• Microsoft Internet Explorer 10 and later (not supported on Mac OS)
• Chrome 36 or later
• Safari 6 or later
• Firefox 31 or later
Setting up the sandbox

1. Install VirtualBox.
2. Download the Waterline Data Inventory VM (.ova file).
3. Open the .ova file with VirtualBox (double-click the file).
4. Click Import to accept the default settings for the VM. It takes a few minutes to expand the archive and create the guest environment.
5. (Optional) Configure a way to easily move files between the host and guest. Some options are:
   • Configure a shared directory between the host and guest (Settings > Shared Folders; specify auto-mount). From the guest computer, you can access the shared folder at /media/sf_<shared folder name>.
   • Set up a bidirectional clipboard.
6. Start the VM. It takes a few minutes for Hadoop and its components to start up.
7. Note the IP address used for SSH access, such as 127.0.0.1, so that you can log into the guest machine through SSH as waterlinedata/waterlinedata.
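If you prefer to drive VirtualBox from the command line, the VBoxManage tool can perform the import, shared-folder setup, and startup. This is a minimal sketch; the .ova file name and the VM name are placeholders, so substitute the names you see on your system (VBoxManage list vms shows the imported name):

$ VBoxManage import WaterlineDataInventory.ova
$ VBoxManage list vms
# Add an auto-mounted shared folder, then start the VM:
$ VBoxManage sharedfolder add "Waterline Data Inventory" --name shared --hostpath /path/to/host/folder --automount
$ VBoxManage startvm "Waterline Data Inventory"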
Opening Waterline Data Inventory in a browser

The sandbox includes pre-profiled data so you can see the functionality of Waterline Data Inventory before you load your own data.

1. Open a browser to the Waterline Data Inventory application:
   http://localhost:8082
   or
   http://<guest IP address>:8082
2. Sign in to Waterline Data Inventory as "waterlinedata", password "waterlinedata".
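If the page does not load, a quick way to tell whether anything is listening on the application port is a plain HTTP probe from the host (a generic check, not a Waterline-specific tool):

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8082
# 200 (or a redirect such as 302) means the web application is responding;
# 000 or "connection refused" means it has not started yet (see the next section).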
Running Waterline Data Inventory

If for some reason the browser application does not appear, you may need to sign into the guest and start Waterline Data Inventory manually. If so, follow these steps:

1. Start an SSH session.
   (Mac OS X) Open a terminal or command prompt on the host and connect to the guest:
   $ ssh waterlinedata@127.0.0.1 -p2222
   Enter the password when prompted ("waterlinedata").
   (Windows) Start an SSH client such as PuTTY and identify the connection parameters:
   • Host Name: the guest IP address (from step 7 above).
   • Protocol: SSH
   Log in using username "waterlinedata" and password "waterlinedata".
2. You may be prompted to continue connecting though the authenticity of the host cannot be established. Enter yes.
3. Start the embedded metadata repository database, Derby:
   $ cd /opt/waterlinedata/bin
   $ ./waterline serviceStart
The console fills with status messages from the Jetty web server. Only messages identified by "ERROR" or "exception" indicate problems. You are now ready to use the application and its sample data.
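Because the startup output is verbose, one option is to capture it to a file and filter for those markers. A plain shell sketch using the same serviceStart command; the log path is arbitrary:

$ ./waterline serviceStart 2>&1 | tee /tmp/waterline-startup.log
# In a second terminal or SSH session, show only the suspect lines:
$ grep -Ei "error|exception" /tmp/waterline-startup.log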
Exploring the sample cluster

The Waterline Data Inventory sandbox is pre-populated with public data to simulate a set of users analyzing and manipulating the data. As you might expect among a group of users, there are multiple copies of the same data, standards for file and field names are not consistent, and data is not always wrangled into forms that are immediately useful for analysis. In other words, the data is intended to reflect reality. Here are some entry points to help you use this sample data to explore the capabilities of Waterline Data Inventory:
Tags

Tags help you identify data that you may want to use for analysis. When you place tags on fields, Waterline Data Inventory looks for similar data across the profiled files in the cluster and suggests your tags for other fields. Use both the tags you enter and the automatically suggested tags in searches and in search filtering with facets.

In the sample data, look for tags for "Food Service" data.
Lineage relationships, landings, and origins

Waterline Data Inventory uses file metadata and data to identify cluster files that are related to each other. It finds copies of the same data, joins between files, and horizontal and vertical subsets of files. If you mark the places where data comes into the cluster with "Landing" labels, Waterline Data Inventory propagates this information through the lineage relationships to show the origin of the data.
In the sample data, look for origins for "data.gov," "Twitter," and "Restaurant Inspections."
Searching with facets

Use the Global Search text box at the top of the page to do keyword searches across your cluster metadata, including file and field names, tags and tag descriptions, and 50 examples of the most frequent data values in each field. Waterline Data Inventory also provides search facets on common file and field properties, such as file size and data density. Some of the most powerful facets are those for tags and origins. Use the facet lists on the Advanced Search page to identify what kind of data you want to find. Then use facets in the left pane to refine the search results further.

In the sample data, use "Food Service" tags on the Advanced Search page, then filter the results by origin, such as "Restaurant Inspections".
Shutting down the cluster

To make sure you can restart the cluster cleanly, follow these steps to shut it down:

1. Shut down the cluster. Choose Machine > Close > ACPI Shut Down. If you don't see this option, press the Option key while opening the menu.
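If you manage the VM from the host command line rather than the VirtualBox window, the same ACPI signal can be sent with VBoxManage; the VM name is a placeholder for whatever VBoxManage list vms reports on your system:

$ VBoxManage controlvm "Waterline Data Inventory" acpipowerbutton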
Accessing the Hadoop cluster using SSH

To run Waterline Data Inventory jobs and to upload files in bulk to HDFS, you will want to access the guest machine using a command prompt or terminal on your host computer through a Secure Shell (SSH) connection. Alternatively, you can use the terminal in the guest VirtualBox window, but that can be awkward.

1. Start an SSH session.
   (Mac OS X) In a terminal window, start an SSH session using the IP address provided for the guest instance (step 7 of "Setting up the sandbox") and the username waterlinedata, all lower case:
   $ ssh waterlinedata@<guest IP address> -p2222
   or
   $ ssh waterlinedata@localhost -p2222
   (Windows) Start an SSH client such as PuTTY and identify the connection parameters:
   • Host name: the guest IP address (step 7 of "Setting up the sandbox").
   • Protocol: SSH
   Log in using username "waterlinedata" and password "waterlinedata".
2. You may be prompted to continue connecting though the authenticity of the host cannot be established. Enter yes.
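If you connect frequently, an entry in your OpenSSH client configuration on the host saves retyping the username and port; the alias waterline-sandbox is arbitrary:

# ~/.ssh/config
Host waterline-sandbox
    HostName localhost
    Port 2222
    User waterlinedata

With this in place, the connection shortens to:

$ ssh waterline-sandbox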
Loading data into HDFS

Loading data into HDFS is a two-stage process: first you load data from its source (such as your local computer or a public website) to the guest file system. Then you copy the data from the guest file system into HDFS.

For a small number of files, the Hadoop utility Hue makes this process very easy by allowing you to select files from the host computer and copy them directly into HDFS. For larger files or large numbers of files, you may decide to use a combination of an SSH client (to move files to the guest machine) and a command-line operation (to move files from the guest file system to HDFS). If you have a shared directory configured between the host and guest, you can access the files directly from the guest.
Using Hue to load files into HDFS

To access Hue, open a browser on the host computer to:
http://<guest IP address>:<Hue port>

The following Hue operations are useful:

New > Directory
Create a new directory inside the current directory. Feel free to create additional /user directories. Note: Avoid adding directories above /user because it complicates accessing these locations from the Linux command line.

Upload > Files
Hue allows you to use your local file system to select and upload files. Note: Avoid uploading zip files unless you are familiar with uncompressing these files from inside HDFS.

Move to Trash > Delete Forever
"Trash" is just another directory in HDFS, so moving files to trash does not remove them from HDFS.
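For reference, the same cleanup can be done from an SSH connection using standard Hadoop file system commands (the example path is hypothetical):

# Move a file to trash (recoverable until trash is emptied):
$ hadoop fs -rm /user/waterlinedata/old-data.csv
# Delete immediately, bypassing trash:
$ hadoop fs -rm -skipTrash /user/waterlinedata/old-data.csv
# Empty the current user's trash:
$ hadoop fs -expunge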
Loading files into HDFS from a command line

Copying files to HDFS is a two-step process requiring an SSH connection:

1. Make the data accessible from the guest machine. There are several ways to do this:
   • Use an SSH client such as PuTTY, FileZilla, or CyberDuck.
   • Use secure copy (scp).
   • Configure a shared directory in the VirtualBox settings for the VM.
2. From inside an SSH connection, use the Hadoop file system command copyFromLocal to move files from the guest file system into HDFS.

The following steps describe using scp to copy files into the guest. Skip to step 5 if you chose to use a GUI SSH client to copy the files. These instructions have you use separate terminal windows or command prompts to access the guest machine using two methods:

• (Guest) indicates the terminal window or command prompt with an open SSH connection.
• (Host) indicates the terminal window or command prompt that uses scp directly.
To copy files from the host computer to HDFS on the guest:

1. (Guest) Open an SSH connection to the guest. See "Accessing the Hadoop cluster using SSH".
2. (Guest) Create a staging location for your data on the guest file system. The SSH connection working directory is /home/waterlinedata. From here, create a directory for your staged data:
   $ mkdir data
3. (Guest) If needed, create HDFS directories into which you will copy the files. Create the directories using Hue or using the following command inside an SSH connection:
   $ hadoop fs -mkdir <new HDFS directory path>
   For example, to create a new staging directory under /user/waterlinedata:
   $ hadoop fs -mkdir /user/waterlinedata/NewStagingArea
4. (Host) In a separate terminal window or command prompt, copy directories or files from host to guest. Navigate to the location of the data to copy on the host and run the scp command (note that scp uses an uppercase -P for the port):
   $ scp -P 2222 -r ./<directory or file> waterlinedata@<guest IP address>:<target path on guest>
   For example (all on one line):
   $ scp -P 2222 -r ./NewData waterlinedata@localhost:/home/waterlinedata/data
   or
   $ scp -P 2222 -r ./NewData waterlinedata@127.0.0.1:/home/waterlinedata/data
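For repeated transfers, rsync over SSH is an alternative to scp that copies only changed files; this sketch assumes rsync is installed on both the host and the guest:

$ rsync -av -e "ssh -p 2222" ./NewData waterlinedata@localhost:/home/waterlinedata/data/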
5. (Guest) Back in the SSH terminal window or command prompt, copy the files from the guest file system to the cluster using the HDFS command copyFromLocal. Navigate to the location of the data files you copied in step 4 and copy the files into HDFS using the following command:
   $ hadoop fs -copyFromLocal <local source path> <HDFS target path>
   For example (all on one line):
   $ hadoop fs -copyFromLocal /home/waterlinedata/data/ /user/waterlinedata/NewStagingArea/
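To confirm the files landed where you expect, list the target directory:

$ hadoop fs -ls /user/waterlinedata/NewStagingArea/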