Waterline Data Inventory Sandbox for MapR 4.0 and VMWare Product Version 1.1.0 Document Version 1.0
© 2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property of their respective owners.
Waterline Data Inventory
Sandbox for MapR 4.0 and VMWare
Table of Contents
Overview
Related Documents
System requirements
Setting up Waterline Data Inventory VM sandbox for VMWare
  Running Waterline Data Inventory
  Opening Waterline Data Inventory in a browser
  Exploring the sample cluster
Accessing the Hadoop cluster using SSH
Loading data into MapR-FS
  Using Hue to load files into MapR-FS
  Loading files into MapR-FS from a command line
Running Waterline Data Inventory jobs
Monitoring Waterline Data Inventory jobs
  Monitoring Hadoop jobs
  Monitoring local jobs
  Debugging information
Configuring additional Waterline Data Inventory functionality
  Profiling functionality
  Hive functionality
  Discovery functionality
Accessing Hive tables
  Viewing Hive tables in Hue
  Connecting to the Hive datastore
© 2014 - 2015 Waterline Data, Inc. All rights reserved.
Overview
Waterline Data Inventory reveals information about the metadata and data quality of files in an Apache™ Hadoop® cluster so that users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in MapR-FS and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag files and Hive tables using the collected metadata and the relationships that Data Inventory discovers.

This document describes setting up a virtual machine image that is pre-configured with the Waterline Data Inventory application and sample cluster data. The image is built from the MapR Technologies™ MapR 4.0 sandbox for VMWare® Player™ or VMWare Fusion®.
Related Documents
• Waterline Data Inventory User Guide (also available from the menu in the browser application)
For the most recent documentation and product tutorials, see the documents available where you download the product and VM images: waterlinedata.com/downloads.
System requirements
The Waterline Data Inventory sandbox is available inside the MapR 4.0 sandbox. The system requirements and installation instructions are the same as MapR describes:
  doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop
The Waterline Data Inventory sandbox is configured with 10 GB of physical RAM rather than the default of 4 GB. The basic requirements are as follows.

For your host computer:
• 64-bit computer that supports virtualization. VMWare describes the unlikely cases where your hardware may not be compatible with 64-bit virtualization:
  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003945
• Operating system supported by VMWare Player, including Microsoft® Windows® (XP and later), or by VMWare Fusion, including Apple® Mac® OS X.
• At least 10 GB of RAM.
• VMWare virtualization application for your operating system. Download the latest version from here:
  Player (Windows): my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0
  Fusion (Mac): www.vmware.com/products/fusion/
• Waterline Data Inventory VM image built on MapR 4.0 sandbox, VMWare version:
  waterlinedata.com/downloads
Browser compatibility
• Microsoft Internet Explorer 10 and later (not supported on Mac OS)
• Chrome 36 or later
• Safari 6 or later
• Firefox 31 or later
Setting up Waterline Data Inventory VM sandbox for VMWare
1. Install VMWare.
2. Download the Waterline Data Inventory VM (.ova file).
3. Open the .ova file with VMWare (double-click the file).
4. Click Import to accept the default settings for the VM. It takes a few minutes to expand the archive and create the guest environment.
5. (Optional) Configure a way to easily move files between the host and guest. Some options are:
   • Configure a shared directory between the host and guest (Settings > Shared Folders, specify auto-mount).
   • Set up a bi-directional clipboard.
   • For MapR, mount the cluster file system via NFS (see the MapR documentation topic "Accessing Data with NFS").
6. Start the VM. It takes a few minutes for Hadoop and its components to start up.
7. Note the IP address used for SSH access, such as 127.0.0.1 or "maprdemo".
8. Log in as waterlinedata/waterlinedata.
Running Waterline Data Inventory
1. Open a terminal or command prompt on the host and connect to the guest.
     ssh waterlinedata@maprdemo -p2222
   Enter the password when prompted ("waterlinedata").
2. Start the embedded metadata repository database, Derby.
     cd waterlinedata
     bin/derbyStart
   Press Enter to return to the shell prompt.
3. Start the embedded web server, Jetty.
     bin/jettyStart
   The console fills with status messages from Jetty. Only messages identified by "ERROR" or "exception" indicate problems.
You are now ready to use the application and its sample data.
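Because the Jetty console is noisy, it can help to filter a saved copy of its output for the two markers mentioned above. The following is a minimal sketch, assuming you have captured the console output in a file; the function name jetty_problems and the file name jetty.log are hypothetical and are not part of Waterline Data Inventory.

```shell
# Hypothetical helper (not shipped with the product):
# print only the lines of a captured Jetty log that mention "ERROR" or "exception".
jetty_problems() {
  # -i matches any letter case (so "NullPointerException" is caught);
  # -E enables the "a|b" alternation syntax.
  grep -iE 'ERROR|exception' "$1"
}
```

For example, jetty_problems jetty.log prints the matching lines and returns a non-zero status when there are none. The match is deliberately broad, so a harmless line that happens to contain the word "error" is also shown.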
Opening Waterline Data Inventory in a browser
The sandbox includes pre-profiled data so you can see the functionality of Waterline Data Inventory before you load your own data.
1. Open a browser to the Waterline Data Inventory application:
     http://maprdemo:8082
   or
     http://:8082
2. Sign in to Waterline Data Inventory using any of the Linux users configured for your system, including "waterlinedata".
3. The VM image is configured with the following additional ports that allow access to the guest:
Port    Application Component
8082    Waterline Data Inventory browser application
10000   Hive
19888   Hadoop job history
4444    Derby
8888    Hue
8443    MapR MCS
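To confirm from the host that these ports are reachable once the VM is running, a loop like the following can help. This is a sketch under stated assumptions: it relies on the /dev/tcp redirection that is built into bash (it does not work in plain sh), it assumes the name maprdemo resolves on your host, and the function names port_open and check_sandbox_ports are hypothetical. It has no timeout, so a filtered port may make it wait.

```shell
# Requires bash: /dev/tcp/<host>/<port> is a bash-only redirection feature.
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
port_open() {
  local host=$1 port=$2
  # Open the connection in a subshell so the descriptor closes when it exits.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Report the state of each port listed in the table above.
check_sandbox_ports() {
  local host=${1:-maprdemo} entry port name
  for entry in "8082:Waterline Data Inventory" "10000:Hive" \
               "19888:Hadoop job history" "4444:Derby" \
               "8888:Hue" "8443:MapR MCS"; do
    port=${entry%%:*}   # text before the first colon
    name=${entry#*:}    # text after the first colon
    if port_open "$host" "$port"; then
      echo "$port open ($name)"
    else
      echo "$port closed ($name)"
    fi
  done
}
```

Run check_sandbox_ports with no argument to test maprdemo, or pass another host name or address, for example check_sandbox_ports 127.0.0.1. Ports for Derby and Jetty report as closed until you have started those services.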
Exploring the sample cluster
The Waterline Data Inventory sandbox is pre-populated with public data to simulate a set of users analyzing and manipulating the data. As you might expect among a group of users, there are multiple copies of the same data, standards for file and field names are not consistent, and data is not always wrangled into forms that are immediately useful for analysis. In other words, the data is intended to reflect reality. Here are some entry points to help you use this sample data to explore the capabilities of Waterline Data Inventory:

Tags
Tags help you identify data that you may want to use for analysis. When you place tags on fields, Waterline Data Inventory looks for similar data across the profiled files in the cluster and suggests your tags for other fields. Use the tags you enter and automatically suggested tags in searches and search filtering with facets. In the sample data, look for tags for "Food Service" data.
Lineage relationships, landings, and origins
Waterline Data Inventory uses file metadata and data to identify cluster files that are related to each other. It finds copies of the same data, joins between files, and horizontal and vertical subsets of files. If you mark the places where data comes into the cluster with "Landing" labels, Waterline Data Inventory propagates this information through the lineage relationships to show the origin of the data. In the sample data, look for origins for "data.gov," "nyc_open," "Restaurant Inspections," and "Twitter."
Searching with facets
Use the Global Search text box at the top of the page to do keyword searches across your cluster metadata, including file and field names, tags and tag descriptions, and 50 examples of the most frequent data in each field. Waterline Data Inventory also provides search facets on common file and field properties, such as file size and data density. Some of the most powerful facets are those for tags and origins. Use the facet lists on the Advance Search page to identify what kind of data you want to find. Then use facets in the left pane to refine the search results further. In the sample data, use "Food Service" tags in the Advance Search page, then filter the results by origin, such as "Restaurant Inspections".
Accessing the Hadoop cluster using SSH
To run Waterline Data Inventory jobs and to upload files in bulk to MapR-FS, you will want to access the guest machine from a command prompt or terminal on your host computer through a Secure Shell (SSH) connection. Alternatively, you can use the terminal in the guest VMWare window, but that can be awkward.
1. In a terminal window (Mac) or command prompt (Windows), start an SSH session using the IP address provided for the guest instance and the username waterlinedata, all lower case:
     ssh waterlinedata@maprdemo -p2222
   or
     ssh waterlinedata@localhost -p2222
2. You may be asked whether to continue connecting even though the authenticity of the host cannot be established. Enter yes.
3. Enter the waterlinedata user password, "waterlinedata".
Loading data into MapR-FS
Loading data into MapR-FS is a two-stage process. First, you load data from its source (such as your local computer or a public website) to the guest file system. Then you copy the data from the guest file system into MapR-FS. For a small number of files, the Hadoop utility Hue makes this process very easy by allowing you to select files from the host computer and copy them directly into MapR-FS. For larger files or large numbers of files, you may decide to use a combination of an SSH client (to move files to the guest machine) and a command-line operation (to move files from the guest file system to MapR-FS). If you have a shared directory configured between the host and guest, you can access the files directly from the guest.
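The two-stage copy described above can be sketched as a pair of commands. This is an illustration under stated assumptions, not the procedure this guide prescribes: the helper name print_load_commands and the staging and target directories in the example are hypothetical, and it uses the standard OpenSSH scp command and the standard Hadoop hadoop fs -put command. The helper only prints the commands so that you can review them before running anything, and it does not handle paths that contain spaces.

```shell
# Hypothetical dry-run helper: print the two commands for a two-stage load.
# Arguments: local file, staging directory on the guest, target directory in the cluster.
print_load_commands() {
  file=$1
  staging=$2
  target=$3
  name=$(basename "$file")
  # Stage 1: copy from the host to the guest file system (SSH port 2222, as used in this guide).
  echo "scp -P 2222 $file waterlinedata@maprdemo:$staging/"
  # Stage 2: on the guest, copy from the guest file system into the cluster file system.
  echo "ssh -p 2222 waterlinedata@maprdemo hadoop fs -put $staging/$name $target/"
}
```

For example, print_load_commands ./data/sales.csv /home/waterlinedata/staging /user/waterlinedata/landing prints one scp line and one ssh line. Note that scp takes an uppercase -P for the port while ssh takes a lowercase -p.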
Using Hue to load files into MapR-FS
To access Hue from a browser on the host computer:
  http://: