
Domain Discovery Tool
Visualization and Data Analysis Lab (NYU)

Yamuna Krishnamurthy ([email protected])

Juliana Freire ([email protected])

September 4, 2015

1 Motivation

The wide availability of data on the Web has been a valuable asset for many applications. But it has also made it hard to find specific kinds of information. Search engines, such as Google and Bing, make use of massive computing power to both crawl the Web and create the search indexes, which currently cover hundreds of billions of documents. Because they aim to maximize coverage and breadth, queries to these systems often return a very large number of results, including many that are of little relevance to a user's information needs. Sifting through these results is time consuming, and manually refining queries to focus the search is challenging.

Figure 1: Domain Discovery Tool: Architecture

The Domain Discovery Tool (DDT), whose architecture is shown in Figure 1, is an interactive system that helps users explore and better understand a domain (or topic) as it is represented on the Web. It achieves this by integrating human insights with machine computation (data mining and machine learning) through visualization. DDT allows a domain expert to visualize and analyze pages returned by a search engine or a crawler, and to easily provide feedback about their relevance. This feedback, in turn, can be used to address two challenges:

• Guide users in the process of domain understanding and help them construct effective queries to be issued to a search engine; and


• Configure focused crawlers that efficiently search the Web for additional pages on the topic. DDT allows users to quickly select crawling seeds, as well as the positive and negative examples required to create a page classifier for the focus topic.

In this pilot study, our goal is to use DDT to bootstrap focused Web search. We note, however, that DDT can also be used to explore the pages retrieved by a focused crawler (see Figure 1). We are currently extending the tool for this purpose.

2 Focused Search

The search component of DDT, shown in Figure 2, leverages general-purpose search engines, such as Google and Bing, that already index a large portion of the Web. A user starts the search process by posing queries that consist of terms relevant to the topic of interest. These queries are issued to Google or Bing, and a representative set of results is retrieved, say, the top 1,000 pages. The search engines, however, return a diverse set of results that are most relevant to the query terms, but not necessarily to the particular topic of interest. An important goal of DDT is to make it easy for users to provide feedback on the relevance of the pages. Users can mark individual pages as relevant or irrelevant. The tool also clusters similar pages together and allows users to interact with the results through the multidimensional scaling visualization shown in Figure 2: a user can interactively select clusters (or sets of pages) and tag multiple pages simultaneously.
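To make this concrete, below is a minimal sketch of one such query round-trip, assuming the Bing Web Search v7 REST API (the endpoint, header, and response fields belong to that API, an API key is required, and DDT's actual integration may differ):

    import requests

    ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"

    def web_search(query, api_key, total=200, page_size=50):
        """Page through results to collect a representative top-N set of URLs."""
        urls = []
        for offset in range(0, total, page_size):
            resp = requests.get(
                ENDPOINT,
                headers={"Ocp-Apim-Subscription-Key": api_key},
                params={"q": query, "count": page_size, "offset": offset},
            )
            resp.raise_for_status()
            hits = resp.json().get("webPages", {}).get("value", [])
            if not hits:  # no further results available
                break
            urls.extend(hit["url"] for hit in hits)
        return urls

The returned URLs can then be fetched and stored in the data source described in Section 3.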

Figure 2: Domain Discovery Tool: User interface components

The tags assigned to the pages are stored in a database and used as follows:

1. Relevant and irrelevant pages are further processed to extract additional topic-related terms and phrases, which can be used to learn more about the topic and refine queries. The extracted terms and phrases are then issued to Google or Bing either as new queries or appended to the old query. New queries diversify the search to other related subtopics, covering different areas of the Web; appending to existing queries focuses the search more tightly on the topic, which is especially useful when the initial queries are ambiguous.

2. Relevant and irrelevant pages are used to build a page classifier for the topic of interest (see the sketch after this list). This classifier is a key component of a focused crawler, and can also be used to filter pages obtained from search engines.

3. The relevant pages can serve as seeds for a focused crawl.
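As an illustration of item 2, here is a minimal sketch of training such a page classifier from the tagged pages, assuming the pages are available as plain text; scikit-learn is used purely for illustration and is not necessarily what DDT uses internally:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_page_classifier(tagged_pages):
        """tagged_pages: list of (page_text, tag), tag in {'relevant', 'irrelevant'}."""
        texts, tags = zip(*tagged_pages)
        labels = [1 if tag == "relevant" else 0 for tag in tags]
        clf = make_pipeline(
            TfidfVectorizer(stop_words="english"),
            LogisticRegression(max_iter=1000),
        )
        clf.fit(list(texts), labels)
        return clf

    def is_relevant(clf, page_text, threshold=0.5):
        """Filter rule: keep a page if its predicted probability of relevance is high."""
        return clf.predict_proba([page_text])[0, 1] >= threshold

The same model can score pages returned by the search engines (filtering) or pages fetched by a focused crawler (guiding the crawl frontier).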

3 DDT Search Component: Detailed Description

The following are the core features and functionality of DDT's focused search interface, as shown in Figure 2.

Data Source
A data source, corresponding to a domain, stores the pages downloaded by DDT as a result of queries issued to search engines or crawled by a focused crawler. Currently, DDT is configured to connect to an Elasticsearch engine, which indexes the pages. Domains can be selected from the drop-down list on the menu (see the top-right corner of Figure 2). New domains can be added through 'Add Domain' in the same list.
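To illustrate how such a data source might be accessed, below is a minimal sketch using the official elasticsearch-py client (8.x style); the index name, field names, and one-index-per-domain layout are assumptions, not DDT's actual schema:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def store_page(domain, url, text):
        """Index a downloaded page; the page URL doubles as the document id."""
        es.index(index=f"ddt_{domain}", id=url, document={"url": url, "text": text})

    def filter_pages(domain, terms, cap=100):
        """Return up to `cap` pages whose text matches the given terms
        (anticipating the Filter feature described later in this section)."""
        resp = es.search(
            index=f"ddt_{domain}",
            query={"match": {"text": " ".join(terms)}},
            size=cap,
        )
        return [hit["_source"] for hit in resp["hits"]["hits"]]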
Web Query
A unique feature of the DDT search component is the capability to query the Web using Google or Bing. This leverages the pages already crawled by Google and Bing to discover interesting on-topic pages across the Web using on-topic query terms. The search is performed by entering query terms in the 'Query terms:' text box. The results are stored in the selected data source for later analysis and for use as seeds for focused crawlers.

Downloaded Pages Summary
This pane displays a summary of the number of pages downloaded, either as a result of a Web query or of crawling by a focused crawler. When the filter, explained below, is not applied, the numbers correspond to the total pages stored in the data source; when it is applied, they correspond to the filtered subset. The statistics shown are as follows:

• Crawled pages: number of pages crawled.
• Relevant pages: number of pages marked as relevant.
• Irrelevant pages: number of pages marked as irrelevant.
• Neutral pages: pages not yet marked as relevant or irrelevant.
• New pages: number of pages being downloaded in the background, indicating that there are new pages yet to be analyzed.

Update Pages
Since downloading a large number of pages takes significant time, DDT does this in the background. The New pages field in the Downloaded Pages Summary indicates the number of pages downloaded since the last refresh; these pages have not yet been analyzed. The Update Pages button adds the most recently downloaded pages to the multidimensional scaling view. The last-updated date/time reflects the most recently downloaded page.

Multidimensional Scaling Visualization
Another unique component of DDT is the multidimensional scaling visualization. It provides a 2-dimensional representation of the downloaded pages, based on clustering algorithms, that shows the similarity or dissimilarity of the pages. This in turn facilitates analyzing, and marking as relevant or irrelevant, groups of pages rather than each individual page, thereby considerably reducing analysis (and labeling) time. Currently, clustering is done using principal component analysis (PCA) [3] with TF-IDF [2] vectors of the page terms; other clustering algorithms can be plugged in easily. The visualization window allows zooming in/out and dragging while the 'z' key is pressed.
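A minimal sketch of the projection behind this view, with scikit-learn standing in for DDT's actual implementation, computes PCA [3] over TF-IDF [2] vectors of the page text:

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    def project_pages(page_texts):
        """Map each page's text to an (x, y) coordinate for the scatter plot."""
        tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
        X = tfidf.fit_transform(page_texts)  # sparse TF-IDF matrix
        return PCA(n_components=2).fit_transform(X.toarray())

Nearby points in the resulting scatter plot correspond to pages with similar term distributions, which is what makes group selection and tagging effective.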
Details of Selected Pages
To allow more detailed analysis of the downloaded pages, DDT dynamically creates snippets that display the most relevant image and text extracted from the selected pages. Pages can be selected with a free lasso selection in the multidimensional scaling visualization window. If the image and text snippets are not sufficiently informative, a 'shift+click' displays part of the page in a pop-up window; a subsequent 'shift+click' opens the page in a Web browser.

Page Tagging
Selected pages can be tagged as 'Relevant' or 'Irrelevant' by clicking the corresponding button under the scaling window, and untagged with the untag button. The tags for the corresponding pages are stored in the Elasticsearch index. The stored tags are used to compute the pages summary by aggregating the relevant, irrelevant, and neutral pages. The tags are also used to build a page classifier for the focused crawler.

Filter
Sometimes the user wants to see just a subset of the downloaded pages. Filtering the pages of the data source by specified terms contained in their text allows analyzing subsets of the data. The Cap field sets the maximum number of pages returned after filtering, and the 'From' and 'To' dates restrict the results to a specific date range.

Extracted Terms
The most relevant terms and phrases are extracted from the downloaded pages. Relevance is determined by the standard deviation of term frequencies (TF-IDF [2]) from a base corpus such as Wikipedia. The terms are displayed with their frequency of occurrence in relevant (blue) and irrelevant (red) pages (bars to the right of the Terms panel). This helps the expert select terms that are more discriminating of relevant pages. Terms can be tagged as 'Positive' or 'Negative' with one click or two clicks, respectively; the tags are stored in the active data source. When the Update Pages button is clicked, the 'Positive' and 'Negative' tags are used to re-rank the terms using Bayesian Sets [1]. (Minimal sketches of both the deviation-based term ranking and the Bayesian Sets re-ranking appear at the end of this section.) Terms help the expert understand and discover new information about the domains of interest. A 'shift+click' on a term adds it to 'Query terms:' to either refine the Web search or start new subtopic searches. Custom relevant and irrelevant terms can be added in the text box below the Terms panel to boost the extraction of more relevant terms; these custom terms are distinguished by a delete icon before them, which can be clicked to remove them.

Terms Context
Hovering the mouse over a term in the Terms window displays the context in which it appears on the pages. This again helps the expert understand and disambiguate the relevant terms.
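The first of the two sketches mentioned under Extracted Terms ranks terms by how far their frequency in the downloaded pages deviates from a base corpus; the smoothing and the z-score formulation below are illustrative assumptions, not DDT's exact computation:

    from collections import Counter

    def rank_terms(page_tokens, base_counts, base_total):
        """page_tokens: all tokens from the downloaded pages;
        base_counts: term -> count in the base corpus (e.g., Wikipedia)."""
        counts = Counter(page_tokens)
        total = sum(counts.values())
        scores = {}
        for term, count in counts.items():
            p = count / total                                   # domain frequency
            p0 = (base_counts.get(term, 0) + 0.5) / base_total  # smoothed base rate
            se = (p0 * (1.0 - p0) / total) ** 0.5               # binomial std. error
            scores[term] = (p - p0) / se                        # deviation in std. units
        return sorted(scores, key=scores.get, reverse=True)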

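The second sketch re-ranks terms with Bayesian Sets [1], using the paper's closed-form Beta-Bernoulli score over binary term features; since the handling of 'Negative' tags is not detailed here, this sketch scores against the positively tagged seed set only:

    import numpy as np

    def bayesian_sets_scores(X, seed_idx, c=2.0):
        """X: (n_terms, n_features) binary matrix; seed_idx: 'Positive'-tagged rows."""
        mean = X.mean(axis=0)
        alpha = c * mean + 1e-9          # Beta prior, kept strictly positive
        beta = c * (1.0 - mean) + 1e-9
        N = len(seed_idx)
        s = X[seed_idx].sum(axis=0)      # per-feature counts in the seed set
        q = (np.log(alpha + s) - np.log(alpha)
             - np.log(beta + N - s) + np.log(beta))
        return X @ q                     # log score, up to an additive constant

    # Usage: after the user tags a few terms as 'Positive', re-rank all terms.
    rng = np.random.default_rng(0)
    X = (rng.random((100, 50)) < 0.2).astype(float)
    order = np.argsort(-bayesian_sets_scores(X, seed_idx=[3, 17, 42]))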
References

[1] Zoubin Ghahramani and Katherine A. Heller. Bayesian sets. In Advances in Neural Information Processing Systems, 2005.

[2] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, August 1988.

[3] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.