Cornell University Center for Advanced Computing (CAC)

Press Release

Cornell develops analysis tools for large-scale web data

Contact: Paul Redfern
Cell: (607) 227-1865

FOR RELEASE: February 12, 2009

ITHACA, N.Y. – The web collection of the Internet Archive is a rich research resource for social scientists and economists who study the dynamics of social networks and markets over time. The size and complexity of archived web data, however, can pose serious data processing and management challenges.

To meet these challenges, Cornell University developed a family of data analysis software tools. “These tools are part of the Web Lab project, a joint project of Cornell and the Internet Archive,” explained William Arms, Professor, Computer Science. “The aim of the Web Lab is to organize large portions of these collections, so that they can be used by researchers who are not experts in data-intensive computing.”

One of the tools developed is the “Web Lab Collaboration Server,” a service for large-scale collaborative web data analysis. This tool demonstrates how to seamlessly support non-technical users during the search, extraction, and analysis of web data.

Cornell periodically transfers web crawls from the Internet Archive in San Francisco through a high-speed National Science Foundation TeraGrid connection. “To date, more than four complete web crawls constituting billions of pages have been downloaded to the Cornell Center for Advanced Computing (CAC),” explained Johannes Gehrke, Associate Professor, Computer Science.

However, even with the wealth of data publicly available, Gehrke and the Cornell Database Group note that there are three major obstacles to creating effective and practical data analysis applications, namely: (1) customized data sets must be prepared by writing extraction scripts tailored to the task at hand; (2) data sets must be cleaned or formatted, a step often needlessly repeated by different users; and (3) analysis code must be written to take advantage of parallelism, shared memory, or distributed computing power and storage.

To overcome these obstacles and deliver an end-to-end solution, Cornell based the development of the Web Lab Collaboration Server service on key user requirements. First, because many users of web data are experts in domains outside of computer science, the project team developed a simple, intuitive GUI for complex extraction and analysis tasks, and built on established MapReduce technology in the backend. Second, because data extraction, cleaning, and formatting are time-consuming and tedious tasks, once data sets are prepared for analysis, they are managed in a central repository to enable reuse and sharing among a community of researchers. Finally, a web-based, software-as-a-service architecture was chosen to enable users to leverage a powerful distributed computing and archiving platform for their extraction and analysis tasks. Applications use the infrastructure to process web crawls from the Internet Archive through remote services.

The “Content Service” provides clients with access to the full text and metadata of archived web pages and to extracted data sets and analysis results. A Web Lab Service API lets clients access the data repository through a variety of APIs. The “Visual Wrapper Generator” allows non-technical users to extract structured data from web pages without writing any code. Users can analyze extracted datasets in various ways using the “Visual Analysis Workbench.” This client-server application acts as a front end to a MapReduce Cluster that is used for large-scale data analysis tasks on the contents in the repository.
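
For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal, single-machine sketch in Python of the kind of analysis such a cluster parallelizes: counting archived pages per host. It is an illustration only, not Web Lab code; the record layout (a dictionary with a "url" field), the function names, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(page):
    """Map step: emit a (host, 1) pair for one page record (hypothetical layout)."""
    host = urlparse(page["url"]).netloc.lower()
    yield host, 1


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one host."""
    return key, sum(values)


def pages_per_host(pages):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for page in pages for pair in map_page(page))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample records standing in for archived pages.
sample_pages = [
    {"url": "http://example.org/a.html"},
    {"url": "http://example.org/b.html"},
    {"url": "http://Example.com/index.html"},
]

if __name__ == "__main__":
    print(pages_per_host(sample_pages))
```

On a real cluster the map and reduce functions would run in parallel across many machines, with the framework handling the grouping step; the sketch only shows how the work is divided.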

A paper describing the data analysis service, “Large-Scale Collaborative Analysis and Extraction of Web Data,” by Weigel, Panda, Riedewald, Gehrke, and Calimlim, was recently published in the Proceedings of the VLDB Endowment. A Flash video of an extraction process is available.

“The Cornell Database Group is well-versed at turning large-scale collections of rich, unstructured and noisy data into intuitive resources that are readily accessible to a variety of scientific disciplines,” concluded David Lifka, Director, Cornell Center for Advanced Computing.

The Web Lab is funded by the National Science Foundation (NSF). The Cornell Center for Advanced Computing receives support from Cornell University, NSF, DOD, USDA, and members of its corporate program.
