What is commonly considered the World Wide Web is a small fraction of the data available on the Internet. The volume of hypertext accessible to conventional search engines is 400 to 550 times smaller than the 7.5 petabytes of networked databases from directory services, information portals, scientists, government agencies and other providers. Our goal is to explore the mechanisms for and consequences of aggressively leveraging this underutilized resource.
This data is often referred to as the deep web or the hidden web, but that nomenclature is misleading, since the data it refers to is neither hyperlinked nor text, and hence not much like the World Wide Web. To highlight these distinctions from the World Wide Web, we refer to this data as the Federated Facts and Figures on the Internet, or simply the FFF.
The work we propose has a number of goals. First, we wish to study algorithms and develop systems that will enable effective, easy-to-use tools for exploiting facts and figures on the Internet. To this end, we propose a number of systems research problems in the context of a prototype system called Telegraph that is under development at Berkeley. One aspect of our proposal is to vigorously pursue Telegraph's nascent agenda to develop adaptive techniques for query processing, which can nimbly adjust to the volatility of performance and data that is characteristic of the Internet. Another aspect is to extend Telegraph with the capability to trawl large amounts of data from the FFF by running recursive queries over multiple data sources.
The second goal of the proposal is to explore the ramifications of providing FFF tools to the broad Internet user base, which is likely to include multiple parties, some of whom have adversarial intentions. To help motivate these problems, we discuss our experience developing an initial application over Telegraph, which combines data from various FFF sources to provide insights into the campaign finances of the recent presidential election. This application was placed live on the web in the month before the election, and displayed publicly available but nonetheless surprising combinations of data about both individual donors and larger demographic trends. In designing the application, we became sensitive to a number of issues related to privacy, data quality, and the economics of vigorously exploiting currently free Internet services; these are issues that we propose to study more deeply.
In light of these issues, our third goal in this proposal is to explore the design space of countermeasures that can prevent FFF technologies from being misused. On this count, we discuss initial ideas for detecting undesired bulk data access, for better ensuring the quality of combined data, and for enabling clients to understand how servers are using their personal information. The proposed work cuts across a variety of research areas including databases, algorithms, machine learning, web information retrieval, economics, and economic policy.
