This collaboration will produce a framework for extracting novel science from large amounts of data in an environment where the computational needs vastly outweigh the available facilities and where intelligent, dynamic resource allocation is therefore required. New statistical theory will be developed that will allow current machine learning paradigms to scale to large parallel computing environments. The core result will be calibrated probabilistic statements about the physical nature of astronomical events, produced for projects that generate thousands of gigabytes of new data a night (such as the proposed Large Synoptic Survey Telescope). The framework will also uncover anomalous events that do not fit easily into the currently accepted classification taxonomy, events that may lead to completely new scientific discoveries. These tools should be applicable to any large data set where real-time response, informed intelligently by the data, is required.
Recent years have seen the advent of high-bandwidth sensor networks. They offer tremendous potential for ecosystem monitoring, such as early forest fire detection, stream level monitoring and early flood prediction, and monitoring soil ecology at unprecedented spatial and temporal resolutions. Other examples include seismic monitoring, target tracking, battlefield surveillance, security monitoring, medical care, traffic monitoring, and pollutant monitoring. Although each of these areas has unique characteristics and requirements, each faces data floods composed of multiple time series with a need for automatic classification. The technologies that facilitate such high-bandwidth data collection also create demand for new computational and statistical procedures. There is every reason to think that our efforts will contribute to these evolving methodologies.
Indeed, in year 3 (2011-2012) of the proposed effort, we will begin to formally interface with these communities, starting with the multidisciplinary 2-day workshop (“Real-Time Knowledge Extraction from Massive Data Streams”).
Our collaboration is sponsored by an NSF-CDI grant (award #0941742) “Real-time Classification of Massive Time-series Data Streams” (PI: Bloom).