Deployment and Use of Very Large Databases for
Next-generation Particle Physics Experiments
Julian J. Bunn and Harvey B. Newman
Caltech
A data thunderstorm is gathering on the horizon with the next
generation of particle physics experiments. The amount of data is overwhelming.
Even though the prime data from the CERN CMS detector will be reduced by a
factor of more than 10^7, it will still amount to over a Petabyte (10^15
bytes) of data per year accumulated for scientific analysis. The task of
finding the rare events that result from the decays of massive new particles,
buried in a dominating background, is even more formidable. Particle physicists
have been at the vanguard of data-handling technology, beginning in the 1940s
with eye scanning of bubble-chamber photographs and emulsions, through decades
of electronic data acquisition systems employing real-time pattern recognition,
filtering and formatting, and continuing on to the Petabyte archives generated
by modern experiments. In the future, CMS and other experiments now being built
to run at CERN’s Large Hadron Collider expect to accumulate on the order of 100
Petabytes within the next decade.
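As a rough, back-of-the-envelope check on that scale, the sketch below works
through the arithmetic. The stored event rate, event size and annual live time
are illustrative round numbers, not official CMS parameters.

```cpp
// Back-of-the-envelope estimate of the annual archived data volume.
// All three inputs are assumed round numbers for illustration only.
#include <cstdio>

int main() {
    const double stored_event_rate_hz  = 100.0;  // events written to storage per second (assumed)
    const double event_size_bytes      = 1.0e6;  // ~1 MB per stored event (assumed)
    const double live_seconds_per_year = 1.0e7;  // ~one accelerator year of running (assumed)

    const double bytes_per_year =
        stored_event_rate_hz * event_size_bytes * live_seconds_per_year;

    std::printf("Estimated archive growth: %.1f Petabytes per year\n",
                bytes_per_year / 1.0e15);        // prints 1.0 with these inputs
    return 0;
}
```

With these assumptions the archive grows by roughly one Petabyte per year,
consistent with the figure quoted above.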
The
scientific goals and discovery potential of the experiments will only be
realized if efficient worldwide access to the data is made possible. Particle
physicists are thus engaged in large national and international projects that
address this massive data challenge, with special emphasis on distributed data
access. There is an acute awareness that the ability to analyze data has not
kept up with its increased flow. The traditional approach of extracting data
subsets across the Internet, storing them locally, and processing them with
home-brewed tools has reached its limits. Something drastically different is
required. Indeed, without new modes of data access and of remote collaboration
we will not be able to effectively “mine” the intellectual resources
represented in our distributed collaborations. Thus the projects we are working
on explore and implement new ideas in this area that until now have only been
discussed in a theoretical context. These ground-breaking projects include:
· Globally Interconnected Object Databases (Caltech/CERN/HP funded) [1]
· The Particle Physics Data Grid (DoE/NGI funded) [2]
· Models Of Networked Analysis at Regional Centres (CERN funded) [3]
· Accessing Large Data Archives in Astronomy and Particle Physics (NSF/KDI funded) [4]
To be as realistic as possible, the projects make use of
large existing data sets from high energy and nuclear physics experiments. They
will help to answer a number of important questions, including:
· How are we going to integrate the querying algorithms and other tools to speed up access to the distributed data?
· How are we going to cluster the data optimally for fast access? (A toy sketch follows this list.)
· How can we optimize the clustering and querying of data distributed across continents?
· What dynamical re-clustering strategies should be used?
· How do we compromise between a fully ordered (sequential) organization and a totally “anarchic”, random arrangement of the data?
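The following toy sketch illustrates the clustering question. The event
structure, the choice of clustering attribute and the bin width are all
invented for the example and are not taken from any of the projects above; the
point is only that events co-located by a frequently queried attribute let a
selective query read a few containers instead of scanning the whole store.

```cpp
// Toy sketch of attribute-based clustering: events are grouped into
// "containers" keyed by a coarse tag (here, a bin of summed transverse
// energy), so a query over a narrow range of that tag opens only the
// matching containers rather than every event in the store.
// The Event fields and the 100 GeV bin width are illustrative only.
#include <cstdio>
#include <map>
#include <vector>

struct Event {
    long   id;
    double sumEt;   // scalar sum of transverse energy, in GeV (illustrative attribute)
};

using Container      = std::vector<Event>;
using ClusteredStore = std::map<int, Container>;   // key: sumEt bin

int etBin(double sumEt) { return static_cast<int>(sumEt / 100.0); }  // 100 GeV-wide bins

void store(ClusteredStore& db, const Event& ev) {
    db[etBin(ev.sumEt)].push_back(ev);             // co-locate events with similar sumEt
}

// A query for high-sumEt events only walks the containers whose bin can
// overlap the requested range, instead of scanning the full data set.
std::vector<Event> query(const ClusteredStore& db, double minSumEt) {
    std::vector<Event> result;
    for (auto it = db.lower_bound(etBin(minSumEt)); it != db.end(); ++it)
        for (const Event& ev : it->second)
            if (ev.sumEt >= minSumEt) result.push_back(ev);
    return result;
}

int main() {
    ClusteredStore db;
    store(db, {1, 42.0});
    store(db, {2, 530.0});
    store(db, {3, 615.0});
    std::printf("events with sumEt >= 500 GeV: %zu\n", query(db, 500.0).size());
    return 0;
}
```

In this picture, a dynamic re-clustering strategy would amount to changing the
binning (or the clustering attribute itself) and migrating events between
containers as the query patterns evolve.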
The use
of OO languages and object persistency is fundamental in our current thinking:
these technologies allow us to define, implement and store the physics objects
and inter-relationships that we deal with. We can then express highly
complicated queries on the object store in order to extract the events and
features of interest.
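A minimal sketch of what this looks like in practice follows. The class names,
the in-memory stand-in for the persistent store, and the select() helper are
hypothetical, and not the API of any particular object database.

```cpp
// Minimal sketch: physics objects and their relationships are declared
// once as classes, stored as objects, and later selected with queries
// phrased directly in terms of those objects.
// The classes and select() helper are hypothetical; a real object
// database supplies its own persistence and query machinery.
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct Track {                      // a reconstructed charged-particle track
    double pt;                      // transverse momentum, in GeV
    int    charge;
};

struct Event {                      // an event owns its tracks (an object relationship)
    long               runNumber;
    long               eventNumber;
    std::vector<Track> tracks;
};

// Stand-in for a persistent object store: here just an in-memory collection.
using ObjectStore = std::vector<std::shared_ptr<Event>>;

// A query is expressed as a predicate over the stored objects themselves.
ObjectStore select(const ObjectStore& store,
                   const std::function<bool(const Event&)>& predicate) {
    ObjectStore selected;
    for (const auto& ev : store)
        if (predicate(*ev)) selected.push_back(ev);
    return selected;
}

int main() {
    ObjectStore store;
    store.push_back(std::make_shared<Event>(Event{1, 1001, {{120.0, +1}, {95.0, -1}}}));
    store.push_back(std::make_shared<Event>(Event{1, 1002, {{8.5, -1}}}));

    // "Find events with at least two tracks above 50 GeV", phrased as an object query.
    auto selected = select(store, [](const Event& ev) {
        int n = 0;
        for (const Track& t : ev.tracks)
            if (t.pt > 50.0) ++n;
        return n >= 2;
    });
    std::printf("selected %zu of %zu events\n", selected.size(), store.size());
    return 0;
}
```

The essential point is that the selection is written against the stored objects
and their relationships, rather than against rows or flat files extracted from
them.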
These research directions will very likely be taken up in other branches
of science and in large corporations: the ability to rapidly mine scientific
data, and the use of smart query engines, will be a fundamental part of daily
research and education in the 21st century.