Deployment and Use of Very Large Databases for
Next-generation Particle Physics Experiments
Julian J. Bunn and Harvey B. Newman
Caltech
A data thunderstorm is gathering on the horizon with the next
generation of particle physics experiments. The amount of data is overwhelming.
Even though the prime data from the CERN CMS detector will be reduced by a
factor of more than 10^7, it will still amount to over a Petabyte (10^15
bytes) of data per year accumulated for scientific analysis. The task of
finding the rare events that result from the decays of massive new particles,
buried in a dominating background, is even more formidable. Particle physicists
have been at the vanguard of data-handling technology, beginning in the 1940s
with eye scanning of bubble-chamber photographs and emulsions, through decades
of electronic data acquisition systems employing real-time pattern recognition,
filtering and formatting, and continuing on to the Petabyte archives generated
by modern experiments. In the future, CMS and other experiments now being built
to run at CERN’s Large Hadron Collider expect to accumulate on the order of 100
Petabytes within the next decade.
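As a rough, back-of-the-envelope check on that scale, the sketch below works
through the arithmetic. The stored event rate, event size and annual live time
are illustrative round numbers, not official CMS parameters.

```cpp
// Back-of-the-envelope estimate of the annual archived data volume.
// All three inputs are assumed round numbers for illustration only.
#include <cstdio>

int main() {
    const double stored_event_rate_hz  = 100.0;  // events written to storage per second (assumed)
    const double event_size_bytes      = 1.0e6;  // ~1 MB per stored event (assumed)
    const double live_seconds_per_year = 1.0e7;  // ~one accelerator year of running (assumed)

    const double bytes_per_year =
        stored_event_rate_hz * event_size_bytes * live_seconds_per_year;

    std::printf("Estimated archive growth: %.1f Petabytes per year\n",
                bytes_per_year / 1.0e15);        // prints 1.0 with these inputs
    return 0;
}
```

With these assumptions the archive grows by roughly one Petabyte per year,
consistent with the figure quoted above.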
The
scientific goals and discovery potential of the experiments will only be
realized if efficient worldwide access to the data is made possible. Particle
physicists are thus engaged in large national and international projects that
address this massive data challenge, with special emphasis on distributed data
access. There is an acute awareness that the ability to analyze data has not
kept up with its increased flow. The traditional approach of extracting data
subsets across the Internet, storing them locally, and processing them with
home-brewed tools has reached its limits. Something drastically different is
required. Indeed, without new modes of data access and of remote collaboration
we will not be able to effectively “mine” the intellectual resources
represented in our distributed collaborations. Thus the projects we are working
on explore and implement new ideas in this area that until now have only been
discussed in a theoretical context. These ground-breaking projects include:
· Globally Interconnected Object Databases (Caltech/CERN/HP funded) [1]
· The Particle Physics Data Grid (DoE/NGI funded) [2]
· Models Of Networked Analysis at Regional Centres (CERN funded) [3]
· Accessing Large Data Archives in Astronomy and Particle Physics (NSF/KDI funded) [4]
To be as realistic as possible, the projects make use of
large existing data sets from high energy and nuclear physics experiments. They
will help to answer a number of important questions, including:
· How are we going to integrate the querying algorithms and other tools to speed up access to the distributed data?
· How are we going to cluster the data optimally for fast access? (A toy sketch follows this list.)
· How can we optimize the clustering and querying of data distributed across continents?
· What dynamical re-clustering strategies should be used?
· How do we compromise between a fully ordered (sequential) organization and a totally “anarchic”, random arrangement of the data?
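The following toy sketch illustrates the clustering question. The event
structure, the choice of clustering attribute and the bin width are all
invented for the example and are not taken from any of the projects above; the
point is only that events co-located by a frequently queried attribute let a
selective query read a few containers instead of scanning the whole store.

```cpp
// Toy sketch of attribute-based clustering: events are grouped into
// "containers" keyed by a coarse tag (here, a bin of summed transverse
// energy), so a query over a narrow range of that tag opens only the
// matching containers rather than every event in the store.
// The Event fields and the 100 GeV bin width are illustrative only.
#include <cstdio>
#include <map>
#include <vector>

struct Event {
    long   id;
    double sumEt;   // scalar sum of transverse energy, in GeV (illustrative attribute)
};

using Container      = std::vector<Event>;
using ClusteredStore = std::map<int, Container>;   // key: sumEt bin

int etBin(double sumEt) { return static_cast<int>(sumEt / 100.0); }  // 100 GeV-wide bins

void store(ClusteredStore& db, const Event& ev) {
    db[etBin(ev.sumEt)].push_back(ev);             // co-locate events with similar sumEt
}

// A query for high-sumEt events only walks the containers whose bin can
// overlap the requested range, instead of scanning the full data set.
std::vector<Event> query(const ClusteredStore& db, double minSumEt) {
    std::vector<Event> result;
    for (auto it = db.lower_bound(etBin(minSumEt)); it != db.end(); ++it)
        for (const Event& ev : it->second)
            if (ev.sumEt >= minSumEt) result.push_back(ev);
    return result;
}

int main() {
    ClusteredStore db;
    store(db, {1, 42.0});
    store(db, {2, 530.0});
    store(db, {3, 615.0});
    std::printf("events with sumEt >= 500 GeV: %zu\n", query(db, 500.0).size());
    return 0;
}
```

In this picture, a dynamic re-clustering strategy would amount to changing the
binning (or the clustering attribute itself) and migrating events between
containers as the query patterns evolve.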
The use
of OO languages and object persistency is fundamental in our current thinking:
these technologies allow us to define, implement and store the physics objects
and inter-relationships that we deal with. We can then express highly
complicated queries on the object store in order to extract the events and
features of interest.
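A minimal sketch of what this looks like in practice follows. The class names,
the in-memory stand-in for the persistent store, and the select() helper are
hypothetical, and not the API of any particular object database.

```cpp
// Minimal sketch: physics objects and their relationships are declared
// once as classes, stored as objects, and later selected with queries
// phrased directly in terms of those objects.
// The classes and select() helper are hypothetical; a real object
// database supplies its own persistence and query machinery.
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct Track {                      // a reconstructed charged-particle track
    double pt;                      // transverse momentum, in GeV
    int    charge;
};

struct Event {                      // an event owns its tracks (an object relationship)
    long               runNumber;
    long               eventNumber;
    std::vector<Track> tracks;
};

// Stand-in for a persistent object store: here just an in-memory collection.
using ObjectStore = std::vector<std::shared_ptr<Event>>;

// A query is expressed as a predicate over the stored objects themselves.
ObjectStore select(const ObjectStore& store,
                   const std::function<bool(const Event&)>& predicate) {
    ObjectStore selected;
    for (const auto& ev : store)
        if (predicate(*ev)) selected.push_back(ev);
    return selected;
}

int main() {
    ObjectStore store;
    store.push_back(std::make_shared<Event>(Event{1, 1001, {{120.0, +1}, {95.0, -1}}}));
    store.push_back(std::make_shared<Event>(Event{1, 1002, {{8.5, -1}}}));

    // "Find events with at least two tracks above 50 GeV", phrased as an object query.
    auto selected = select(store, [](const Event& ev) {
        int n = 0;
        for (const Track& t : ev.tracks)
            if (t.pt > 50.0) ++n;
        return n >= 2;
    });
    std::printf("selected %zu of %zu events\n", selected.size(), store.size());
    return 0;
}
```

The essential point is that the selection is written against the stored objects
and their relationships, rather than against rows or flat files extracted from
them.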
These research directions will very likely be taken up in other branches
of science and in large corporations: the ability to rapidly mine scientific
data, and the use of smart query engines, will be a fundamental part of daily
research and education in the 21st century.