See the associated presentation made at the GriPhyN face-to-face meeting on 9 March 2000.
The concept of Virtual Data is simply explained: it is data that may or may not physically exist in the data analysis system. An example is a collection of events that satisfy certain selection criteria. The collection remains virtual until it is made real by the execution of a task that requires it as input. At that point, if no physical instance of a collection satisfying the selection criteria exists, one will be created by applying the selection criteria to the whole event sample. Another example is a virtual collection obtained by reclustering an event collection using, for example, sparse indexes on the events.
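As a crude illustration of this idea, the following Python sketch shows how a request for a virtual collection might be resolved. The names used here (catalogue, selection, event_sample) are invented for illustration and are not part of any real CMS or Grid interface.

```python
# Illustrative sketch only: the catalogue, selection and event_sample interfaces
# used here are invented names, not part of any real CMS or Grid API.

def resolve_collection(catalogue, selection, event_sample):
    """Return a physical collection satisfying 'selection'.

    If a physical instance is already registered in the catalogue it is reused;
    otherwise the selection is executed over the whole event sample and the
    result is registered, i.e. the virtual collection becomes real on first use.
    """
    existing = catalogue.lookup(selection)
    if existing is not None:
        return existing  # a physical instance already exists somewhere

    # No physical instance: materialise the collection by running the selection.
    new_collection = [event for event in event_sample if selection.accepts(event)]
    catalogue.register(selection, new_collection)
    return new_collection
```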
Commonly required event collections are likely to be already instantiated at various locations in the data analysis system. An example of a commonly required event collection for CMS is the set of events that have one or more isolated leptons.
Before describing the data analysis scenario, it is important to understand today's best guess at the composition and operation of the global data analysis system.
The system is a globally accessible ensemble of computing resources that includes
These resources are elaborated on below:
These represent the basic computing resources of the collaboration, and as such must be allocated to tasks as efficiently as possible. A target utilisation of around 70% is generally considered optimal. In some cases the resources will be shared with third-party users, and site policies will be required to ensure that the resources can be partitioned so that the collaboration obtains its agreed share. The aggregates are physically located at Tier i centres (0 ≤ i ≤ 4 in current models). The tiered centres are interconnected in a virtual Grid, practically implemented as network links.
Current models (ref. MONARC) expect and plan for WAN speeds of at least OC12 (622 Mbps), and much greater speeds in the LAN. Target network (bandwidth) utilisation of around 60% is generally considered to be optimal, and this will probably be guaranteed by QoS coupled with appropriate leasing arrangements with the infrastructure providers (ISPs).
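For a rough sense of scale, assuming these targets are met: 60% utilisation of an OC12 link corresponds to about 0.6 × 622 Mbps ≈ 370 Mbps, or roughly 45 MBytes/s, so a 50 MByte event collection (such as the TAG sample discussed below) could move between Tier Centres in about a second, while a terabyte-scale sample would occupy the link for several hours.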
The keystone of the global data analysis system is the Grid-based middleware. It comprises
and, given these, to the available resources:
With these parameters, and a knowledge of existing and queued tasks in the system, the matchmaking service will provide an estimated cost and duration for the task at each of the (zero, one or more) possible execution locations. This information is sent back to the task submitter, possibly as a set of tickets containing the authentication parameters and the necessary routing and task submission semantics for each execution location. To execute the task, the submitter must have sufficient quota for whichever execution location (s)he subsequently chooses.
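A minimal sketch of the kind of information the matchmaking service might return is given below. The class and field names are invented; nothing here is defined in this note.

```python
# Illustrative sketch only: the classes and field names below are invented; the
# note does not define the real matchmaking or ticket data structures.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Estimate:
    location: str          # candidate execution location, e.g. a Tier 2 cluster
    cost: float            # estimated cost, in whatever quota units the site uses
    duration_hours: float  # estimated duration, given existing and queued tasks

@dataclass
class Ticket:
    location: str
    auth_token: str        # authentication parameters for the chosen site
    submit_endpoint: str   # routing and task submission semantics

def choose_location(estimates: List[Estimate]) -> Optional[Estimate]:
    """Pick the lowest-cost location, as Fred does later in this scenario."""
    return min(estimates, key=lambda e: e.cost) if estimates else None
```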
The data persistency management services handle the access to and storage of the collaboration's data objects. The implementation of these services will typically be as an ODBMS located at each Tier i centre. An additional service component handles communication with the Grid-based Event Collection Catalogue, informing the catalogue of the status and availability of event collections in the ODBMS. The operation, scaling behaviour, usability and storage overheads associated with the data persistency management services have been extensively researched and tested in the HEP community, and are now well understood.
The analysis software is used by collaborators and/or production teams to derive physics results from the experimental data by analysing the properties of a collection of events. The algorithmic details of the analysis do not presently concern us; indeed, it is in any case impossible to predict an exhaustive list of possible analyses. What is highly probable is that the analysis software will use as input one or more collections of events. Each event in the collection will have different properties that depend on the underlying physics, manifested as identified particles, jets, missing energy, topology and other reconstructed quantities. These properties will be used in algorithms that typically determine whether or not the event is of interest to the investigator(s). Interesting events that survive the algorithmic selections form a new collection. The statistical properties of the new collection, and sometimes the individual events themselves, are typically summarised in the form of histograms, graphs, tables or pictures destined for inclusion in publications.
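In schematic terms, such an analysis task might look like the following Python sketch. The collection, event accessors and histogram are invented stand-ins for whatever the real framework provides.

```python
# Illustrative sketch only: the collection, event accessors and histogram used
# here are invented names standing in for the real analysis framework.

def analyse(input_collection, is_interesting, histogram):
    """Loop over an event collection, keep the interesting events, and
    summarise their properties for later plots and tables."""
    selected = []
    for event in input_collection:
        # 'is_interesting' encapsulates the cuts on leptons, jets, missing
        # energy, topology and other reconstructed quantities.
        if is_interesting(event):
            selected.append(event)
            histogram.fill(event.invariant_mass())  # one example summary quantity
    return selected  # the surviving events form a new collection
```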
The CMS experiment expects to collect around 1,000,000,000 events per year. Each event, as it emerges from the online data acquisition system, is a collection of raw data values from the various sub-detector elements of CMS, amounting to about 1 MByte of information in total. Each of these 1 MByte "RAW" events is processed by software that reconstructs the raw data into physics objects such as charged particle tracks and calorimeter energy clusters. The resulting "ESD" (Event Summary Data) amounts to roughly 100 kBytes per event. From the ESD information, a further process creates higher level physics objects such as particles, jets and missing energy, collectively known as Analysis Object Data (AOD). Each AOD event is about 10 kBytes in size. Finally, each of the AOD events is summarised into a small set of salient characteristic quantities known as a TAG. The TAG events are approximately 100 Bytes in size.
In summary, per year: roughly 1 PByte of RAW data, 100 TBytes of ESD, 10 TBytes of AOD and 100 GBytes of TAG data.
The events are organised into "collections": basically lists of pointers to TAG or AOD events that fall into a specific category. A group of collaborators working on a particular physics channel will define for themselves a collection of events that are believed to be in this channel. We expect there to be at least 20 active mainstream collections of this type, and potentially hundreds of smaller collections that belong to individual physicists.
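One possible way to picture such a collection, purely as an illustration with invented names (a real collection would be stored in the ODBMS, not as Python objects), is sketched below.

```python
# Illustrative sketch only: these class and field names are invented; a real
# collection would live inside the ODBMS rather than in Python objects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EventRef:
    database: str   # which Tier Centre / database federation holds the event
    object_id: int  # identifier of the TAG (or AOD) object within it

@dataclass
class EventCollection:
    name: str                  # e.g. a physics-channel collection or a private one
    owner: str                 # physics group or individual physicist
    refs: List[EventRef] = field(default_factory=list)  # pointers, not event copies

    def __len__(self) -> int:
        return len(self.refs)
```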
The global data analysis system is implemented on a Grid of computing resources located at a set of collaborating institutes and laboratories. The institutes and laboratories are arranged in a tiered hierarchy, with CERN (the location of the experiment) as the only Tier 0 Centre, a set of five or so Tier 1 Centres (typically national laboratories such as FermiLab), 25 or so Tier 2 Centres (typically universities with substantial compute resources dedicated to the experiment), and a hundred or so Tier 3 Centres, which are typically department-sized installations belonging to HEP groups in universities. The final (Tier 4) level corresponds to individual desktop computers.
The data are positioned at the Tier Centres as follows:
Typical tasks foreseen include
We first consider the use of the global system by an individual end user called Fred, working at Caltech (a Tier 2 centre).
Fred is new to the CMS collaboration and wishes to get his feet wet by looking at all the Higgs -> two photon events taken so far by the experiment. He has developed a neural network algorithm that uses AOD and ESD data to eliminate the mis-identification of fast pi0 particles as prompt photons. He wants to apply this algorithm to every two photon event, accept or reject the event, and then make a private collection of the accepted events for further study.
After two years of running, the Higgs -> two photon sample is expected to comprise around 500,000 events. This sample is a standard CMS collection, and the collection is available at all the Tier Centres. The size of the collection is approximately 50 MBytes (500,000 TAG events).
One problem that Fred has not foreseen is that he uses ESD in his neural network algorithm. Unlike the TAG and AOD data, the ESD is too copious to be stored at all Tier Centres. Fred's job will fail as soon as it attempts to follow an association from an AOD event to an ESD event, if the ESD event is unavailable at the Tier Centre at which his job is running.
Fred prepares an analysis task using a CMS-supplied template. This template task contains the following steps:
Fred modifies the template so that his neural network module is called for each event. He then builds the completed application on his desktop machine, a 3 GHz Pentium VI running Blue Hat Linux 2.5.1/a. He then stores the executable in his home directory on the Caltech Tier 2 server.
Now, he starts up a Web-based form wizard that will guide him through the actual task submission. The wizard first helps him prepare some job control language (JCL) that will encapsulate his task submission to the global system. The JCL specifies the following:
The input data collection selection could also have named a collection Fred is aware of, e.g. <input type="Collection">CMS Higgs Candidates (all channels)</input>
Satisfied with the JCL, and still guided by the wizard, Fred requests that the task be submitted to the matchmaking service.
The wizard opens a connection to the matchmaking service Web server, and sends Fred's encrypted proxy id as authentication. The matchmaking service (perhaps executing on a server at FermiLab) identifies him, and then goes through the following steps:
The matchmaking server returns a list of the eligible matched resources in order of ascending cost.
Fred is now in a position to decide whether or not to submit his task, and to which resource. He selects the cheapest matched resource, and the wizard informs the matchmaking service of his choice. Now the matchmaking service goes through the next steps:
Fred is now in possession of a unique task ticket that will enable him to monitor the progress of the task, and to cancel or suspend it as necessary.
At some point later, the task begins execution on the assigned resource, a Linux cluster at Caltech. It attempts to open the input data collection. This requires contacting the local data persistency management services. The services will be interrogated for the existence of the required data collection. There are several possible scenarios:
In case 1, the task begins execution normally. In case 2, if the task and persistency mechanism support streaming shared access, then the task can begin execution normally, otherwise a timeout or completion polling mechanism is required. In case 3, timeout or completion polling is required. In case 4 the task must abort.
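The timeout or completion-polling behaviour might look roughly like the following sketch. The persistency interface and its status values are invented; they simply mirror the outcomes described above.

```python
# Illustrative sketch only: persistency.status() and its return values are
# invented; the note does not specify the real polling interface.
import time

def wait_for_collection(persistency, collection_name, timeout_s=3600, poll_s=60):
    """Poll the local data persistency management services until the required
    collection can be opened, or give up after 'timeout_s' seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = persistency.status(collection_name)
        if status == "available":
            return True   # the collection can be opened; the task proceeds
        if status == "impossible":
            return False  # the collection cannot be provided here; the task must abort
        time.sleep(poll_s)  # still being produced elsewhere: poll again later
    return False  # timed out
```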
Note that the local data persistency management services are also called upon to manage any new collections created by the task. These services will in turn inform the Event Collection Catalogue of the existence of Fred's output collection. As specified by the JCL, this collection will be marked as "private" and not visible to other normally privileged users.
In Fred's task, the iterator over all events in the two photon collection extracts each TAG object one at a time from the database. The TAG event object contains an association to the corresponding AOD event object. The task pulls the AOD object in turn from the database. The TAG event object also contains an association to the corresponding ESD event object. The task requires the ESD data as well, and attempts to pull this object from the database.
However, this fails, because the Caltech database does not contain the ESD: the association cannot be traversed. Fred's task fails with an informative error message indicating that ESD is not available for one or more events.
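Schematically, the failing traversal might look like this; the event accessors and the exception type are invented names that mirror the behaviour described above.

```python
# Illustrative sketch only: the event accessors and the AssociationError
# exception are invented names describing the behaviour in the text above.

class AssociationError(Exception):
    """Raised when an object association cannot be traversed in the local database."""

def process_event(tag_event, neural_net):
    aod_event = tag_event.aod()      # follow the TAG -> AOD association
    try:
        esd_event = tag_event.esd()  # follow the TAG -> ESD association
    except AssociationError:
        # The ESD is not resident in the local (Caltech) database, so the
        # association cannot be traversed and the task reports the failure.
        raise RuntimeError("ESD not available for one or more events at this Tier Centre")
    return neural_net.accepts(aod_event, esd_event)
```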
The use of the ESD is a special requirement that can only be met at a few Tier Centres. Fred must now return to the wizard to develop the necessary JCL indicating the ESD requirement. Once this is done, the task description can again be sent to the matchmaking service for assignment.
Julian Bunn, March 2000