A Data Analysis Scenario using Virtual Data for the CMS Experiment

See the associated presentation made at the GriPhyN face-to-face meeting on 9 March 2000.

Introduction

The concept of Virtual Data is simply explained: data that may or may not yet exist in the data analysis system. An example is a collection of events that satisfy certain selection criteria. The collection remains virtual until it is made real by the execution of a task that requires it as input. At that point, if no physical instance of a collection satisfying the selection criteria exists, one is created by applying the selection criteria to the whole event sample. Another example is a virtual collection obtained by reclustering an existing event collection using, for example, sparse indexes on the events.
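
As a toy illustration (the class and names below are invented, not CMS software), a virtual collection can be thought of as a stored selection predicate that materialises itself the first time its events are actually needed:

    # Hypothetical illustration of a virtual event collection: the collection
    # holds only a selection predicate until a task first requests its events,
    # at which point it is materialised by scanning the full event sample.
    class VirtualCollection:
        def __init__(self, name, predicate, event_source):
            self.name = name
            self.predicate = predicate        # the selection criteria
            self.event_source = event_source  # the whole event sample
            self._events = None               # no physical instance yet

        def events(self):
            if self._events is None:          # materialise on first use
                self._events = [e for e in self.event_source if self.predicate(e)]
            return self._events

    # Example: a virtual collection of events with at least one isolated lepton.
    sample = [{"id": i, "n_isolated_leptons": i % 3} for i in range(10)]
    isolated = VirtualCollection("isolated leptons",
                                 lambda e: e["n_isolated_leptons"] >= 1, sample)
    print(len(isolated.events()))   # the collection becomes real here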

Commonly required event collections are likely to be already instantiated at various locations in the data analysis system. An example of a commonly required event collection for CMS will be events that have one or more isolated leptons.

Before describing the data analysis scenario, it is important to understand today's best guess at the composition and operation of the global data analysis system.

The system is a globally accessible ensemble of computing resources that includes aggregates of computers, high speed networks, Grid-based middleware, data persistency management services and analysis software. These resources are elaborated on below.

Aggregates of Computers

These represent the basic computing resources of the collaboration, and as such must be allocated to tasks as efficiently as possible. A target utilisation of around 70% is generally considered to be optimal. In some cases the resources will be shared with third party users, and site policies will be required to ensure that the resources can be partitioned so that the collaboration obtains its agreed share. The aggregates are physically located at Tier i centres (0 <= i <= 4 in current models). The Tier centres are interconnected in a virtual Grid, practically implemented as network links.

High Speed Networks

Current models (ref. MONARC) expect and plan for WAN speeds of at least OC12 (622 Mbps), and much greater speeds in the LAN. Target network (bandwidth) utilisation of around 60% is generally considered to be optimal, and this will probably be guaranteed by QoS coupled with appropriate leasing arrangements with the infrastructure providers (ISPs).

Grid-based Middleware

The keystone of the global data analysis system is the Grid-based middleware. This comprises services such as the Event Collection Catalogue, the Matchmaking Service and the Data Movement Services, all of which appear in the analysis scenario described below.

Data Persistency Management Services

The data persistency management services handle the access to and storage of the collaboration's data objects. These services will typically be implemented as an ODBMS located at each Tier centre. An additional service component handles communication with the Grid-based Event Collection Catalogue, informing the catalogue of the status and availability of event collections in the ODBMS. The operation, scaling behaviour, usability and storage overheads associated with the data persistency management services have been extensively researched and tested in the HEP community, and are now well understood.
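
A rough sketch of this communication component, with all class and method names invented for the illustration: the local service simply reports collection status changes to the Grid-wide catalogue, which can then be asked where a given collection is available.

    # Illustrative only: a local persistency service reporting the status of
    # its event collections to the Grid-wide Event Collection Catalogue.
    class EventCollectionCatalogue:
        def __init__(self):
            self.entries = {}   # (collection, site) -> status

        def update(self, collection, site, status):
            self.entries[(collection, site)] = status

        def locate(self, collection):
            return [site for (coll, site), status in self.entries.items()
                    if coll == collection and status == "available"]

    class LocalPersistencyService:
        def __init__(self, site, catalogue):
            self.site = site
            self.catalogue = catalogue

        def register(self, collection, status="available"):
            self.catalogue.update(collection, self.site, status)

    catalogue = EventCollectionCatalogue()
    caltech = LocalPersistencyService("Caltech Tier 2", catalogue)
    caltech.register("Higgs -> two photon candidates")
    print(catalogue.locate("Higgs -> two photon candidates"))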

Analysis Software

The analysis software is used by collaborators and/or production teams to derive physics results from the experimental data by analysing the properties of a collection of events. The algorithmic details of the analysis do not presently concern us; indeed, it is in any case impossible to predict an exhaustive list of possible analyses. What is highly probable is that the analysis software will use as input one or more collections of events. Each event in the collection will have different properties that depend on the underlying physics, manifested as identified particles, jets, missing energy, topology and other reconstructed quantities. These properties will be used in algorithms that typically determine whether or not the event is of interest to the investigator(s). Interesting events that survive the algorithmic selections will form a new collection. The statistical properties of the new collection, and sometimes the individual events themselves, are typically summarised in the form of histograms, graphs, tables or pictures destined for inclusion in publications.

Hierarchy of CMS Physics Data

The CMS experiment expects to collect around 1,000,000,000 events per year. Each event, as it emerges from the online data acquisition system, is a collection of raw data values from the various sub-detector elements of CMS, amounting to about 1 MByte of information in total. Each of these 1 MByte "RAW" events is processed by software that reconstructs the raw data into physics objects such as charged particle tracks and calorimeter energy clusters. The resulting "ESD" (Event Summary Data) amounts to roughly 100kByte per event. From the ESD information, a further process creates higher level physics objects such as particles, jets and missing energy, collectively known as Analysis Object Data (AOD). Each AOD event is of size 10kByte. Finally, each of the AOD events is summarised into a small set of salient characteristic quantities known as a TAG. The TAG events are approximately 100 Bytes in size.

In summary, the experiment will accumulate each year approximately 1 PByte of RAW data, 100 TBytes of ESD, 10 TBytes of AOD and 100 GBytes of TAG data.
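
These figures follow directly from the event count and per-event sizes quoted above; a trivial check:

    # Yearly data volumes implied by 10^9 events/year and the per-event sizes above.
    events_per_year = 1_000_000_000
    sizes = {"RAW": 1_000_000, "ESD": 100_000, "AOD": 10_000, "TAG": 100}  # bytes/event
    for sample, size in sizes.items():
        total_tb = events_per_year * size / 1e12
        print(f"{sample}: {total_tb:g} TBytes/year")   # RAW -> 1000 TBytes = 1 PByte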

The events are organised into "collections": basically lists of pointers to TAG or AOD events that fall into a specific category. A group of collaborators working on a particular physics channel will define for themselves a collection of events that are believed to be in this channel. We expect there to be at least 20 active mainstream collections of this type, and potentially hundreds of smaller collections that belong to individual physicists.
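
Concretely, one can picture a collection as nothing more than a name plus a list of object identifiers pointing into the TAG (or AOD) store; the following toy fragment (with invented identifiers) illustrates the idea:

    # A collection as a named list of pointers (object identifiers) into the
    # TAG/AOD event store -- purely illustrative, not the CMS object model.
    tag_store = {101: {"pt_gamma1": 45.0}, 102: {"pt_gamma1": 60.0}, 103: {"pt_gamma1": 20.0}}

    collection = {
        "name": "Higgs -> two photon candidates",
        "owner": "Higgs working group",
        "event_ids": [101, 102],          # pointers into the TAG store
    }

    for event_id in collection["event_ids"]:
        tag = tag_store[event_id]          # dereference the pointer
        print(event_id, tag["pt_gamma1"])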

Access to and Location of the Data

The global data analysis system is implemented on a Grid of computing resources located at a set of collaborating institutes and laboratories. The institutes and laboratories are arranged in a tiered hierarchy with CERN (the location of the experiment) the only Tier 0 Centre, a set of five or so Tier 1 Centres (typically national laboratories like FermiLab), 25 or so Tier 2 Centres (typically Universities with substantial compute resources dedicated to the experiment), and a hundred or so Tier 3 Centres which are typically department-sized installations belonging to HEP groups in Universities. The final (Tier 4) level corresponds to individual desktop computers.

The data are positioned at the Tier Centres according to their volume: the compact TAG and AOD data are replicated at all the Tier Centres, whereas the far more voluminous ESD is held at only a few of the larger centres, and the bulk of the RAW data remains at CERN.

Typical tasks foreseen include organised production passes run by production teams (reconstruction and the creation of the standard collections) and individual end-user analyses of the kind described in the scenario below.

Data Analysis Scenario

We first consider the use of the global system by an individual end user called Fred, working at Caltech (a Tier 2 centre).

Fred is new to the CMS collaboration and wishes to get his feet wet by looking at all the Higgs -> two photon events taken so far by the experiment. He has developed a neural network algorithm that uses AOD and ESD data to eliminate the mis-identification of fast pi0 particles as prompt photons. He wants to apply this algorithm to every two photon event, accept or reject the event, and then make a private collection of the accepted events for further study.

After two years of running, the Higgs -> two photon sample is expected to comprise around 500,000 events. This sample is a standard CMS collection, and the collection is available at all the Tier Centres. The size of the collection is approximately 50 MBytes (500,000 TAG events).

One problem that Fred has not foreseen is that he uses ESD in his neural network algorithm. Unlike the TAG and AOD data, the ESD is too copious to be stored at all Tier Centres. Fred's job will fail as soon as it attempts to follow an association from an AOD event to an ESD event, if the ESD event is unavailable at the Tier Centre where his job is running.

Task template

Fred prepares an analysis task using a CMS-supplied template. This template task contains the following steps:

  1. Opens the input event data collection
  2. Declares an iterator over all events in the collection
  3. Within the iteration loop, allows a callout to a user supplied module that applies an algorithm to determine whether the event satisfies the user's selection criteria
  4. Within the iteration loop, allows the user to mark the event for inclusion in an output data collection
  5. Closes the input and output data collections
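
In outline, and purely as a sketch (the class and method names here are invented, not the real CMS framework interface), the template corresponds to something like:

    # Sketch of the CMS-supplied analysis task template: open the input
    # collection, iterate over its events, call out to a user-supplied
    # selection module, mark accepted events for the output collection,
    # and close both.  All names are invented for illustration.
    class Collection:
        def __init__(self, name, events=None):
            self.name = name
            self._events = list(events or [])

        def events(self):
            return iter(self._events)

        def add(self, event):
            self._events.append(event)

        def close(self):
            pass   # persistency bookkeeping would happen here

    def run_template(input_collection, output_collection, user_module):
        for event in input_collection.events():     # iterate over all events
            if user_module(event):                   # user-supplied selection callout
                output_collection.add(event)         # mark for inclusion in the output
        input_collection.close()
        output_collection.close()

    # Example user module: keep events with two photons above 40 GeV.
    def select(event):
        return event["n_photons"] == 2 and event["pt_gamma_min"] > 40.0

    sample = Collection("two photon candidates",
                        [{"n_photons": 2, "pt_gamma_min": 55.0},
                         {"n_photons": 2, "pt_gamma_min": 30.0}])
    accepted = Collection("Fred's private collection")
    run_template(sample, accepted, select)
    print(len(list(accepted.events())))   # -> 1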

Fred modifies the template so that his neural network module is called for each event. He then builds the completed application on his desktop machine, a 3 GHz Pentium VI running Blue Hat Linux 2.5.1/a, and stores the executable in his home directory on the Caltech Tier 2 server.

Task Submission Wizard

Now, he starts up a Web-based form wizard that will guide him through the actual task submission. The wizard first helps him prepare some job control language (JCL) that will encapsulate his task submission to the global system. The JCL specifies the following: the input event data collection (here defined by selection criteria that pick out the two photon candidate events), the location of the executable he has built, and the name and access policy of the output collection, which is to be private to Fred.

The input data collection selection could also have named a collection Fred is aware of, e.g. <input type="Collection">CMS Higgs Candidates (all channels)</input>
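
In full, the JCL for this submission might look roughly like the following. Only the <input> element appears in the text above; the surrounding elements, names and paths are invented purely for illustration, and note that this version does not yet declare the ESD requirement that becomes important later in the scenario.

    <task owner="Fred">
      <input type="Selection">two photon candidate events</input>
      <executable>/home/fred/nn_photon_filter</executable>
      <data>TAG</data>
      <data>AOD</data>
      <output type="Collection" access="private">Fred's accepted two photon events</output>
    </task>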

Matchmaking Service

Satisfied with the JCL, and still guided by the wizard, Fred requests that the task be submitted to the matchmaking service.

The wizard opens a connection to the matchmaking service Web server, and sends Fred's encrypted proxy id as authentication. The matchmaking service (perhaps executing on a server at FermiLab) identifies him, and then goes through the following steps: it verifies his credentials and privileges, consults the Event Collection Catalogue to find existing or planned instances of the requested input collection, identifies the computing resources whose site policies admit the task, and estimates the cost of executing it at each.

The matchmaking server returns a list of the eligible matched resources in order of ascending cost.
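
The essence of the matching and ranking can be caricatured in a few lines; the sites, policies and costs below are entirely made up for the illustration:

    # Caricature of the matchmaking step: keep only resources that hold the
    # input collection and admit the user under site policy, then rank the
    # survivors by estimated cost.  All data here are invented.
    resources = [
        {"site": "Caltech Tier 2",  "has_collection": True, "admits_user": True,  "cost": 4.0},
        {"site": "FermiLab Tier 1", "has_collection": True, "admits_user": True,  "cost": 7.5},
        {"site": "CERN Tier 0",     "has_collection": True, "admits_user": False, "cost": 9.0},
    ]

    eligible = [r for r in resources if r["has_collection"] and r["admits_user"]]
    for r in sorted(eligible, key=lambda r: r["cost"]):    # ascending cost
        print(r["site"], r["cost"])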

Task Submission and Ticketing

Fred is now in a position to decide whether or not to submit his task, and to which resource. He selects the cheapest matched resource, and the wizard informs the matchmaking service of his choice. Now the matchmaking service goes through the next steps: it schedules the task on the chosen resource, arranges with the Data Movement Services for any input data that must be staged there, and returns a task ticket to the wizard.

Fred is now in possession of a unique task ticket that will enable him to monitor the progress of the task, and to cancel or suspend it as necessary.
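
The ticket itself need carry very little; the sketch below (with invented fields) shows the sort of object the wizard might hand back:

    # A minimal task ticket, as a sketch: a unique identifier plus enough
    # state to support monitoring, suspension and cancellation.
    import uuid

    class TaskTicket:
        def __init__(self, owner, resource):
            self.ticket_id = uuid.uuid4().hex   # unique ticket identifier
            self.owner = owner
            self.resource = resource
            self.status = "queued"              # queued / running / suspended / done / cancelled

        def suspend(self):
            self.status = "suspended"

        def cancel(self):
            self.status = "cancelled"

    ticket = TaskTicket("Fred", "Caltech Linux cluster")
    print(ticket.ticket_id, ticket.status)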

Task Execution

At some point later, the task begins execution on the assigned resource, a Linux cluster at Caltech. It attempts to open the input data collection. This requires contacting the local data persistency management services. The services will be interrogated for the existence of the required data collection. There are several possible scenarios:

  1. The data collection is available in its entirety
  2. The data collection is partially available (being moved in by the Data Movement Services or being created by a separate task)
  3. The data collection is unavailable, but is scheduled to be available at some time in the future (being moved in or created)
  4. The data collection is unavailable (this has to be considered as an error, since the task should not have been allocated to this resource without planned availability of the collection)

In case 1, the task begins execution normally. In case 2, if the task and persistency mechanism support streaming shared access, then the task can begin execution normally, otherwise a timeout or completion polling mechanism is required. In case 3, timeout or completion polling is required. In case 4 the task must abort.
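
The decision logic described in this paragraph can be summarised roughly as follows; the status values, polling interval and timeout are invented for the sketch:

    # Sketch of the four availability cases when the task tries to open its
    # input collection; status strings and polling details are invented.
    import time

    def open_input_collection(status, supports_streaming, poll, timeout=3600):
        if status == "available":                          # case 1
            return "run"
        if status == "partially_available":                # case 2
            if supports_streaming:
                return "run"                               # stream events as they arrive
            status = "scheduled"                           # otherwise treat like case 3
        if status == "scheduled":                          # case 3
            deadline = time.time() + timeout
            while time.time() < deadline:
                if poll() == "available":
                    return "run"
                time.sleep(60)                             # completion polling
            return "abort"                                 # timed out waiting
        return "abort"                                     # case 4: should not happen

    # Example: collection scheduled to arrive, and the first poll finds it ready.
    print(open_input_collection("scheduled", False, lambda: "available"))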

Note that the local data persistency management services are also called upon to manage any new collections created by the task. These services will in turn inform the Event Collection Catalogue of the existence of Fred's output collection. As specified by the JCL, this collection will be marked as "private" and not visible to other normally privileged users.

Task Failure and Resubmission

In Fred's task, the iterator over all events in the two photon collection extracts each TAG object one at a time from the database. The TAG event object contains an association to the corresponding AOD event object. The task pulls the AOD object in turn from the database. The TAG event object also contains an association to the corresponding ESD event object. The task requires the ESD data as well, and attempts to pull this object from the database.

However, this fails, because the Caltech database does not contain the ESD: the association cannot be traversed. Fred's task fails with an informative error message indicating that ESD is not available for one or more events.
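
Schematically, the failure occurs when the TAG -> AOD -> ESD chain of associations is followed at a site whose database holds no ESD. The fragment below is a toy model of the pattern, not the real ODBMS navigation:

    # Sketch of the TAG -> AOD -> ESD association chain.  At this Tier 2
    # centre the ESD store is empty, so the second association cannot be
    # traversed and the task fails with an informative message.
    class MissingESDError(Exception):
        pass

    aod_store = {1: {"photons": 2}}    # AOD present at the Tier 2 centre
    esd_store = {}                      # ESD not replicated to this centre

    tag_event = {"event_id": 1, "aod_ref": 1, "esd_ref": 1}

    aod = aod_store[tag_event["aod_ref"]]                  # association traversed OK
    try:
        if tag_event["esd_ref"] not in esd_store:          # association cannot be traversed
            raise MissingESDError("ESD not available for event %d at this Tier Centre"
                                  % tag_event["event_id"])
        esd = esd_store[tag_event["esd_ref"]]
    except MissingESDError as err:
        print("Task fails:", err)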

The use of the ESD is a special requirement that can only be met at a few Tier Centres. Fred must now return to the Wizard and develop JCL that states the ESD requirement explicitly. Once this is done, the task description can again be sent to the Matchmaking Service for assignment.

Julian Bunn, March 2000