See the associated presentation made at the GriPhyN face-to-face meeting on 9 March 2000.
The concept of Virtual Data is simply explained: it is data that may or may not physically exist in the data analysis system. An example is a collection of events that satisfy certain selection criteria. The collection remains virtual until it is made real by the execution of a task that requires it as input. At that point, if no physical instance of a collection satisfying the selection criteria exists, one will be created by applying the selection criteria to the whole event sample. Another example is a virtual collection obtained by reclustering an event collection using, for example, sparse indexes on the events.
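As a crude illustration of this idea, the following Python sketch shows how a request for a virtual collection might be resolved. The names used here (catalogue, selection, event_sample) are invented for illustration and are not part of any real CMS or Grid interface.

```python
# Illustrative sketch only: the catalogue, selection and event_sample interfaces
# used here are invented names, not part of any real CMS or Grid API.

def resolve_collection(catalogue, selection, event_sample):
    """Return a physical collection satisfying 'selection'.

    If a physical instance is already registered in the catalogue it is reused;
    otherwise the selection is executed over the whole event sample and the
    result is registered, i.e. the virtual collection becomes real on first use.
    """
    existing = catalogue.lookup(selection)
    if existing is not None:
        return existing  # a physical instance already exists somewhere

    # No physical instance: materialise the collection by running the selection.
    new_collection = [event for event in event_sample if selection.accepts(event)]
    catalogue.register(selection, new_collection)
    return new_collection
```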
Commonly required event collections are likely to be already instantiated at various locations in the data analysis system. An example of a commonly required event collection for CMS is the set of events that have one or more isolated leptons.
Before describing the data analysis scenario, it is important to understand today's best guess at the composition and operation of the global data analysis system.
The system is a globally accessible ensemble of computing resources that includes
These resources are elaborated on below:
These represent the basic computing resources of the collaboration, and as such must be allocated to tasks as efficiently as possible. A target utilisation of around 70% is generally considered optimal. In some cases the resources will be shared with third-party users, and site policies will be required to ensure that the resources can be partitioned so that the collaboration obtains its agreed share. The aggregates are physically located at Tier i centres (0 ≤ i ≤ 4 in current models). The tiered centres are interconnected in a virtual Grid, practically implemented as network links.
Current models (ref. MONARC) expect and plan for WAN speeds of at least OC12 (622 Mbps), and much greater speeds in the LAN. Target network (bandwidth) utilisation of around 60% is generally considered to be optimal, and this will probably be guaranteed by QoS coupled with appropriate leasing arrangements with the infrastructure providers (ISPs).
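For a rough sense of scale, assuming these targets are met: 60% utilisation of an OC12 link corresponds to about 0.6 × 622 Mbps ≈ 370 Mbps, or roughly 45 MBytes/s, so a 50 MByte event collection (such as the TAG sample discussed below) could move between Tier Centres in about a second, while a terabyte-scale sample would occupy the link for several hours.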
The keystone of the global data analysis system is the Grid-based middleware. It comprises
and, given these, to the available resources:
With these parameters, and a knowledge of existing and queued tasks in the system, the matchmaking service will provide an estimated cost and duration for the task at each of the (zero, one or more) possible execution locations. This information is sent back to the task submitter, possibly as a set of tickets containing the authentication parameters and the necessary routing and task submission semantics for each execution location. To execute the task, the submitter must have sufficient quota for whichever execution location (s)he subsequently chooses.
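A minimal sketch of the kind of information the matchmaking service might return is given below. The class and field names are invented; nothing here is defined in this note.

```python
# Illustrative sketch only: the classes and field names below are invented; the
# note does not define the real matchmaking or ticket data structures.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Estimate:
    location: str          # candidate execution location, e.g. a Tier 2 cluster
    cost: float            # estimated cost, in whatever quota units the site uses
    duration_hours: float  # estimated duration, given existing and queued tasks

@dataclass
class Ticket:
    location: str
    auth_token: str        # authentication parameters for the chosen site
    submit_endpoint: str   # routing and task submission semantics

def choose_location(estimates: List[Estimate]) -> Optional[Estimate]:
    """Pick the lowest-cost location, as Fred does later in this scenario."""
    return min(estimates, key=lambda e: e.cost) if estimates else None
```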
The data persistency management services handle the access to and storage of the collaboration's data objects. The implementation of these services will typically be as an ODBMS located at each Tier i centre. An additional service component handles communication with the Grid-based Event Collection Catalogue, informing the catalogue of the status and availability of event collections in the ODBMS. The operation, scaling behaviour, usability and storage overheads associated with the data persistency management services have been extensively researched and tested in the HEP community, and are now well understood.
The analysis software is used by collaborators and/or production teams to derive physics results from the experimental data by analysing the properties of a collection of events. The algorithmic details of the analysis do not presently concern us; indeed, it is in any case impossible to predict an exhaustive list of possible analyses. What is highly probable is that the analysis software will use as input one or more collections of events. Each event in the collection will have different properties that depend on the underlying physics, manifested as identified particles, jets, missing energy, topology and other reconstructed quantities. These properties will be used in algorithms that typically determine whether or not the event is of interest to the investigator(s). Interesting events that survive the algorithmic selections form a new collection. The statistical properties of the new collection, and sometimes the individual events themselves, are typically summarised in the form of histograms, graphs, tables or pictures destined for inclusion in publications.
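In schematic terms, such an analysis task might look like the following Python sketch. The collection, event accessors and histogram are invented stand-ins for whatever the real framework provides.

```python
# Illustrative sketch only: the collection, event accessors and histogram used
# here are invented names standing in for the real analysis framework.

def analyse(input_collection, is_interesting, histogram):
    """Loop over an event collection, keep the interesting events, and
    summarise their properties for later plots and tables."""
    selected = []
    for event in input_collection:
        # 'is_interesting' encapsulates the cuts on leptons, jets, missing
        # energy, topology and other reconstructed quantities.
        if is_interesting(event):
            selected.append(event)
            histogram.fill(event.invariant_mass())  # one example summary quantity
    return selected  # the surviving events form a new collection
```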
The CMS experiment expects to collect around 1,000,000,000 events per year. Each event, as it emerges from the online data acquisition system, is a collection of raw data values from the various sub-detector elements of CMS, amounting to about 1 MByte of information in total. Each of these 1 MByte "RAW" events is processed by software that reconstructs the raw data into physics objects such as charged particle tracks and calorimeter energy clusters. The resulting "ESD" (Event Summary Data) amounts to roughly 100 kBytes per event. From the ESD information, a further process creates higher level physics objects such as particles, jets and missing energy, collectively known as Analysis Object Data (AOD). Each AOD event is about 10 kBytes in size. Finally, each of the AOD events is summarised into a small set of salient characteristic quantities known as a TAG. The TAG events are approximately 100 Bytes in size.
In summary, per year: roughly 1 PByte of RAW data, 100 TBytes of ESD, 10 TBytes of AOD and 100 GBytes of TAG data.
The events are organised into "collections": basically lists of pointers to TAG or AOD events that fall into a specific category. A group of collaborators working on a particular physics channel will define for themselves a collection of events that are believed to be in this channel. We expect there to be at least 20 active mainstream collections of this type, and potentially hundreds of smaller collections that belong to individual physicists.
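One possible way to picture such a collection, purely as an illustration with invented names (a real collection would be stored in the ODBMS, not as Python objects), is sketched below.

```python
# Illustrative sketch only: these class and field names are invented; a real
# collection would live inside the ODBMS rather than in Python objects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EventRef:
    database: str   # which Tier Centre / database federation holds the event
    object_id: int  # identifier of the TAG (or AOD) object within it

@dataclass
class EventCollection:
    name: str                  # e.g. a physics-channel collection or a private one
    owner: str                 # physics group or individual physicist
    refs: List[EventRef] = field(default_factory=list)  # pointers, not event copies

    def __len__(self) -> int:
        return len(self.refs)
```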
The global data analysis system is implemented on a Grid of computing resources located at a set of collaborating institutes and laboratories. The institutes and laboratories are arranged in a tiered hierarchy, with CERN (the location of the experiment) as the only Tier 0 Centre, a set of five or so Tier 1 Centres (typically national laboratories such as FermiLab), 25 or so Tier 2 Centres (typically universities with substantial compute resources dedicated to the experiment), and a hundred or so Tier 3 Centres, which are typically department-sized installations belonging to HEP groups in universities. The final (Tier 4) level corresponds to individual desktop computers.
The data are positioned at the Tier Centres as follows:
Typical tasks foreseen include
We first consider the use of the global system by an individual end user called Fred, working at Caltech (a Tier 2 centre).
Fred is new to the CMS collaboration and wishes to get his feet wet by looking at all the Higgs -> two photon events taken so far by the experiment. He has developed a neural network algorithm that uses AOD and ESD data to eliminate the mis-identification of fast pi0 particles as prompt photons. He wants to apply this algorithm to every two photon event, accept or reject the event, and then make a private collection of the accepted events for further study.
After two years of running, the Higgs -> two photon sample is expected to comprise around 500,000 events. This sample is a standard CMS collection, and the collection is available at all the Tier Centres. The size of the collection is approximately 50 MBytes (500,000 TAG events).
One problem that Fred has not foreseen is that he uses ESD in his neural network algorithm. Unlike the TAG and AOD data, the ESD is too copious to be stored at all Tier Centres. Fred's job will fail as soon as it attempts to follow an association from an AOD event to an ESD event, if the ESD event is unavailable at the Tier Centre at which his job is running.
Fred prepares an analysis task using a CMS-supplied template. This template task contains the following steps:
Fred modifies the template so that his neural network module is called for each event. He then builds the completed application on his desktop machine, a 3 GHz Pentium VI running Blue Hat Linux 2.5.1/a. He then stores the executable in his home directory on the Caltech Tier 2 server.
Now, he starts up a Web-based form wizard that will guide him through the actual task submission. The wizard first helps him prepare some job control language (JCL) that will encapsulate his task submission to the global system. The JCL specifies the following:
The input data collection selection could also have named a collection Fred is aware of, e.g. <input type="Collection">CMS Higgs Candidates (all channels)</input>
Satisfied with the JCL, and still guided by the wizard, Fred requests that the task be submitted to the matchmaking service.
The wizard opens a connection to the matchmaking service Web server, and sends Fred's encrypted proxy id as authentication. The matchmaking service (perhaps executing on a server at FermiLab) identifies him, and then goes through the following steps:
The matchmaking server returns a list of the eligible matched resources in order of ascending cost.
Fred is now in a position to decide whether or not to submit his task, and to which resource. He selects the cheapest matched resource, and the wizard informs the matchmaking service of his choice. Now the matchmaking service goes through the next steps:
Fred is now in possession of a unique task ticket that will enable him to monitor the progress of the task, and to cancel or suspend it as necessary.
At some point later, the task begins execution on the assigned resource, a Linux cluster at Caltech. It attempts to open the input data collection. This requires contacting the local data persistency management services. The services will be interrogated for the existence of the required data collection. There are several possible scenarios:
In case 1, the task begins execution normally. In case 2, if the task and persistency mechanism support streaming shared access, then the task can begin execution normally, otherwise a timeout or completion polling mechanism is required. In case 3, timeout or completion polling is required. In case 4 the task must abort.
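The timeout or completion-polling behaviour might look roughly like the following sketch. The persistency interface and its status values are invented; they simply mirror the outcomes described above.

```python
# Illustrative sketch only: persistency.status() and its return values are
# invented; the note does not specify the real polling interface.
import time

def wait_for_collection(persistency, collection_name, timeout_s=3600, poll_s=60):
    """Poll the local data persistency management services until the required
    collection can be opened, or give up after 'timeout_s' seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = persistency.status(collection_name)
        if status == "available":
            return True   # the collection can be opened; the task proceeds
        if status == "impossible":
            return False  # the collection cannot be provided here; the task must abort
        time.sleep(poll_s)  # still being produced elsewhere: poll again later
    return False  # timed out
```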
Note that the local data persistency management services are also called upon to manage any new collections created by the task. These services will in turn inform the Event Collection Catalogue of the existence of Fred's output collection. As specified by the JCL, this collection will be marked as "private" and not visible to other normally privileged users.
In Fred's task, the iterator over all events in the two photon collection extracts each TAG object one at a time from the database. The TAG event object contains an association to the corresponding AOD event object. The task pulls the AOD object in turn from the database. The TAG event object also contains an association to the corresponding ESD event object. The task requires the ESD data as well, and attempts to pull this object from the database.
However, this fails, because the Caltech database does not contain the ESD: the association cannot be traversed. Fred's task fails with an informative error message indicating that ESD is not available for one or more events.
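Schematically, the failing traversal might look like this; the event accessors and the exception type are invented names that mirror the behaviour described above.

```python
# Illustrative sketch only: the event accessors and the AssociationError
# exception are invented names describing the behaviour in the text above.

class AssociationError(Exception):
    """Raised when an object association cannot be traversed in the local database."""

def process_event(tag_event, neural_net):
    aod_event = tag_event.aod()      # follow the TAG -> AOD association
    try:
        esd_event = tag_event.esd()  # follow the TAG -> ESD association
    except AssociationError:
        # The ESD is not resident in the local (Caltech) database, so the
        # association cannot be traversed and the task reports the failure.
        raise RuntimeError("ESD not available for one or more events at this Tier Centre")
    return neural_net.accepts(aod_event, esd_event)
```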
The use of the ESD is a special requirement that can only be met at a few Tier Centres. Fred must now return to the wizard to develop the necessary JCL indicating the ESD requirement. Once this is done, the task description can again be sent to the matchmaking service for assignment.
Julian Bunn, March 2000