Folks,

Thanks a lot for coming to the informal meeting today! I think it was very useful to share ideas and discuss the various conceptual issues.

I thought I'd try to summarize some of the points that were made during the streaming discussion today. Perhaps some of the points below are biased or misrepresented, but I thought it would be useful to at least write a document that summarizes my perception. Please comment on what I wrote, especially on the parts you disagree with or on anything that I have unintentionally missed.

I tried to concentrate on conceptual design issues, leaving implementation details for later discussion. The latter will eventually become very important, but in my view we really need to define the physics requirements and outline the conceptual design first.

Greg

----------------------------------------------------------------

PHYSICS REQUIREMENTS:

There are two main requirements from the point of view of physics analysis:

1) Cross-section-conscious analyses ("precision" in what follows) should have a way to obtain the EXACT luminosity for the final samples used in the analyses, provided that they follow certain production rules imposed by the streaming design. It is foreseen that, in order to satisfy this requirement, the user might be asked to remove a certain (small) fraction of the events from the final sample.
2) Luminosity-conscious analyses ("searches" in what follows) should have a way to obtain REASONABLY precise luminosity information without removing ANY events from the final samples. Here, reasonable precision means that the additional luminosity error due to lost luminosity blocks is much less than the nominal error on the luminosity, i.e. less than about 1%.

Additionally, users would like to have the ability to request the following:

a) Luminosity for a specific range of runs and a specific trigger (again, precision and search analyses will get different types of luminosity information).
b) An incremental list of events that satisfy the data stream definition used in the analysis and that were previously unavailable (had not been processed/collected, or belonged to missing and since recovered partitions).
c) For the searches, especially the ones of high priority or with rare objects in them (e.g., high-pT leptons or photons), we would like to minimize the number of files/tapes that have to be accessed for a reprocessing at any level (RAW, DST, TMB).
d) The events/files/streams that are accessed often enough should be kept disk-resident (caching).
e) It is desirable, although not required, to have the ability to define logical streams offline, although the data access model might not be as optimized as when using the predefined streams. Nevertheless, the user should be able to get the events and the luminosity for any such logical stream.
f) Under no circumstances will a user be fed duplicate events when working with one of the standard streams. If some data is recovered, only the incremental part of the recovered files is added to the data hierarchy.

CONCEPTUAL CONSTRAINTS AND FEATURES:

Although some of the conceptual constraints below might be lifted if we find ways around them, the following constraints were discussed as desirable:

1) Each event is written only once. Although not strictly necessary, this constraint is driven by the lack of money to provide additional tapes. However, if e.g. the tape failure rate becomes problematic, this constraint might have to be lifted. For now, however, we restrict ourselves to EXCLUSIVE streaming.
2) The system should be able to work with events arriving in random order within a run, and perhaps even across runs, although where possible sorting should be implemented when merging files, to increase the efficiency of the data access and reduce the database overhead.
3) The system should support merging certain files, to reduce the SAM overhead of storing a large number of very small files.
4) The system should provide full inheritance-chain information for any given file, all the way back to the RAW data.
5) The system should support a certain number of "certified" levels of data representation, with the lowest level being TMB or, in some cases, certified r-tuples, i.e. r-tuples produced within the standard production framework, which satisfies certain important constraints. For these certified data levels, full luminosity information is available both for precision analyses and for searches. Moreover, any incremental data recovery and new collider data are automatically propagated down to the very bottom level.
6) The system should support user-defined levels of data representation, such as private r-tuples. However, since there is no real way to force all users to produce private r-tuples under the same strict rules used for certified r-tuples, the user is not guaranteed that the luminosity information for these private streams is accurate. Neither the recovered data files nor new files will be automatically propagated all the way down to the user levels. If needed, a user can run his/her code and automatically process either only the incremental changes to the stream since the last date when the code was run, or the full data stream.
7) Certain users (Physics Group representatives?) should be allowed to define their own certified data streams, automatically updated for new and recovered files, with well-defined luminosity. The code used to produce these streams should be certified and should satisfy all the constraints imposed on the rest of the production code.
8) It seems to be very convenient to keep a database of the total luminosity per run, per trigger for the entire data set, as well as a table of missing luminosity blocks and the integrated luminosity per trigger associated with these missing blocks.
Such a database could tremendously simplify luminosity calculations for both precision measurements and searches, as well as aid in studying the luminosity dependence of certain analysis parameters.
9) It seems desirable to define special physical streams for precision analyses, perhaps with more restrictive rules than for the streams primarily used for searches. In particular, the split of luminosity-sensitive triggers (W, Z, some top triggers) across several streams should be minimized, with the goal of keeping one (or less!) stream per trigger used for precision analyses.
10) For a trigger split across several physical streams, it is necessary to have a mechanism for marking luminosity blocks that are missing from at least one of these streams as "dead" or "suspicious."
11) File boundaries must coincide with luminosity block boundaries, i.e. no luminosity block for a particular physical stream is allowed to be split across two or more files.

This is more or less all I remember from the discussion. Let's think about the above items, which hopefully will help us to properly define the following concepts and the mapping between them:

- data set
- physical stream
- logical stream
- file
- tape
- disk-resident stream

I am looking forward to your comments on this (very preliminary) write-up.

Thanks,

Greg
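
For illustration only, the luminosity bookkeeping described in items 8) and 10) above, together with the two physics requirements, could be sketched roughly as follows. This is a minimal sketch, not a proposed implementation: the table layout, the function names, the trigger name "TRIG_A" and all numbers are made up here and do not describe any existing database or SAM interface. It also assumes that a "search" user simply quotes the full luminosity together with the fraction sitting in dead blocks, which is one possible reading of requirement 2).

```python
# Hypothetical per-run, per-trigger luminosity table (item 8); arbitrary units.
TOTAL_LUMI = {
    (1001, "TRIG_A"): 10.0,
    (1002, "TRIG_A"): 30.0,
    (1003, "TRIG_A"): 60.0,
}

# Hypothetical table of luminosity blocks that are missing from at least one
# physical stream carrying the trigger (item 10):
# (run, trigger, block) -> integrated luminosity of that block.
DEAD_BLOCKS = {
    (1001, "TRIG_A", 7): 0.125,
    (1003, "TRIG_A", 2): 0.375,
}


def total_luminosity(trigger, first_run, last_run):
    """Sum the recorded luminosity for a trigger over an inclusive run range."""
    return sum(
        lumi
        for (run, trig), lumi in TOTAL_LUMI.items()
        if trig == trigger and first_run <= run <= last_run
    )


def dead_luminosity(trigger, first_run, last_run):
    """Sum the luminosity of the dead blocks for a trigger over a run range."""
    return sum(
        lumi
        for (run, trig, _block), lumi in DEAD_BLOCKS.items()
        if trig == trigger and first_run <= run <= last_run
    )


def precision_luminosity(trigger, first_run, last_run):
    """Luminosity for a precision analysis.

    Only meaningful if the events from dead blocks are also removed from the
    final sample (see keep_for_precision below).
    """
    return (total_luminosity(trigger, first_run, last_run)
            - dead_luminosity(trigger, first_run, last_run))


def search_luminosity(trigger, first_run, last_run):
    """Luminosity for a search: no events are removed.

    Returns (luminosity, fraction of luminosity in dead blocks). Treating that
    fraction as the additional uncertainty is an assumption of this sketch.
    """
    total = total_luminosity(trigger, first_run, last_run)
    dead = dead_luminosity(trigger, first_run, last_run)
    return total, (dead / total if total else 0.0)


def keep_for_precision(run, trigger, block):
    """True if an event from this luminosity block may stay in a precision sample."""
    return (run, trigger, block) not in DEAD_BLOCKS
```

With the made-up numbers above, a precision analysis over runs 1001-1003 would drop the events from the two dead blocks and use the reduced luminosity, while a search would keep every event and check that the dead fraction stays below the roughly 1% quoted in requirement 2).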