Development Platform & Status Quo
As of November 2021, all data accessible via the website are still held in an in-house solution based on MongoDB and GridFS. This system is designed for a single site (KIS) and physically consists of two computers with a total mirrored net capacity of 150 TB. It has served as a development platform for data curation, for grouping and injecting data from the OT, and as a backend for the existing search engine. Embargoes, programming interfaces to Python or IDL, and distributed data storage at multiple sites are not supported.
Scientific Requirements & Rationale
The next generation of instruments at the OT will achieve data rates in the range of 70 TB per day. With several such instruments from different partners, and in anticipation of EST, where the expected data rate is closer to 1-2 PB per day, the limitation to one site is no longer adequate. A data volume of this size requires that partial data holdings be kept flexibly at different locations, with guaranteed redundancies and lifetimes.
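To put these rates in perspective, the following minimal sketch converts a daily data volume into the average network rate needed to move it within one day, and into the time it would take to fill the existing 150 TB system. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even data flow; the function names are illustrative only.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes
PB = 10**15  # assumption: decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def sustained_gbit_per_s(bytes_per_day: float) -> float:
    """Average rate in Gbit/s needed to move one day's data within one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


def days_to_fill(capacity_bytes: float, bytes_per_day: float) -> float:
    """Days until a store of the given capacity is full at a constant rate."""
    return capacity_bytes / bytes_per_day


if __name__ == "__main__":
    # Figures taken from the text: 70 TB/day (OT), 1-2 PB/day (EST), 150 TB store.
    for label, rate in [("OT, 70 TB/day", 70 * TB),
                        ("EST, 1 PB/day", 1 * PB),
                        ("EST, 2 PB/day", 2 * PB)]:
        print(f"{label}: {sustained_gbit_per_s(rate):.1f} Gbit/s sustained, "
              f"150 TB filled in {days_to_fill(150 * TB, rate):.2f} days")
```

Under these assumptions a single 70 TB/day instrument already needs about 6.5 Gbit/s sustained and would fill the current system in roughly two days, which illustrates why a single-site solution does not scale.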
The injection of data from the different instruments, their initial calibration, their (redundant) distribution to various locations, and the generation of standard products must be automated.
Ideally, calculations should be performed on these data sets close to the data storage locations. If this is not possible, the transport of the data to the analysis location should be transparent to the user and should take into account the available resources, the bandwidth, and the costs incurred.
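The placement logic described here could be sketched as follows. This is a hypothetical illustration, not part of any existing system: the `Site` fields, the cost unit, and the `choose_site` policy (compute next to the data if possible, otherwise pick the cheapest site reachable within a time limit) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Site:
    name: str
    holds_replica: bool      # a copy of the data set is already stored here
    has_free_compute: bool   # analysis jobs can currently run here
    link_gbit_per_s: float   # assumed usable bandwidth towards this site
    cost_per_tb: float       # hypothetical transfer cost, arbitrary units


def transfer_hours(size_tb: float, gbit_per_s: float) -> float:
    """Idealised transfer time for size_tb decimal terabytes."""
    return size_tb * 8e12 / (gbit_per_s * 1e9) / 3600


def choose_site(sites: Sequence[Site], size_tb: float,
                max_hours: float) -> Optional[Tuple[Site, float]]:
    """Return (site, transfer time in hours), or None if no site qualifies."""
    # First choice: compute where the data already are, so nothing is moved.
    for site in sites:
        if site.holds_replica and site.has_free_compute:
            return site, 0.0
    # Otherwise: cheapest site that can receive the data within the time limit.
    candidates = [s for s in sites
                  if s.has_free_compute
                  and transfer_hours(size_tb, s.link_gbit_per_s) <= max_hours]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: s.cost_per_tb)
    return best, transfer_hours(size_tb, best.link_gbit_per_s)
```

A real scheduler would also have to consider queue lengths, competing transfers, and storage quotas; the sketch only shows how resources, bandwidth, and cost can enter one decision that stays hidden from the user.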
Rucio & dCache
A possible product that meets the requirements described above, and that has proven it can handle data volumes beyond these rates, is Rucio. Rucio is a scientific data management system developed at CERN for experiments at the LHC. For the ATLAS experiment alone, the data currently stored in this system (as of November 2021) amount to 450 PB distributed across 120 sites.
Rucio is also being adopted increasingly in astronomy (ESCAPE, SKA, ...) and is used for a growing number of projects with data volumes in the range described above. One obstacle to its use in astronomy, however, is that Rucio does not inherently support data embargoes, which are common in all areas of astronomy. For data from the OT, for example, an embargo period of at least one year after observation is planned. In addition to free access to data no longer under embargo, this requires user authentication and authorisation across all access routes (see below).
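The embargo rule itself is simple to state in code. The sketch below is not a Rucio feature and not the implemented solution; it only illustrates the intended behaviour under the assumption of a fixed one-year embargo counted from the observation date. The function names and the `user_is_authorised` flag are hypothetical.

```python
from datetime import date

EMBARGO_YEARS = 1  # assumption: the minimum embargo period named in the text


def embargo_end(observed: date, years: int = EMBARGO_YEARS) -> date:
    """First day on which the data may be accessed without authorisation."""
    try:
        return observed.replace(year=observed.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year: release on 1 March.
        return date(observed.year + years, 3, 1)


def may_access(observed: date, today: date, user_is_authorised: bool) -> bool:
    """Public once the embargo has ended; before that, authorised users only."""
    return today >= embargo_end(observed) or user_is_authorised
```

For example, data observed on 1 November 2021 would remain restricted to authorised users up to and including 31 October 2022. Every access path to the data would have to apply the same check for the embargo to be effective.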
Rucio can handle a wide variety of storage systems and transport mechanisms commonly used in the scientific environment. We have chosen storage based on dCache, a free, open-source product that also originates from particle physics and is under active development there.