Guidelines for Data Archives

Following a simple set of rules makes it easier to integrate a data archive into any virtual observatory (VO). Below are rules proposed by the IAU Div. II Working Group on International Data Access; although the Working Group concentrates on access to solar and heliospheric data, the rules have been expressed as generically as possible and they have relevance to any archive and any VO – we urge data providers to follow them as far as possible.
The rules are grouped into two groups:
Although those in the second group are also within the province of the providers, following a few simple rules can make a big difference to how easily the required observations can be found by the VO and supplied to the scientist.
As much data as practical should be made available. From an analysis standpoint, a regular cadence of at least several observations per hour (6+) is desirable; this would make it possible to track the general evolution of phenomena, although rapid changes would be missed.
This document is open for discussion and we welcome comments.
Please contact us at:
Access Method:
The protocol used for the interface into the archive is not critical – a virtual observatory should be able to handle whatever protocol the data provider adopts. Standard options include FTP, HTTP, Web Services, etc.; of these, the first two potentially require the least effort from the provider.
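As an illustration, the sketch below shows how a VO component might pull a single file from an HTTP-fronted archive; the host, path and file name are hypothetical and are only meant to show how little machinery a plain HTTP interface requires.

    # Minimal sketch of HTTP access to an archive; the URL and file name
    # below are placeholders, not a real data provider.
    from urllib.request import urlretrieve

    base_url = "http://archive.example.org/data"            # hypothetical archive root
    remote_path = "2009/03/01/example_20090301_1200.fits"   # hypothetical file

    # A VO broker would normally build this URL from catalogue metadata and
    # fetch the file on behalf of the scientist.
    local_file, headers = urlretrieve(f"{base_url}/{remote_path}",
                                      "example_20090301_1200.fits")
    print("Retrieved", local_file)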
In relation to access methods, EGSO has developed the concept of resource-rich and resource-poor providers:
File Format:
A virtual observatory should be able to accommodate the use of data in any file format. For quick-look purposes simple image files are adequate – e.g. JPEG, PNG, GIF, etc. However, the lack of metadata associated with such formats makes it difficult to use this type of file for serious research. If the objective is to compare data from different instruments, then files in formats such as FITS, CDF or equivalent are strongly preferred; these should contain fully formed metadata – see below.
Processing of the data in the file need not be to a high level, but appropriate software and calibration information must be provided if the data need to be "manipulated" before use. As the volume of data available increases, and the number of data sets grows, it is becoming increasingly important that the data be ready for use – i.e. calibrated – although this is by no means obligatory.
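To illustrate why self-describing formats are preferred, the following sketch reads a few header keywords from a FITS file using the astropy package; the file name is a placeholder, and the exact keywords available depend on the instrument team's conventions.

    # Sketch: reading metadata from a self-describing FITS file with astropy.
    # The file name is a placeholder; the keywords shown (DATE-OBS, TELESCOP,
    # INSTRUME) are common FITS keywords but the exact set is instrument-specific.
    from astropy.io import fits

    with fits.open("example_20090301_1200.fits") as hdul:
        header = hdul[0].header
        print(header.get("DATE-OBS"), header.get("TELESCOP"), header.get("INSTRUME"))
        data = hdul[0].data  # the image/array itself, ready for analysis if calibrated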
File Names & Metadata:There are no hard and fast rules on the file names but the name needs to be sufficiently unique that:
The SOHO mission developed a "convention" for the names of files in its summary and synoptic databases – see Naming Convention for Files (SOHO with BBSO extensions). A simpler convention might be sufficient, but this provides a gold standard for how things can be done.
Note that the file name on its own is not enough when the data are to be used for analysis. It is essential that all files contain good metadata describing in detail how the observations were made; if the metadata are not properly formed, it may be impossible to use the data in some circumstances. Again a "convention" was established during the time of SOHO – see Solarsoft Standard.
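The following sketch shows one way a provider might build file names that are unique per observation; it is only loosely in the spirit of the SOHO convention, and the fields, ordering and widths shown are illustrative rather than prescribed.

    # Illustrative only: building an observation-unique file name from the
    # observatory, instrument, wavelength and time. This follows the spirit of
    # the SOHO summary-data convention but is NOT that convention.
    from datetime import datetime

    def make_filename(observatory, instrument, wavelength, obs_time, ext="fits"):
        # e.g. soho_eit_00195_20090301_120013.fits
        return (f"{observatory.lower()}_{instrument.lower()}_{wavelength:05d}_"
                f"{obs_time:%Y%m%d_%H%M%S}.{ext}")

    print(make_filename("SOHO", "EIT", 195, datetime(2009, 3, 1, 12, 0, 13)))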
Directory Structure within the Archive:
A hierarchical structure to the data directories makes it easier to find files and is strongly preferred. This is essential for resource-poor providers and is also beneficial for a data centre.
Ideally the directory structure should be a tree based on date (and time?):
    yyyy/mm/dd
    yyyy/mm
    yyyy_week
    yyyy
    ...
The number of directory levels really depends on the number of files generated by the instrument. If only one file is produced per day, the number of levels of subdirectories can be reduced.
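As a sketch of the idea, the snippet below derives a date-based directory from an observation time; the archive root and the number of levels are choices left to the provider.

    # Sketch: deriving a date-based directory from an observation time. The
    # archive root and the number of levels (yyyy/mm/dd vs yyyy/mm) are the
    # provider's choice and are shown here only as examples.
    from datetime import datetime
    from pathlib import Path

    def date_directory(archive_root, obs_time, levels=3):
        parts = [f"{obs_time:%Y}", f"{obs_time:%m}", f"{obs_time:%d}"][:levels]
        return archive_root.joinpath(*parts)

    target = date_directory(Path("/archive/instrument"), datetime(2009, 3, 1))
    target.mkdir(parents=True, exist_ok=True)   # creates /archive/instrument/2009/03/01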
On Unix-based archives, if the directory structure is different from the one suggested above, it is possible to map to a more compliant structure using symbolic links without having to reorganize the data themselves. The mapped directory structure can then be presented to the external interface.
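The following sketch illustrates such a mapping on a Unix system, building a date-based view of an existing flat archive with symbolic links; it assumes, purely for illustration, that the date is embedded in each file name as yyyymmdd.

    # Sketch (Unix only): exposing an existing flat archive through a date-based
    # view using symbolic links, so the files themselves are never moved.
    # It assumes the date appears in each file name as yyyymmdd, which is an
    # assumption about the naming scheme, not a general rule.
    import re
    from pathlib import Path

    flat_root = Path("/archive/instrument/flat")     # existing, non-compliant layout
    view_root = Path("/archive/instrument/by_date")  # structure shown to the VO

    for f in flat_root.glob("*.fits"):
        m = re.search(r"(\d{4})(\d{2})(\d{2})", f.name)
        if not m:
            continue                                  # skip files without a date
        target_dir = view_root / m.group(1) / m.group(2) / m.group(3)
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / f.name
        if not link.exists():
            link.symlink_to(f)                        # symbolic link back to the original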
Note: If it is not possible to make all data available on-line, it is desirable to provide a catalogue that contains information on the other data holdings. This route (via catalogues) could also be used to advertise proprietary data so that other users at least know that the observations exist!
R.D. Bentley, UCL-MSSL
Revised March 2009