Guidelines for Data Archives

Following a simple set of rules makes it easier to integrate a data archive into any virtual observatory (VO). Below are rules proposed by the IAU Div. II Working Group on International Data Access; although the Working Group concentrates on access to solar and heliospheric data, the rules have been expressed as generically as possible and they have relevance to any archive and any VO – we urge data providers to follow them as far as possible.
The rules are grouped into two groups:
Although those in the second group are also within the province of the providers, following a few simple rules can make a big difference to how easily the required observations can be found by the VO and supplied to the scientist.
As much data as practical should be made available. From an analysis standpoint, a regular cadence of at least several observations per hour (6+) is desirable; this would make it possible to track the general evolution of phenomena, although rapid changes would be missed.
This document is open for discussion and we welcome comments.
Please contact us at:
Access Method:
The protocol used for the interface into the archive is not critical – a virtual observatory should be able to handle whatever protocol the data provider adopts. Standard options include FTP, HTTP, Web Services, etc.; of these, the first two potentially require the least effort from the provider.
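As an illustration, the sketch below shows how a VO component might pull a single file from an HTTP-fronted archive; the host, path and file name are hypothetical and are only meant to show how little machinery a plain HTTP interface requires.

    # Minimal sketch of HTTP access to an archive; the URL and file name
    # below are placeholders, not a real data provider.
    from urllib.request import urlretrieve

    base_url = "http://archive.example.org/data"            # hypothetical archive root
    remote_path = "2009/03/01/example_20090301_1200.fits"   # hypothetical file

    # A VO broker would normally build this URL from catalogue metadata and
    # fetch the file on behalf of the scientist.
    local_file, headers = urlretrieve(f"{base_url}/{remote_path}",
                                      "example_20090301_1200.fits")
    print("Retrieved", local_file)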
In relation to access methods, EGSO has developed the concept of resource-rich and resource-poor providers:
File Format:
A virtual observatory should be able to accommodate the use of data in any file format. For quick-look purposes simple image files are adequate – e.g. JPEG, PNG, GIF, etc. However, the lack of metadata associated with such formats makes it difficult to use this type of file for serious research. If the objective is to compare data from different instruments, then files in formats such as FITS, CDF or equivalent are strongly preferred; these should contain fully formed metadata – see below.
Processing of the data in the file need not be to a high level, but appropriate software and calibration information must be provided if the data need to be "manipulated" before use. As the volume of data available increases, and the number of data sets grows, it is becoming increasingly important that the data be ready for use – i.e. calibrated – although this is by no means obligatory.
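To illustrate why self-describing formats are preferred, the following sketch reads a few header keywords from a FITS file using the astropy package; the file name is a placeholder, and the exact keywords available depend on the instrument team's conventions.

    # Sketch: reading metadata from a self-describing FITS file with astropy.
    # The file name is a placeholder; the keywords shown (DATE-OBS, TELESCOP,
    # INSTRUME) are common FITS keywords but the exact set is instrument-specific.
    from astropy.io import fits

    with fits.open("example_20090301_1200.fits") as hdul:
        header = hdul[0].header
        print(header.get("DATE-OBS"), header.get("TELESCOP"), header.get("INSTRUME"))
        data = hdul[0].data  # the image/array itself, ready for analysis if calibrated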
File Names & Metadata:There are no hard and fast rules on the file names but the name needs to be sufficiently unique that:
The SOHO mission developed a "convention" for the names of files in its summary and synoptic databases – see Naming Convention for Files (SOHO with BBSO extensions). A simpler convention might be sufficient, but this provides a gold standard for how things can be done.
Note that the file name on its own is not enough when the data are to be used for analysis. It is essential that all files contain good metadata describing in detail how the observations were made; if the metadata are not properly formed, it may be impossible to use the data in some circumstances. Again a "convention" was established during the time of SOHO – see Solarsoft Standard.
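The following sketch shows one way a provider might build file names that are unique per observation; it is only loosely in the spirit of the SOHO convention, and the fields, ordering and widths shown are illustrative rather than prescribed.

    # Illustrative only: building an observation-unique file name from the
    # observatory, instrument, wavelength and time. This follows the spirit of
    # the SOHO summary-data convention but is NOT that convention.
    from datetime import datetime

    def make_filename(observatory, instrument, wavelength, obs_time, ext="fits"):
        # e.g. soho_eit_00195_20090301_120013.fits
        return (f"{observatory.lower()}_{instrument.lower()}_{wavelength:05d}_"
                f"{obs_time:%Y%m%d_%H%M%S}.{ext}")

    print(make_filename("SOHO", "EIT", 195, datetime(2009, 3, 1, 12, 0, 13)))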
Directory Structure within the Archive:
A hierarchical structure to the data directories makes it easier to find files and is strongly preferred. This is essential for resource-poor providers and is also beneficial for a data centre.
Ideally the directory structure should be a tree based on date (and time?):
    yyyy/mm/dd
    yyyy/mm
    yyyy_week
    yyyy
    ...
The number of directory levels really depends on the number of files generated by the instrument. If only one file is produced per day, the number of levels of subdirectories can be reduced.
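As a sketch of the idea, the snippet below derives a date-based directory from an observation time; the archive root and the number of levels are choices left to the provider.

    # Sketch: deriving a date-based directory from an observation time. The
    # archive root and the number of levels (yyyy/mm/dd vs yyyy/mm) are the
    # provider's choice and are shown here only as examples.
    from datetime import datetime
    from pathlib import Path

    def date_directory(archive_root, obs_time, levels=3):
        parts = [f"{obs_time:%Y}", f"{obs_time:%m}", f"{obs_time:%d}"][:levels]
        return archive_root.joinpath(*parts)

    target = date_directory(Path("/archive/instrument"), datetime(2009, 3, 1))
    target.mkdir(parents=True, exist_ok=True)   # creates /archive/instrument/2009/03/01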
On Unix-based archives, if the directory structure is different from the one suggested above, it is possible to map to a more compliant structure using symbolic links without having to reorganize the data themselves. The mapped directory structure can then be presented to the external interface.
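The following sketch illustrates such a mapping on a Unix system, building a date-based view of an existing flat archive with symbolic links; it assumes, purely for illustration, that the date is embedded in each file name as yyyymmdd.

    # Sketch (Unix only): exposing an existing flat archive through a date-based
    # view using symbolic links, so the files themselves are never moved.
    # It assumes the date appears in each file name as yyyymmdd, which is an
    # assumption about the naming scheme, not a general rule.
    import re
    from pathlib import Path

    flat_root = Path("/archive/instrument/flat")     # existing, non-compliant layout
    view_root = Path("/archive/instrument/by_date")  # structure shown to the VO

    for f in flat_root.glob("*.fits"):
        m = re.search(r"(\d{4})(\d{2})(\d{2})", f.name)
        if not m:
            continue                                  # skip files without a date
        target_dir = view_root / m.group(1) / m.group(2) / m.group(3)
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / f.name
        if not link.exists():
            link.symlink_to(f)                        # symbolic link back to the original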
Note: If it is not possible to make all data available on-line, it is desirable to provide a catalogue that contains information on the other data holdings. This route (via catalogues) could also be used to advertise proprietary data so that other users at least know that the observations exist!
R.D. Bentley, UCL-MSSL
Revised March 2009