Although the Working Group concentrates on access to solar and heliospheric data, the rules
have been expressed as generically as possible and they have relevance to any
archive and any VO – we urge data providers to follow them as far as possible.
Although those in the second group are also in the province of the providers,
following simple rules can make a lot of difference as to how easily the
required observations can be found by the VO and supplied to the scientist.
As much data as practical should be made available.
From an analysis standpoint, a "regular cadence" of at least six observations
per hour is desirable; this would make it possible to track the general evolution of phenomena,
although rapid changes would be missed.
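As a rough illustration, a provider could check a list of observation times against this cadence. The sketch below is not part of the guidelines: the function name meets_cadence is hypothetical, and it assumes that "six per hour" can be read as "no gap between consecutive observations longer than ten minutes".

```python
from datetime import datetime, timedelta

def meets_cadence(times, min_per_hour=6):
    """Return True if no gap between consecutive observations exceeds
    one hour divided by min_per_hour (10 minutes for the default of 6).

    Assumption: a "regular cadence" is interpreted as a bound on the
    largest gap. An empty or single-element list trivially passes.
    """
    max_gap = timedelta(hours=1) / min_per_hour
    ordered = sorted(times)
    return all(later - earlier <= max_gap
               for earlier, later in zip(ordered, ordered[1:]))

# Usage: seven observations spaced exactly ten minutes apart.
start = datetime(2011, 1, 5, 8, 0)
regular = [start + timedelta(minutes=10 * i) for i in range(7)]
print(meets_cadence(regular))
```

A stricter check might also count the observations in every clock hour; the gap-based test is simply the shortest way to express the idea.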
Access Method:
The protocol used for the interface into a data archive is not critical –
a virtual observatory should be able to handle whatever protocol the data provider adopts.
Not all data providers can provide the same level of support – in this context,
EGSO developed the concept of resource-rich and resource-poor providers:
- Resource-rich providers – e.g. data centres – should be able to respond to requests through a simple interface.
For resource-rich providers, how the data are stored is an internal issue; catalogues can be used to determine the exact access path, etc.
- For resource-poor providers, if the VO needs to find the data by itself, logically named files
within a hierarchical directory structure are desirable – see below.
Standard access options include FTP, HTTP, Web Services, etc.
– potentially the first two require the least effort by the provider.
File Formats:
As the volume of data available increases, and the number of data sets grows,
it is becoming increasingly important that the data be ready for use
– i.e. calibrated – although this is by no means obligatory.
A virtual observatory should be able to support the use of data in any format
although some file formats are more useful than others.
For quick-look purposes simple image files are adequate – e.g. JPEG, PNG, GIF, etc.
– but the lack of metadata associated with such formats makes it difficult
to use them for serious research.
If the objective is to compare data from different instruments, files with formats that
can contain fully formed metadata are strongly preferred –
e.g. FITS, CDF or equivalent.
If the data in a file are not processed to a high level,
then appropriate software and calibration files must be provided if the data need
to be "manipulated" before use.
File Names & Metadata:
There are no hard and fast rules on file names, but the
name needs to be distinctive enough that:
- The type and origin of the file can easily be identified, and
- It can exist without causing confusion when removed from the context of where it is normally
stored (on the source archive system)
Ideally the name should identify the "date & time" that the observations were made
and the "observatory & instrument" that made them –
an indication of the type of observation can also be useful.
The "date & time" need not be a full specification; some kind of
sequential numbering might be sufficient.
However, if file naming is not based on time, a catalogue or simple
translation table is needed to allow the VO to select the appropriate file.
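To make this concrete, the sketch below builds and parses a time-based file name. The pattern observatory_instrument_type_yyyymmdd_hhmmss.ext is purely hypothetical, chosen for illustration; it is not the SOHO convention, and the function names build_name and parse_name are likewise assumptions.

```python
import re
from datetime import datetime

# Hypothetical pattern: observatory_instrument_type_yyyymmdd_hhmmss.ext
# Each of the first three fields must be free of underscores.
NAME_RE = re.compile(
    r"^(?P<observatory>[a-z0-9]+)_(?P<instrument>[a-z0-9]+)"
    r"_(?P<obstype>[a-z0-9]+)_(?P<date>\d{8})_(?P<time>\d{6})"
    r"\.(?P<ext>[a-z0-9]+)$"
)

def build_name(observatory, instrument, obstype, when, ext="fits"):
    """Compose a lower-case file name from the identifying fields."""
    return f"{observatory}_{instrument}_{obstype}_{when:%Y%m%d_%H%M%S}.{ext}".lower()

def parse_name(name):
    """Return the fields encoded in a name, or None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        stamp = fields.pop("date") + fields.pop("time")
        fields["when"] = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    except ValueError:  # digits present but not a valid date or time
        return None
    return fields

# Usage with placeholder observatory and instrument names.
name = build_name("MyObs", "Imager", "Halpha", datetime(2011, 1, 5, 8, 30, 0))
print(name)
print(parse_name(name))
```

Because the name carries its own date, origin and type, a file of this kind remains identifiable after it has been copied out of the archive, and a VO can select files by time without consulting a translation table.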
The SOHO mission developed a "convention" for the names of files in its summary and synoptic databases
– see Naming Convention for Files (SOHO with BBSO extensions).
A simpler convention might be sufficient, but this provides a gold standard for how things can be done.
Note that the information contained in the file name is not enough when the data are to be used for analysis;
it is essential that all files contain good metadata describing in detail how the observations were made.
It is also important that the metadata are properly formed
– if they are not it may be impossible to use the data in some circumstances.
Again a "convention" was established during the time of SOHO
– see Solarsoft Standard.
Directory Structure within the Archive:
A hierarchical structure to the data directories makes it easier to find files and is strongly preferred.
This is essential for resource-poor providers and is also beneficial for a data centre.
Ideally the directory structure should be a tree based on date (and time?):
yyyy/mm/dd
yyyy/mm
yyyy_week/
yyyy/
...
The number of directory levels really depends on the number of files generated by the instrument.
If only one file is produced per day, the number of levels of subdirectories can be reduced.
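A date-based path of this kind is straightforward to compute. The sketch below is illustrative only: the function name archive_dir, the root directory "archive" and the levels parameter are assumptions, with levels controlling how deep the tree goes (3 for yyyy/mm/dd, 2 for yyyy/mm, 1 for yyyy).

```python
from datetime import datetime
from pathlib import PurePosixPath

def archive_dir(when, levels=3, root="archive"):
    """Return the directory for an observation time.

    levels=3 gives root/yyyy/mm/dd, levels=2 gives root/yyyy/mm,
    and levels=1 gives root/yyyy.
    """
    if levels not in (1, 2, 3):
        raise ValueError("levels must be 1, 2 or 3")
    parts = [f"{when:%Y}", f"{when:%m}", f"{when:%d}"][:levels]
    return PurePosixPath(root, *parts)

# Usage: a high-volume instrument might use all three levels,
# a one-file-per-day instrument only the year.
obs_time = datetime(2011, 1, 5, 8, 30)
print(archive_dir(obs_time))
print(archive_dir(obs_time, levels=1))
```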
On Unix-based archives, if the directory structure is different from the one suggested above,
it is possible to map to a more compliant structure using symbolic links without having to
reorder the data themselves. The mapped directory structure can then be presented to the
external interface.
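The symbolic-link mapping can be sketched as follows. This is a minimal example under stated assumptions: the function name link_by_date is hypothetical, and the caller is assumed to supply the observation date of each file (for instance from a catalogue or from the file name).

```python
from datetime import date
from pathlib import Path

def link_by_date(files_with_dates, view_root):
    """Create view_root/yyyy/mm/dd/<name> symbolic links to existing files.

    files_with_dates is an iterable of (path, date) pairs. The original
    files are left where they are; only links are created.
    """
    view_root = Path(view_root)
    links = []
    for source, day in files_with_dates:
        source = Path(source).resolve()
        target_dir = view_root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / source.name
        # Skip links that already exist so the function can be re-run.
        if not link.is_symlink() and not link.exists():
            link.symlink_to(source)
        links.append(link)
    return links
```

The directory given as view_root, rather than the original storage area, would then be the one exposed through FTP or HTTP.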
Summary of Observations:
It can greatly simplify access if the archive maintains a summary of the
observations that have been made.
If an observing log is available, a VO can determine what observations are
available without needing to search the archive directories looking for files.
- The observing log should contain the minimum information
that is required in the file metadata,
although repeated information (such as the observatory
name and location?) could be abstracted into a header section.
- The observing log should also contain information explaining gaps
in data coverage caused by operational reasons, bad weather, etc.
(?? assume that day/night and radiation belts can be calculated?; eclipse season?)
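One possible shape for such a log is sketched below. The layout and every field name ("header", "observations", "gaps", and so on) are assumptions made for illustration, as are the observatory, instrument and file names; no particular log format is prescribed here.

```python
import json

observing_log = {
    # Information repeated for every observation is stored once.
    "header": {
        "observatory": "Example Observatory",
        "instrument": "example-imager",
    },
    "observations": [
        {"start": "2011-01-05T08:00:00", "end": "2011-01-05T08:00:10",
         "file": "2011/01/05/img_080000.fits"},
        {"start": "2011-01-05T08:10:00", "end": "2011-01-05T08:10:10",
         "file": "2011/01/05/img_081000.fits"},
    ],
    # Gaps in coverage are recorded together with their cause.
    "gaps": [
        {"start": "2011-01-05T08:20:00", "end": "2011-01-05T11:00:00",
         "reason": "bad weather"},
    ],
}

def expand(log):
    """Yield one complete record per observation by merging in the header."""
    for observation in log["observations"]:
        yield {**log["header"], **observation}

records = list(expand(observing_log))
print(json.dumps(records[0], indent=2))
```

Keeping the shared values in a header keeps the log compact, while expand recovers a full record for each observation when a VO needs one.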
If it is not possible for an archive to hold all its observations on-line,
the observing log can be used by a VO to identify that suitable observations
have been made – a request for the required data can then be generated.
This route could also be used to advertise the existence of proprietary
data so that other users at least know that the observations exist.
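A VO could use log entries of this kind to decide which files to fetch directly and which to ask the provider for. The sketch below assumes each entry carries hypothetical "online" and "proprietary" flags; neither these field names nor the function name plan_request come from an established format.

```python
from datetime import datetime

def plan_request(entries, start, end):
    """Sort log entries overlapping [start, end) into two lists.

    "download" holds files that are on-line and not proprietary;
    "request" holds files that must be asked for from the provider.
    """
    plan = {"download": [], "request": []}
    for entry in entries:
        if entry["start"] < end and entry["end"] > start:
            direct = entry["online"] and not entry["proprietary"]
            plan["download" if direct else "request"].append(entry["file"])
    return plan

# Usage with three illustrative entries.
entries = [
    {"file": "a.fits", "start": datetime(2011, 1, 5, 8, 0),
     "end": datetime(2011, 1, 5, 8, 10), "online": True, "proprietary": False},
    {"file": "b.fits", "start": datetime(2011, 1, 5, 8, 10),
     "end": datetime(2011, 1, 5, 8, 20), "online": False, "proprietary": False},
    {"file": "c.fits", "start": datetime(2011, 1, 5, 9, 0),
     "end": datetime(2011, 1, 5, 9, 10), "online": True, "proprietary": True},
]
plan = plan_request(entries, datetime(2011, 1, 5, 8, 0), datetime(2011, 1, 5, 8, 30))
print(plan)
```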
Revised Jan 2011, RDB