Precise data organisation facilitates work during the collection and processing of research data, data exchange and collaborative work, e.g. in cooperative projects. It makes sense to establish naming conventions for folders and files early on, especially if several parties are involved in a project.
A hierarchical structure is suitable for storing research data. Think of meaningful categories, e.g. by subproject, time period, file format or file content, and arrange them hierarchically. Folder names should be self-explanatory.
The file name should describe the content concisely and help to identify the data unambiguously. Information such as date, title, place of collection, project name or a version number are suitable for this purpose. For example, a file name could be structured as follows: YYMMDD_Title_Editor_Version.
The following rules should be observed when naming files:
- File names should be as concise as possible but still descriptive.
- Special characters, spaces, punctuation marks or umlauts should not be used.
- Instead, capital letters and underscores should be used.
- Naming should be kept consistent, as capital letters affect sorting.
- Date formats should be given in the form YYMMDD.
- If numbers are given, they should always be two or three digits (e.g. Interview01 instead of Interview1).
- If different versions of a file are stored, a V with corresponding numbering (e.g. V01) should be given.
- Repetitions of information from folder names should be avoided in the file names.
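The rules above can be bundled into a small script. The following is a minimal sketch, assuming the YYMMDD_Title_Editor_Version pattern from above; the function name and exact cleaning rules are our own choice, not part of any fixed standard:

```python
from datetime import date

def build_file_name(title, editor, version, on=None):
    """Compose a name of the form YYMMDD_Title_Editor_Version.

    Hypothetical helper following the naming rules above.
    """
    stamp = (on or date.today()).strftime("%y%m%d")  # date as YYMMDD

    def clean(part):
        # Replace spaces with underscores; keep only ASCII letters,
        # digits and underscores (no special characters or umlauts)
        part = part.replace(" ", "_")
        return "".join(c for c in part if c == "_" or (c.isascii() and c.isalnum()))

    # Numbers zero-padded, version marked with a leading V
    return f"{stamp}_{clean(title)}_{clean(editor)}_V{version:02d}"

print(build_file_name("Interview Transcript", "Mueller", 1, on=date(2024, 3, 5)))
# → 240305_Interview_Transcript_Mueller_V01
```

Generating names programmatically like this also guarantees consistency when several parties contribute files to the same project.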
It is often useful to keep previous states of files and to work with file versions in order to be able to trace development stages and changes. This is especially true when several people are working on the same file. Versions that are no longer needed should be deleted where appropriate.
A distinction should be made between manual and automatic procedures. A simple and clear manual method is to indicate the version directly in the file name, e.g. in the form "V01". Alternatively, the information can be stored in a standardised header within the file itself.
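The manual "V01" convention can even be partly automated. Below is a minimal sketch; the helper name is ours, and it assumes the version marker occurs exactly once in the name:

```python
import re

def bump_version(file_name):
    """Increment the V## marker in a file name,
    e.g. Report_V01.txt -> Report_V02.txt.

    Illustrative sketch of the manual "V01" convention described above.
    """
    new_name, count = re.subn(
        r"V(\d{2,})",                              # V followed by 2+ digits
        lambda m: f"V{int(m.group(1)) + 1:02d}",   # increment, keep zero-padding
        file_name,
        count=1,
    )
    if count == 0:
        raise ValueError(f"no V## version marker found in {file_name!r}")
    return new_name

print(bump_version("240305_Interview_Mueller_V01.txt"))
# → 240305_Interview_Mueller_V02.txt
```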
In addition, there is specific software for version management, the use of which is particularly worthwhile for large projects that are stored centrally on a server. Widely used systems are Git and Subversion. For members of the University of Bamberg, the IT-Service provides GitLab for administration and versioning.
Documentation and metadata
Comprehensible documentation and description with metadata is indispensable for the publication and subsequent use of research data. This applies not only to re-use by third parties, but also to future use by the data creator themselves.
Research data are usually not self-explanatory but require additional information: the metadata. Typical metadata include information such as author or title, as well as information about the context in which the data were created, data cleaning measures, etc. Research data are often described with metadata only shortly before publication or archiving. However, a structured description offers added value earlier in the research process.
Without documentation, information can be lost over time, so that data can no longer be interpreted or understood because contextual information is missing. In addition, confusion can arise between different versions of files. Documentation is often the only form of communication between data creator and data user, which is why it should be as comprehensive as possible. Furthermore, good documentation increases the findability of research data, as search engines index metadata rather than the content of the data itself.
It is recommended to document at least the following information:
- Title of the data publication
- Producers, authors, rights holders
- Institution and project
- Year or period of creation
- Abstract/description of the data
- Structure of the data and their relationships to each other: how are the data organised and what do they contain? In the case of several data sets: how do they belong together, and which data are needed to interpret the others?
- Method/data collection
- Measures for data cleaning or weighting
- Explanations for codes and labels (codebook)
- Version/version changes
- Reference to related publications that describe/evaluate the data set
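A machine-readable record covering these fields can be as simple as a JSON file stored next to the data. The key names and all values below are illustrative placeholders of our own choosing; check the metadata standard required by your discipline or target repository before settling on a schema:

```python
import json

# Illustrative metadata record covering the recommended fields above;
# key names and values are placeholders, not a fixed standard.
metadata = {
    "title": "Example interview study",
    "creators": ["Mueller, A.", "Schmidt, B."],
    "institution": "University of Bamberg",
    "project": "Example Project",
    "created": "2023-2024",
    "description": "Transcribed semi-structured interviews",
    "structure": "One TXT file per interview; codebook.csv explains all codes",
    "method": "Semi-structured interviews",
    "processing": "Anonymised; filler words removed",
    "version": "V02",
    "related_publications": ["placeholder reference"],
}

# Store the record in a machine-readable open format next to the data
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
```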
In principle, it must be decided individually for each project which type of documentation is most suitable. In any case, it makes sense to have documentation that is both human- and machine-readable. If possible, machine-produced metadata, which may arise during the creation of the data, should be read and saved.
Documentation can be provided in different formats:
- in a README file
- in an (electronic) lab book
- in a project-internal wiki
- within the folder structure and file naming
- in the file itself or in the file's meta information.
Well-designed and documented metadata play a central role in finding, searching and using research data. Therefore, think about data documentation at an early stage and pay attention to the requirements of metadata standards relevant to your subject and those of a repository suitable for publication at a later date.
Basic considerations that you can already make during planning or in the ongoing project are:
- Identify relevant metadata: What information is needed to track the data? What search and filter options would you like to have for the data?
- Determine the data collection process: At what point in time and in what form is the identified information available? Can it be generated automatically? What form of documentation is suitable for the ongoing research process? How can the metadata be meaningfully linked to the research data? Are there tools available for this?
- Determine the metadata format: How can the metadata be stored in the most structured way possible? Are there controlled vocabularies or ontologies? Where should the data be stored/published after project completion? Are there specific requirements of the repository or data archive intended for publication/archiving?
- Test and improve the process: Is (partial) automation of the documentation possible?
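One common way to partially automate documentation is to generate a file manifest from technical metadata that already exists on disk. The sketch below is one possible approach under our own assumptions; the function name, column names and CSV layout are illustrative, not a prescribed format:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(data_dir, manifest="manifest.csv"):
    """Collect technical metadata (name, size, modification date) for
    every file in data_dir into a CSV manifest.

    Illustrative sketch of partially automated documentation;
    the column names are our own choice.
    """
    rows = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            stat = path.stat()
            rows.append({
                "file": str(path.relative_to(data_dir)),
                "size_bytes": stat.st_size,
                # modification date in YYMMDD form, matching the
                # date convention recommended above
                "modified": datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc
                ).strftime("%y%m%d"),
            })
    with open(manifest, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "size_bytes", "modified"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Run periodically (or in a build step), such a manifest records at least the existence, size and age of every file without any manual effort; descriptive metadata still has to be added by hand.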
Storing research data is an essential aspect of research data management (RDM). To prevent data loss, it makes sense to think about the storage location, the storage medium and a backup strategy.
Different storage locations have different advantages and disadvantages:
Own PC
- Responsibility for security and backup lies with oneself
- maximum control
- If the PC and its backup are kept together, data cannot be recovered if both are lost or damaged
- Difficult for cooperative work
Mobile storage media
- Easy to transport
- Can be stored in a lockable cabinet or safe
- Insecure against loss and theft
- Contents must be encrypted separately
- External hard drives are susceptible to shock and wear and tear
Institutional storage locations
- regular backup
- professional implementation and maintenance
- Consideration of the institution’s data protection guidelines
- Access speed may be limited
- Backup access may be delayed
- Security criteria may not always be transparent
External storage locations
- Easy to use and manage
- Backup available
- Easy to use for mobile work
- Professional implementation and maintenance
- Data protection issues often unresolved
- Security of the connection varies depending on the provider
- Dependence on internet connection
- Backup may be delayed
Where possible, research data should be stored in open file formats in addition to the original format to facilitate access to the information for subsequent users. Many file formats can be converted to open formats with little effort. In addition, open file formats allow archiving beyond the lifetime of specific software. According to good scientific practice, research data should be stored for at least 10 years. The following formats are suitable for this purpose:
- Tabular data: CSV, SPSS portable
- Text: TXT, HTML, PDF/A
- Audio/video: MP4, WAV, AVI
- Images: TIFF, JPEG2000, PNG
- Structured data: XML, RDF, JSON
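As a small illustration of how little software open formats require, the following sketch writes the same (hypothetical) records to both CSV and JSON using only Python's standard library; the field names and values are placeholders of our own choosing:

```python
import csv
import json

# Hypothetical example records; field names are illustrative
records = [
    {"id": "01", "date": "240305", "place": "Bamberg"},
    {"id": "02", "date": "240312", "place": "Bamberg"},
]

# CSV: an open format for tabular data
with open("records.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "date", "place"])
    writer.writeheader()
    writer.writerows(records)

# JSON: an open format for structured data
with open("records.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```

Both output files remain readable with any text editor, independent of the software that produced the original data.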