Research Data Management in the Lab


  • Matthias Razum
  • Simon Einwächter
  • Rozita Fridman
  • Markus Herrmann
  • Michael Krüger
  • Norman Pohl
  • Frank Schwichtenberg
  • Klaus Zimmermann



OR2010, Research Data, Library and information sciences, DDC: 020


Research, especially in science, is increasingly data-driven (Hey & Trefethen, 2003). The obvious type of research data is raw data produced by experiments (by means of sensors and other lab equipment). However, other types of data are highly relevant as well: calibration and configuration settings, analyzed and aggregated data, data generated by simulations. Today, nearly all of this data is born-digital. Based on the recommendations for "good scientific practice", researchers are required to keep their data for a long time. In Germany, DFG demands 8-10 years for published results (Deutsche Forschungsgemeinschaft, 1998). Ideally, data should not only be kept and made accessible upon request, but be published as well - either as part of the publication proper, or as references to data sets stored in dedicated data repositories. Another emerging trend are data publication journals, e.g. the Earth System Science Data Journal ( In contrast to these high-level requirements, many research institutes still lack a well-established and structured data management. Extremely data-intense disciplines like high-energy physics or climate research have built powerful grid infrastructures, which they provide to their respective communities. But for most "small sciences", such complex and highly specialized compute and storage infrastructures are missing and may not even be adequate. Consequently, the burden of setting up a data management infrastructure and of establishing and enforcing data curation policies lie with each institute or university. The ANDS project has shown that this approach is even preferable over a central (e.g., national or discipline-specific) data repository (The ANDS Technical Working Group, 2007). However, delegating the task of proper data curation to the head of a department or a working group adds a huge workload to their daily work. At the same time, they typically have little training and experience in data acquisition and cataloging. The library has expertise in cataloging and describing textual publications with metadata, but typically lacks the disciplinespecific knowledge needed to assess the data objects in their semantic meaning and importance. Trying to link raw data with calibration and configuration data at the end of a project is challenging or impossible, even for dedicated "data curators" and researchers themselves. Consequently, researchers focus on their (mostly textual) publications and have no established procedures on how to cope with data objects after the end of a project or a publication (Helly, Staudigel, & Koppers, 2003). This dilemma can be resolved by acquiring and storing the data automatically at the earliest convenience, i.e. during the course of an experiment. Only at this point in time, all the contextual information is available, which can be used to generate additional metadata. Deploying a data infrastructure to store and maintain the data in a generic way helps to enforce organization-wide data curation policies. Here, repository systems like Fedora ( (Lagoze, Payette, Shin, & Wilper, 2005) or eSciDoc ( (Dreyer, Bulatovic, Tschida, & Razum, 2007) come into play. However, an organization-wide data management has only a limited added-value for the researcher in the lab. Thus, the data acquisition should take place in a non-invasive manner, so that it doesn't interfere with the established work processes of researchers and thus poses a minimal threshold to the scientist.