Is there a need for a research data management specialism?

Fire damaged chemical lab Hiroshima; from Wikimedia; public domain


There is an interesting difference between how risks are approached in a research lab where a lot of data is handled and in a chemical lab. Many people working with data regularly encounter problems like not being able to locate data quickly or not being able to reproduce results exactly, but they often think these problems are an inherent consequence of working with large amounts of data, and do not recognize them as problems of data management practice and preparation. The equivalent in a chemical lab would be researchers thinking that daily fires and explosions naturally belong to working with chemical compounds, rather than recognizing these as a consequence of bad lab practice and bad preparation.

There is also resistance to the uptake of a data management specialism because many researchers think that data management is relatively easy. Everybody has a computer at home, and many maintain photo libraries. However, this experience does not directly translate into work with large amounts of data in the lab:

  • Data in the lab is often 1-3 orders of magnitude larger than a photo library at home. At three orders of magnitude, a maintenance job that takes an hour for a photo library would translate into about 1,000 hours, or more than 6 months of work, in a large data-intensive project. Work at this scale needs different approaches.
  • Data in a photo library consists of JPG files and maybe RAW files, and these files have simple one-to-one relationships. In the lab there are many more kinds of data, and the relationships between them are much more complex.
  • A photo library is usually maintained by a single person. In the lab, the same data is worked on by different people, and each of them must be aware of everything the others have done.

And in fact, even in a photo library at home one cannot always quickly find what one is looking for.