Tell me what it is, not how you use it!

Regarding research data management, I have been telling people about the importance of describing a data set by what it is, as opposed to how you use it.

An early example was given to me by a biobanking expert in The Netherlands: the description of a chest X-ray. Such an image can most likely be used both to study the bones of the spine and to study the state of the major arteries (e.g. the aorta). When such an X-ray is acquired, it is usually for one purpose only, but that does not exclude re-use in another field. To optimize the re-usability of the data (see the FAIR principles), a chest X-ray should be labeled "chest X-ray" and not "X-ray of the spine", even if it was acquired for that specific goal.
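To make this concrete, here is a minimal sketch of the idea as metadata; the field names are my own illustration, not any particular standard:

```python
# Hypothetical metadata records; the field names are illustrative only.
bad = {"label": "X-ray of the spine"}  # named after one intended use

good = {
    "label": "chest X-ray",         # named after what it is
    "modality": "X-ray",
    "body_part": "chest",
    "acquired_for": "spine study",  # the original purpose, recorded separately
}
```

The original purpose is not lost; it simply moves to its own field, so the label stays open to other uses.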

I think this is similar to a "book cupboard". Most "book cupboards" are actually "book-size shelves". Those shelves can be used to store books, but can also be used for other purposes. To optimize the Findability of the right storage solution in the shop, it would be useful if the label did not express a single use.

Recently I heard yet another very good example of the same principle in an episode of the SE-Radio podcast: function names in computer code. It really improves the human readability of software (and hence its maintainability) if each function is named after what it does, rather than after how it is used. The example from the podcast: do not name a function "reformat_email"; name it "remove_double_newlines" if that is what it does.
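A minimal sketch of the podcast's example (my own illustration of it):

```python
import re

def remove_double_newlines(text: str) -> str:
    """Collapse runs of two or more newlines into a single newline."""
    return re.sub(r"\n{2,}", "\n", text)

# The caller expresses the *use*; the function name expresses what it *does*.
cleaned_email = remove_double_newlines("Dear reader,\n\n\nHello!\n")
```

The caller remains free to apply the same function to anything else with superfluous blank lines, just as the chest X-ray remains free to serve the bone researcher.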

I've heard someone say "researchers are the worst judges of the possibilities for re-use of their own data". It is true: a researcher studying the aorta will not even see the spine on their own X-ray images, let alone think about ways in which the data could be re-used by bone researchers. I think labeling a data set with how it is used is a consequence of this. A trained librarian/archivist, schooled in classification systems, will quickly see through such a mistake and suggest a better name.



Who is supposed to benefit from the EOSC?

Story 1. The European Open Science Cloud (EOSC) had its launch meeting in Vienna on November 23, 2018. The lectures represented big (hundreds of millions of euros) single-topic European science projects, telling the audience how important the EOSC will be.

Story 2. I like answering questions on Quora. Recently someone asked whether it is unhealthy to live in a humid climate. It took me two hours to find data on climate humidity and on longevity, summarize both per country, and correlate them. I could find both numbers for only 46 countries.
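For illustration, the exercise boils down to something like the sketch below; the file and column names are assumptions, and finding tables that share a country key was exactly what took most of the two hours:

```python
import pandas as pd

# Hypothetical input files, one row per country.
humidity = pd.read_csv("humidity_by_country.csv")    # columns: country, humidity
longevity = pd.read_csv("longevity_by_country.csv")  # columns: country, life_expectancy

merged = humidity.merge(longevity, on="country")     # keeps countries present in both
print(len(merged), "countries with both numbers")    # 46, in my case
print(merged["humidity"].corr(merged["life_expectancy"]))
```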

My opinion? The European Open Science Cloud will be most helpful in speeding up the answers to small questions that couple different data sets from diverse sources. The big projects collecting their own petabytes of data will manage to do exactly the same with or without a science cloud. But most of science consists of much smaller questions, often composing data into solutions for grand societal challenges. Such projects are served most by data sets that adhere to standards, and they will benefit from the EOSC. That is why it is a pity that their voice was not represented at the launch event.


Is there a need for a research data management specialism?

[Image: fire-damaged chemical lab, Hiroshima; from Wikimedia; public domain]

Yes!

There is an interesting difference between how risks are approached in a research lab that handles a lot of data and in a chemical lab. Many people working with data regularly encounter problems like not being able to locate data quickly or not being able to reproduce results exactly, but often they think these problems are an inherent consequence of working with large amounts of data, and do not recognize them as problems of data management practice and preparation. The equivalent in a chemical lab would be researchers thinking that daily fires and explosions naturally belong to working with chemical compounds, rather than recognizing them as a consequence of bad lab practice and bad preparation.

There is also resistance to the uptake of a data management specialism because many researchers think that data management is relatively easy. Everybody has a computer at home, and many people maintain photo libraries. However, this experience does not directly translate to work with large amounts of data in the lab:

  • Data in the lab is often 1-3 orders of magnitude larger than a photo library at home. A maintenance job that costs an hour for a photo library would then translate into up to a thousand hours, more than 6 months of full-time work, in a large data-intensive project. At that scale there is a real need for different approaches.
  • Data in a photo library consists of JPG files and maybe RAW files, with simple one-to-one relationships between them. In the lab there are many more kinds of data, and the relationships are much more complex.
  • A photo library is usually maintained by a single person. In the lab, the same data is worked on by different people, and each of them must be aware of what the others have done.

And in fact, even in a photo library at home, one cannot always quickly find what one is looking for.


No Solar Power

During the night of December 27, 2014, quite a nasty snowfall came down on warm ground. At some point the snow stayed, but it remained wet and icy, and it covered part of our solar panels in a thick layer of snow and ice. Because the cover was incomplete, the inverter refused to start and gave repeated errors. We ended the day with 0 Wh of production, a first for our installation. During the night, pieces of ice and snow kept sliding down, giving rise to scary cracks and crashes on the roof.

On December 28, around 13:00, I went onto the roof to sweep off the last bits with a broom. This was a dangerous operation for anyone below: patches of ice and snow of up to 50 kg crashed down onto the grass and the driveway. The result of this action is quite apparent in the graph of solar power for that day!

[Graphs: "Unstable operations"; "Incomplete snow cover"]


Please do ask questions at a lecture, except...

Via Twitter, I saw a very cynical remark about asking questions after a scientific lecture, with a flow diagram discouraging most people from asking anything at all. This does not at all correspond to my experience organizing symposia and conferences. Most of the time, questions are very welcome, and people are far too shy to share their views. I therefore made a rebuttal in the form of the following flow diagram, which I think is a better representation of the line of thought to follow.


Reptile Zoo

A few weeks ago I visited two reptile zoos together with Maxim: one in Breda and one in Tilburg. I have been experimenting with hand-held HDR pictures (three exposures each) at high ISO. All pictures were post-processed with DxO Optics Pro 9; the HDR images were aligned using Hugin and tone-mapped using Luminance HDR.



An Hourglass representing Research and Technology

The programming we choose to do in our team is like the neck of an hourglass representing life science research and technology:

  • There are many grains of sand above us. Those represent all the software tools developed in life science research.
  • There is a large void below us. This represents the need for widely applicable tools in the life sciences.
  • At the bottom there are also grains of sand. They are well settled. These represent the current technology: commercially available tools and well-serviced open source packages.

How does the sand get from the top to the bottom? Via the neck of development. The neck is narrow; only a few academic tools make it to the bottom. The flow through the neck is powered by:

  • push: a few academic groups that have the capability and capacity to make their tools available
  • pull: a few companies that look far ahead and are able to see and use the potential of an academic tool

In our hourglass of life science tools, new sand is added at the top all the time, and most of it overflows the top chamber after a while. Some tools never deliver what their authors hoped they would. Some are made to solve a single problem and are rightfully abandoned when that is done. But many tools are published and then left as orphans. Only a selection of tools that promise to be useful to a larger audience ever makes it to the neck.

In practice, the neck is too narrow: there are many more valuable tools than are taken up. A team like ours can help widen the neck by making existing research tools applicable for wider use, as a service to life scientists with a clear need (we call this professionalization). But it is sometimes hard to convince funding parties to pay for this. It is also hard to convince researchers to work on making their software better: professionalization does not generate new high-impact papers. We work on convincing funding parties that it is better to professionalize existing successes than to reinvent them with research money. And we work on convincing scientists that professionalization of their output will lead to higher citation scores on their existing publications.

Science wants novelty, and the current Dutch funding climate is directed towards applied science, towards innovation in society. Look at the picture and you can see that these are hard to combine: innovation starts where novelty ends. The only way to make the combination work is to include development.

Photo by graymalkn on flickr


Mac OS X: Shrink PDF files in the Finder

Today I finally got around to using Automator to make one of the command-line scripts I have been running for ages a little easier to use.

The problem I have been trying to solve is that the PDF files a Mac creates are normally of very high quality and hence large. If you just want to send someone a document to read on screen, a much smaller PDF file would do. The open source package "ghostscript" has a tool called ps2pdf that can be (ab)used to adjust the size of the components of PDF files. I installed it in /opt/local/bin using MacPorts.
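As a rough sketch of the core idea (not the exact script from this post), Ghostscript can re-render a PDF at a lower quality setting; the gs path below assumes the MacPorts install mentioned above, and the default quality setting is my own choice:

```python
import subprocess
import sys

def shrink_pdf(src: str, dst: str, quality: str = "/ebook") -> None:
    """Re-render a PDF at a lower quality setting to reduce its size."""
    subprocess.run(
        [
            "/opt/local/bin/gs",         # MacPorts location of Ghostscript
            "-sDEVICE=pdfwrite",
            f"-dPDFSETTINGS={quality}",  # /screen, /ebook, /printer, /prepress
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={dst}",
            src,
        ],
        check=True,
    )

if __name__ == "__main__":
    shrink_pdf(sys.argv[1], sys.argv[2])
```

Wrapped in an Automator service, something like this gives the shrink-in-Finder behavior the title refers to.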

