Tell me what it is, not how you use it!

In research data management, I have been telling people about the importance of describing a data set as "what it is" as opposed to "how you use it".

An early example that was given to me by a biobanking expert in The Netherlands was in the description of a chest X-ray: most likely such an image can be used both to study the bone in the spine, and to study the state of the major arteries (e.g. the aorta). If such an X-ray is acquired, it is likely that it is for one purpose only, but that does not exclude a re-use in another field. To optimize the re-usability of the data (see the FAIR principles), a chest X-ray should be labeled "chest X-ray" and not "X-ray of the spine" even if it was acquired for that specific goal.

I think this is similar to a "book cupboard". Most "book cupboards" are actually "book-size shelves". Those shelves can be used to store books, but can also be used for other purposes. To optimize the Findability of the right storage solution in the shop, it would be useful if the label did not express a single use.

Recently I heard yet another very good example of the same principle in an episode of the SE-radio podcast: Function names when writing computer code. It really improves the human readability of computer software (and hence the maintainability) if each function is named after what it does, rather than how it is used. The example from the podcast: do not name the function "reformat_email", but name it "remove_double_newlines" if that is what it does.
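The podcast example can be sketched in a few lines of Python. This is only an illustration: the two function names come from the example above, while the body and the sample text are my own assumptions about what such a function might do.

```python
import re


def remove_double_newlines(text: str) -> str:
    """Collapse every run of two or more newlines into a single newline.

    The name says what the function does, so a reader can trust it for
    any text, not only for e-mail (the hypothetical first use case).
    """
    return re.sub(r"\n{2,}", "\n", text)


# First use: tidying an e-mail body (sample text is made up).
email_body = "Hello,\n\nSee you tomorrow.\n\n\nBye"
cleaned_email = remove_double_newlines(email_body)

# Re-use in a different context, which a name like "reformat_email" would hide.
log_text = "step 1 done\n\nstep 2 done"
cleaned_log = remove_double_newlines(log_text)
```

Had the function been called "reformat_email", the second call would look like a mistake, even though the code does exactly what is needed.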

I've heard someone say "Researchers are the worst judges on the possibilities for re-use of their own data". It is true: a researcher studying the aorta will not even see the spine on their own X-ray images, let alone think about ways in which the data can be reused by bone researchers. I think labeling a data set with how it is used is a consequence of this. A trained librarian/archivist, with training in classification systems, will quickly see through such a mistake and suggest better naming.


Who is supposed to benefit from the EOSC?

Story 1. The European Open Science Cloud had its launch meeting in Vienna on November 23, 2018. The lectures represented big (hundreds of millions of euros), single-topic European science projects, telling the audience how important the EOSC will be.

Story 2. I like answering questions on Quora. Recently someone asked whether it is unhealthy to live in a humid climate. It took me 2 hours to find data on climate humidity and longevity, summarize both per country, and correlate. I could only find both numbers for 46 countries.

My opinion? The European Open Science Cloud will be most helpful in speeding up the answers to small questions that couple different data sets from diverse sources. The big projects collecting their own petabytes of data will manage to do exactly the same with or without a science cloud. But most of science consists of much smaller questions, possibly composing data into solutions for grand societal goals. These kinds of projects are served best by datasets that adhere to standards. They will benefit from the EOSC. And that is why it is a pity that the voice of such projects was not represented at the launch event.

Is there a need for a research data management specialism?

Fire damaged chemical lab Hiroshima; from Wikimedia; public domain


There is an interesting difference between how risks are approached in a research lab where a lot of data is handled and in a chemical lab. Many people working with data regularly encounter problems like not being able to locate data quickly or not being able to reproduce results exactly, but they often think these problems are an integral consequence of working with large amounts of data, and do not recognize that they are problems with data management practice and preparation. The equivalent in a chemical lab would be researchers thinking that daily fires and explosions naturally belong to working with chemical compounds, rather than recognizing these as a consequence of bad lab practice and bad preparation.

There is also resistance to the uptake of a data management specialism because many researchers think that data management is relatively easy. Everybody has a computer at home, and many maintain photo libraries. However, this experience does not directly translate into work with large amounts of data in the lab:

  • Data in the lab is often 1-3 orders of magnitude larger than a photo library at home. A maintenance job that costs an hour for a photo library would translate into more than 6 months of work in a large data-intensive project. Because of this, there is really a need for different approaches.
  • Data in a photo library consists of JPG files and maybe RAW files, and these files have simple 1 to 1 relationships. In the lab there are many more different kinds of data, and the relationships are much more complex.
  • A photo library is usually maintained by a single person. In the lab, the same data is worked on by different people, and they must each be aware of everything that is done by the others.
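The "more than 6 months" in the first point can be checked with a little arithmetic. The sketch below is a rough estimate only; it assumes that effort scales linearly with data volume and that a working month has about 160 hours (4 weeks of 40 hours), neither of which is stated above.

```python
# Assumption: roughly 4 weeks of 40 working hours in a month.
WORKING_HOURS_PER_MONTH = 160


def scaled_effort_months(hours_at_home: float, orders_of_magnitude: int) -> float:
    """Estimate months of full-time work if effort grows linearly with data size."""
    hours_in_lab = hours_at_home * 10 ** orders_of_magnitude
    return hours_in_lab / WORKING_HOURS_PER_MONTH


# A one-hour job at home, on data that is 3 orders of magnitude larger.
months = scaled_effort_months(1, 3)
```

Under these assumptions the one-hour job becomes 1000 hours, or a bit over six working months.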

And in fact, even in a photo library at home one cannot always quickly find what one is looking for.

No Solar Power

On December 27, 2014 during the night quite a nasty snow cover started to come down on a warm earth. At some point the snow stayed, but it stayed wet and icy. This managed to cover part of our solar panels in a thick layer of snow and ice. Since the cover was incomplete, the inverter refused to start and gave repeated errors. We ended the day with 0 Wh of power, a first for our installation. During the night, pieces of ice and snow kept coming down, giving rise to scary cracks and crashes on the roof.

December 28 I went to the roof around 13:00 to clean the last bits off with a broom: this was a dangerous operation from below: patches of ice and snow up to 50 kg crashed down on grass and driveway. The result of this action is quite apparent in the graph of solar power on that day!

Unstable operations

Incomplete snow cover

Please do ask questions at a lecture, except...

Via Twitter, I saw a very cynical remark about asking questions after a scientific lecture, with a flow diagram discouraging most people from asking anything at all. This does not at all correspond to my experience organizing symposia and conferences. Most of the time, questions are very welcome, and people are way too shy to share their views. I therefore made a rebuttal in the form of the following flow diagram, which I think is a better representation of the line of thought to follow.

Rotterdam CS renewed

New hall

Over the course of the last 4 years, Rotterdam Central Station has been completely renovated. Many times I have taken pictures (with my phone's camera, sorry), focusing on the area around tracks 1-3 where the action started and ended. For documentation of four years of change, look at the pictures in my Flickr account.

Reptile Zoo

Python

A few weeks ago I visited two reptile zoos together with Maxim: one in Breda and one in Tilburg. Click on the picture to see a sampling of a few pictures I took that day. I've been experimenting with some hand-held HDR pictures (three exposures each) at high ISO. All pictures have been post-processed through DXO Optics Pro 9; the HDR images have been aligned using Hugin and mapped using Luminance HDR.

Five star rating your own photos

Have you ever been wondering how to use the five stars in your photo catalog? I’ve heard people say: there are only two kinds of pictures: pictures you could show to someone, and pictures you wouldn’t show to anyone. Isn’t choosing between zero and one star enough?

Read more: Five star rating your own photos

Mac OS X: Shrink PDF files in the Finder

Today I finally got around to using Automator, and made one of the command line scripts I have been running for ages a little easier to use.

The problem I have been trying to solve is the fact that the PDF files that a Mac creates are normally of very high quality and hence large. If you just want to send someone a document for reading on screen, a much smaller PDF file would do. The open source package “ghostscript” has a tool called ps2pdf that can be (ab)used to adjust the size of components for PDF files. I installed this in /opt/local/bin using the “macports” software.
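As a rough idea of what such a script can look like, here is a minimal Python sketch. It is not the script from this post: the function names are hypothetical, and the choice of Ghostscript's "-dPDFSETTINGS=/screen" preset (low-resolution output meant for on-screen reading) is my assumption. Only the ps2pdf tool and the /opt/local/bin location come from the text above.

```python
import subprocess

# Location taken from the text above (MacPorts install); adjust if yours differs.
PS2PDF = "/opt/local/bin/ps2pdf"


def build_shrink_command(source_pdf: str, target_pdf: str,
                         preset: str = "/screen") -> list[str]:
    """Build the ps2pdf call; the preset is passed through to Ghostscript."""
    return [PS2PDF, f"-dPDFSETTINGS={preset}", source_pdf, target_pdf]


def shrink_pdf(source_pdf: str, target_pdf: str) -> None:
    """Write a smaller copy of source_pdf to target_pdf (needs ghostscript)."""
    subprocess.run(build_shrink_command(source_pdf, target_pdf), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
command = build_shrink_command("report.pdf", "report-small.pdf")
```

Writing to a separate target file keeps the high-quality original intact in case the smaller version turns out to be too coarse.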

Read more: Mac OS X: Shrink PDF files in the Finder

Very odd ratio

I was reading some news when I noticed a mathematical curiosity. The article on a physics result mentioned a chance of “one in ten to the minus 7”. Of course this is a mistake: a small chance is either “one in ten to the 7” or “ten to the minus 7”. The combination of “one in” and “minus” is nonsense. Interestingly enough, this mistake is really common…
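To see why the combination is nonsense, the three phrases can be written out as exact fractions. This is just a small check using Python's standard fractions module; the variable names are mine.

```python
from fractions import Fraction

# "one in ten to the 7": a chance of 1 out of 10,000,000.
one_in_ten_to_the_7 = Fraction(1, 10**7)

# "ten to the minus 7": the same number, written with a negative exponent.
ten_to_the_minus_7 = Fraction(10) ** -7

# "one in ten to the minus 7": dividing by the small number gives a huge one.
one_in_ten_to_the_minus_7 = 1 / ten_to_the_minus_7

same_chance = one_in_ten_to_the_7 == ten_to_the_minus_7
is_valid_probability = one_in_ten_to_the_minus_7 <= 1
```

The first two expressions are equal, while the third works out to ten million, which cannot be a probability at all.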

#math #oops