- Details
The term “data outlier” rests on hidden assumptions. A completely different way to think about outliers is as points that do not fit your understanding of the distribution of errors underlying the data acquisition.
Unfortunately, we often falsely assume a “Normal” (Gaussian) distribution of errors. Did you know that in a “Normal” distribution a deviation of 11 sigma is much, much, much less likely than a deviation of 10 sigma? Does that correspond to your experience? Not mine: in practice, deviations of 11 sigma are about as likely as deviations of 10 sigma. I see neither of these as outliers; they are just telling you that your error distribution is non-“Normal”.
In 1971, Abrahams and Keve (10.1107/S0567739471000305) described a beautiful way to verify the error model: sort the errors and make a plot (a Normal Probability Plot) of their values against the values expected under the assumption that they follow a normal distribution. If the errors really are Gaussian, the resulting plot is a straight line. If it is not straight, the errors are not distributed following a Gaussian.
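A minimal sketch of such a probability plot, using `scipy.stats.probplot` (the simulated "observed" errors here are deliberately heavy-tailed, so they will not line up with the Gaussian expectation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated heavy-tailed "observed" errors (Student's t with 3 d.o.f.)
errors = rng.standard_t(df=3, size=500)

# probplot sorts the errors and pairs them with the order statistics
# expected under a normal distribution; r measures the linearity.
(osm, osr), (slope, intercept, r) = stats.probplot(errors, dist="norm")
# osm: expected Gaussian quantiles, osr: the sorted observed errors.
# Points far off the fitted line at the extremes signal heavy tails,
# not "outliers".
print(f"probability-plot correlation: r = {r:.3f}")
```

Plotting `osr` against `osm` gives the Normal Probability Plot described above; curvature at the ends is the visual signature of a non-Gaussian error model.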
I suffered from this myself in my research. For me, a very good solution was to replace the Normal distribution with a Student's t-distribution (10.1107/S0108767309009908). The best value of the parameter ν of that distribution can be derived by linearizing the probability plot. By following that procedure, it was no longer necessary for me to remove any “outliers”: all data points could be used in the analysis (10.1107/S0021889810018601).
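The procedure can be sketched as a small search: try a range of ν values and keep the one that makes the probability plot most linear. This is an illustrative simplification of the cited method, not a reproduction of it; `scipy.stats.probplot` reports the linearity as a correlation coefficient we can maximize.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated errors drawn from a t-distribution with nu = 4
errors = rng.standard_t(df=4, size=1000)

best_nu, best_r = None, -np.inf
for nu in range(1, 31):
    # Probability plot against a Student's t with the candidate nu;
    # r close to 1 means the plot is close to a straight line.
    (osm, osr), (slope, intercept, r) = stats.probplot(
        errors, dist=stats.t, sparams=(nu,))
    if r > best_r:
        best_nu, best_r = nu, r

print(f"most linear probability plot at nu = {best_nu} (r = {best_r:.4f})")
```

With the right ν, the extreme points fall on the line like all the others, and there is nothing left to reject.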
Outliers don’t exist. If you think they do, you are probably misunderstanding your error model. And properly understanding your error model can teach you much more than you can learn from rejecting outliers through applying some empirical rule.
[This post was triggered by an AI-generated guide on handling outliers on linkedin.]
- Details
If you've ever done any project planning, you may know that Douglas Hofstadter, in his book Gödel, Escher, Bach, formulated an important law, named Hofstadter's law:
It always takes longer than you expect, even when you take into account Hofstadter's Law
In practice, this recursive law is very hard to apply. I use a non-recursive variant myself that also takes into account project planning:
A carefully planned development project takes π times longer than you think. If you don't plan carefully, it will be π²
- Details
Regarding research data management, I have been telling people about the importance of describing a data set as "what it is" as opposed to "how you use it".
An early example, given to me by a biobanking expert in The Netherlands, was the description of a chest X-ray: most likely such an image can be used both to study the bone in the spine and to study the state of the major arteries (e.g. the aorta). If such an X-ray is acquired, it is likely for one purpose only, but that does not exclude re-use in another field. To optimize the re-usability of the data (see the FAIR principles), a chest X-ray should be labeled "chest X-ray" and not "X-ray of the spine", even if it was acquired for that specific goal.
I think this is similar to a "book cupboard". Most "book cupboards" are actually "book-size shelves". Those shelves can be used to store books, but can also be used for other purposes. To optimize the Findability of the right storage solution in the shop, it would be useful if the label did not express a single use.
Recently I heard yet another very good example of the same principle in an episode of the SE-radio podcast: Function names when writing computer code. It really improves the human readability of computer software (and hence the maintainability) if each function is named after what it does, rather than how it is used. The example from the podcast: do not name the function "reformat_email", but name it "remove_double_newlines" if that is what it does.
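The podcast's example can be made concrete in a few lines. The function bodies below are my own hypothetical reconstruction of what such a function might do; the point is purely the naming:

```python
# Avoid: the name describes one particular use of the function.
def reformat_email(text: str) -> str:
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text

# Prefer: the name describes what the function actually does,
# which invites re-use on any text, not just e-mail bodies.
def remove_double_newlines(text: str) -> str:
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text

print(remove_double_newlines("Hi,\n\n\nsee you soon.\n"))
```

A reader of `remove_double_newlines` knows exactly what to expect; a reader of `reformat_email` has to open the body to find out, and would never think of calling it on a log file.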
I've heard someone say "Researchers are the worst judges of the possibilities for re-use of their own data". It is true: a researcher studying the aorta will not even see the spine on their own X-ray images, let alone think about ways in which the data could be reused by bone researchers. I think labeling a data set with how it is used is a consequence of this. A trained librarian/archivist, with training in classification systems, will quickly see through such a mistake and suggest better naming.
- Details
Story 1. The European Open Science Cloud had its launch meeting in Vienna on November 23, 2018. The lectures represented big (hundreds of millions of euros), single-topic European science projects, telling the audience how important the EOSC will be.
Story 2. I like answering questions on Quora. Recently someone asked whether it is unhealthy to live in a humid climate. It took me 2 hours to find data on climate humidity and longevity, summarize both per country, and correlate. I could only find both numbers for 46 countries.
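The two hours went into exactly the kind of data coupling the EOSC should make trivial. A hypothetical sketch of that analysis with pandas (the country names and numbers below are made up for illustration, not the real data):

```python
import pandas as pd

# Two per-country summaries from different, unrelated sources.
humidity = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "mean_rel_humidity": [78.0, 55.0, 64.0, 82.0],
})
longevity = pd.DataFrame({
    "country": ["A", "B", "C", "E"],
    "life_expectancy": [81.2, 76.5, 79.0, 70.3],
})

# An inner join keeps only countries present in both data sets
# (46 countries in the real exercise described above).
merged = humidity.merge(longevity, on="country", how="inner")
r = merged["mean_rel_humidity"].corr(merged["life_expectancy"])
print(f"{len(merged)} countries in common, Pearson r = {r:.2f}")
```

The coding takes minutes; the two hours were spent finding the data and getting it into a joinable shape, which is precisely the cost standardized, findable data sets would remove.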
My conclusion? A well-functioning European Open Science Cloud will be most helpful in speeding up the answers to small questions by making it possible to couple different data sets from diverse sources. The big projects collecting their own petabytes of data will manage to do exactly the same with or without a science cloud. But most of science consists of much smaller questions, possibly composing data into solutions for grand societal challenges. These kinds of projects are best served by datasets that adhere to standards. They will benefit from the EOSC. And that is why it is a pity that the voice of such projects was not represented at the launch event.
- Details

Yes!
There is an interesting difference between how risks are often approached in a research lab where a lot of data is handled and in a chemical lab. Many people working with data regularly encounter problems like not being able to locate data quickly or not being able to reproduce results exactly, but often they think these problems are an integral consequence of working with large amounts of data, and do not recognize them as problems of data management practice and preparation. The equivalent in a chemical lab would be researchers thinking that daily fires and explosions naturally belong to working with chemical compounds, rather than recognizing these as a consequence of bad lab practice and bad preparation.
There is also resistance to the uptake of a data management specialism because many researchers think that data management is relatively easy. Everybody has a computer at home, and many maintain photo libraries. However, this experience does not directly translate into work with large amounts of data in the lab:
- Data in the lab is often 1-3 orders of magnitude larger than a photo library at home. A maintenance job that takes an hour for a photo library would translate into more than 6 months of work in a large data-intensive project. Because of this, there is really a need for different approaches.
- Data in a photo library consists of JPG files and maybe RAW files, and these files have simple 1 to 1 relationships. In the lab there are many more different kinds of data, and the relationships are much more complex.
- A photo library is usually maintained by a single person. In the lab, the same data is worked on by different people, and they must each be aware of everything that is done by the others.
And in fact, even in a photo library at home one cannot always quickly find what one is looking for.
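The scaling claim in the first bullet is simple back-of-the-envelope arithmetic; the working-time figures below are illustrative assumptions, not data from the post:

```python
# One hour of maintenance, scaled up three orders of magnitude.
hours = 1 * 10**3

# Convert to full-time working months (assuming a 40-hour week
# and ~4.33 weeks per month).
weeks = hours / 40
months = weeks / 4.33

print(f"{hours} hours is roughly {months:.1f} months of full-time work")
```

So a routine that is a minor chore at home becomes, at lab scale, a half-year project that clearly needs a different approach.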
- Details
Via twitter, I saw a very cynical remark about asking questions after a scientific lecture, with a flow diagram discouraging most people from asking anything at all. This does not at all correspond to my experience organizing symposia and conferences. Most of the time, questions are very welcome, and people are way too shy to share their views. I therefore made a rebuttal in the form of the following flow diagram, which I think is a better representation of the line of thought to follow.
- Details

Over the course of the last 4 years, Rotterdam Central Station has been completely renovated. Many times I have taken pictures (with my phone's camera, sorry), focusing on the area around tracks 1-3 where the action started and ended.
- Details
A few weeks ago I visited two reptile zoos together with Maxim: one in Breda and one in Tilburg. I've been experimenting with some hand-held HDR pictures (three exposures each) at high ISO. All pictures have been post-processed through DXO Optics Pro 9, the HDR images have been aligned using Hugin and mapped using Luminance HDR.
- Details
The programming we choose to do in our team is like the neck in the hourglass representing life science research and technology:

- There are many grains of sand above us. Those represent all the software tools developed in life science research.
- There is a large void below us. This represents the need for widely applicable tools in the life sciences.
- At the bottom there are also grains of sand. They are well settled. These represent the current technology: commercially available tools and well-serviced open source packages.
How does the sand get from the top to the bottom? Via the neck of development. It is narrow; only a few academic tools make it to the bottom. The flow through the neck is powered by:
- push: a few academic groups that have the capability and capacity to make their tools available
- pull: a few companies that look far ahead and are able to see and use the potential of an academic tool

In our hourglass of life science tools, new sand is being added at the top all the time. And most of it spills over the top after a while. Some tools never deliver what the author thought they would. Some are made to solve a single problem and rightfully abandoned when that is done. But many tools are published and left as orphans. Only a selection of tools that promise to be useful for a larger audience ever make it to the neck.
In practice, the neck is too narrow. There are many more valuable tools than are taken up. A team like ours can help to make the neck larger by making existing research tools applicable for wider use as a service to life scientists with a clear need (we call it professionalization). But it is sometimes hard to convince funding parties to pay for this. It is also hard to convince researchers to work on making their software better: professionalization does not generate new high-impact papers. We work on convincing the funding parties that it is better to professionalize existing successes than to reinvent them using research money. And we work on convincing the scientists that professionalization of their output will lead to higher citation scores on their existing publications.
Science wants novelty. And the current Dutch finance climate is directed towards applied science, towards innovation in society. Look at the picture, and you can see that these are hard to combine. Innovation starts where novelty ends. The only way to make the combination is to include development.
Photo by graymalkn on flickr
- Details
Have you ever been wondering how to use the five stars in your photo catalog? I’ve heard people say: there are only two kinds of pictures: pictures you could show to someone, and pictures you wouldn’t show to anyone. Isn’t choosing between zero and one star enough?