Mining the data mountain: supercomputer analyses and filters existing knowledge

Which crop grows best where? Which species all have gene X? These are complex questions and we have never been sure how to tackle them. Feed all the available data into a supercomputer, however, and the answers will come rolling out.
Stijn van Gils

Text: Stijn van Gils. Illustration: Pascal Tieman

The amount of data collected and stored at WUR has increased tremendously in recent years. Where formerly a single plant would be meticulously studied to see what antibodies it makes, nowadays drones can use light reflection to estimate this for an entire field of plants in real time. At the same time, we can follow developments around the world more closely. For example, the Laboratory for Geo-information Science and Remote Sensing is working on a system that monitors, on a daily basis, where illegal felling is going on and how much.

Dick de Ridder, professor in the Bioinformatics chair group, has seen this rapid increase in the amount of data in his field too. ‘In 1988 a huge project started to map the entire human genome. The operation was completed in 2003. To date about 14,000 genomes have been completely mapped and another 85,000 almost completely.’

Too big for Excel

That gigantic mountain of data makes it possible to answer new questions, explains De Ridder. ‘Questions of an entirely different order. For example: which species have one particular gene and which other genes does that go together with?’ Other WUR groups are also trying to make more use of big data: datasets that are much too big for Excel. Animal Sciences, for example, is working a lot with machine learning, a technique in which a computer more or less autonomously searches great mountains of data for patterns. Other chair groups such as Genetics and Bioinformatics know a lot about this too. ‘It’s just that the knowledge and expertise are still scattered through the organization,’ says Willem Jan Knibbe, head of the Wageningen Data Competence Center (WDCC, see inset).

The WDCC wants to tackle that fragmentation and use model projects to bring together existing knowledge about big data. ‘You can take simple questions,’ says Knibbe, ‘such as the age-old question of which crop grows best where. Several WUR groups have been doing research on this for years. Researchers have made growth models, set up experiments on different soils, and made economic analyses. But you can only solve part of the puzzle with such data sets,’ explains Knibbe. ‘Because it is a broad question that depends on a wide range of factors. The soil is one, but there is also the proximity of factories, a market or infrastructure such as harbours and roads. The nice thing is that we’ve got all that information at WUR.’

Combining data

Social scientists at WUR work with the Global Detector, a market information system. In the Environmental Sciences, the AgroDataCube is under development: a big data portal which includes weather conditions on plots of land and information about crop growth. Plant Sciences has the Akkerweb platform, which combines information about soil conditions and disease pressure in order to provide farmers with recommendations. The WDCC tries to support these initiatives and create combinations where possible.

The WDCC also wants to look at methods that help us to understand food security risks better. ‘Can we mainly expect risks related to transport, to certification, or to some other part of the process? There is a lot of data that can tell us something about this, and with which we can pinpoint what to look out for specifically,’ explains Knibbe. And what about methane emissions from barns? ‘Do we need to measure this per barn, or even per cow? Or can we estimate it adequately using already registered information?’


New methods are under development for analysing gigantic datasets, says Knibbe. The standard procedure is to work with a sample, getting measurements for part of a group under similar conditions. In the case of big data, there is often information about the entire group, but it was measured under varying conditions. ‘With this data we can discover all kinds of interesting relationships using Bayesian analysis or various methods of machine learning. It requires much greater computing power, beyond the capacity of an ordinary PC.’
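To give a feel for the kind of reasoning involved, here is a minimal sketch of one textbook Bayesian technique: combining measurements that were taken under varying conditions, so that each one is weighted by how reliable it is. It is a simplified illustration and not the method used by any WUR group; the function name and all numbers are hypothetical.

```python
def combine_measurements(prior_mean, prior_var, observations):
    """Normal-normal Bayesian update (illustrative assumption, not WUR's method).

    observations: iterable of (value, noise_variance) pairs. A measurement
    taken under poor conditions has a large noise variance and therefore
    counts for less in the combined estimate.
    Returns (posterior_mean, posterior_variance).
    """
    precision = 1.0 / prior_var            # precision = 1 / variance
    weighted_sum = prior_mean / prior_var
    for value, noise_var in observations:
        precision += 1.0 / noise_var
        weighted_sum += value / noise_var
    posterior_var = 1.0 / precision
    return weighted_sum * posterior_var, posterior_var


# Hypothetical example: two readings of the same quantity, the second one
# taken under noisier conditions (four times the variance of the first).
mean, var = combine_measurements(
    prior_mean=0.0,
    prior_var=1e6,                         # very vague prior knowledge
    observations=[(10.0, 1.0), (20.0, 4.0)],
)
print(f"combined estimate: {mean:.2f} (variance {var:.2f})")
```

The combined estimate ends up closer to the more reliable reading. With millions of measurements and more realistic models, the same idea quickly needs more computing power than a desktop PC offers.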

WUR has its own central supercomputer for large-scale calculations. This ‘high-performance cluster’, which consists of several servers linked together, can run calculations simultaneously that would otherwise have to be run one after another. Recently, it has been used more intensively, says Dick de Ridder. The plan to stop charging chair groups for each calculation separately is a particularly good incentive, he thinks. ‘But the cluster is due to be written off soon, and the question is what we should do then.’
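The difference between consecutive and simultaneous calculation can be sketched in a few lines of Python. This is only an illustration of the pattern on a single machine: worker threads stand in for the servers of a cluster, and the function names and data are hypothetical. A real cluster would spread the work over separate servers, typically via a job scheduler, and in standard Python, threads alone do not speed up this kind of pure number-crunching.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_chunk(chunk):
    """Stand-in for an expensive analysis of one part of a dataset."""
    return sum(x * x for x in chunk)


def run_consecutively(chunks):
    # One calculation after another, as on a single ordinary PC.
    return [analyse_chunk(chunk) for chunk in chunks]


def run_simultaneously(chunks, workers=4):
    # Hand the chunks to a pool of workers; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyse_chunk, chunks))


# Hypothetical dataset, split into eight independent parts.
chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
results = run_simultaneously(chunks)
print(len(results), "partial results computed")
```

The approach works because the parts are independent of each other: no calculation has to wait for the result of another, so the order in which they run does not matter.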

One option is to invest in a new one, but outsourcing to an external company would be another possibility. Some groups, such as Remote Sensing, already do this because they use data from external parties and prefer to have the computing done on computers that are located close to those datasets. On the other hand, having a shared high-performance cluster of your own ensures that groups can give each other tips and advice on using it, says Petra Caessens, manager at Shared Research Facilities. So she feels a shared supercomputer would be the best option. A decision has yet to be taken, however.

Tropical forest map

A nice example of combining datasets is Lucid (Land Use, Carbon & Emission Data), a world map showing the amount of biomass in tropical forest per hectare. The creators combined satellite data with a large number of field observations. Researchers can use the information about biomass to see where forest regeneration can lead to extra CO2 storage. The map can be seen on lucid.wur.nl

Ask the Data Desk

Last autumn, WUR set up the Wageningen Data Competence Center to boost the use of big data. The WDCC supports chair groups in mining data, and is involved in education initiatives on big data. The centre inventories the available expertise at WUR and looks out for new possibilities for combining existing knowledge. Individual teachers and researchers can bring their questions to the Data Desk, which is partly managed by the WDCC.
