11/01/2017 Opinion

‘Big data’: the infinitesimal detail of an almost infinite number of cases

Senior researcher

Francisco Lloret Maya

Ecology Professor at the Universitat Autònoma de Barcelona (UAB, Spain) and researcher at CREAF. He is a member of the Executive Committee of the European Ecological Federation, the Sociedad Ibérica de Ecología and the 

The almost infinite amount of data we are now capable of generating, known as 'big data', offers great opportunities but also great challenges, both for science and for society at large.


Dear friend Ramon,

A few days ago you told me about your interest in the business being generated around 'big data'. I replied that I would find you an article on the subject, but it is hard to choose just one. Papers built on huge amounts of data have become commonplace in the leading scientific journals. In September 2016, Miraldo et al. published in Science a global-scale map of vertebrate genetic diversity based on 92,801 mitochondrial sequences from more than 450 species.

The magnitude of that information escapes any intuitive perception. We talk about megas, gigas and teras to keep those magnitudes on a short leash; after all, we only have ten fingers. The handling of enormous amounts of data (for which we have already coined the term 'big data') attracts scientists, public institutions (for security as well as electoral reasons) and companies. You explained how supermarket chains are keen to use 'big data' to build very precise customer profiles and offer them personalized deals.

Graph representing the metadata of thousands of archive documents, documenting the social network of hundreds of League of Nations personnel. Author: Martin Grandjean (CC BY-SA 3.0)

While reviewing these journals, I set out to understand where the current frontier of the empirical sciences lies. It is not a very novel idea: there are annual rankings, and many books deal with the subject. I concluded that the new territory to be explored by many disciplines is the one opened up by the ability to obtain infinitesimal detail about an almost infinite number of cases.

In biology, from the second half of the twentieth century onwards, technical advances provided very detailed descriptions of selected objects. The electron microscope made it possible to visualize the innards of a cell, or of a handful of them. Later, the development of genomic sequencing techniques made it possible to unravel every piece of selected nucleic acid molecules, particularly human ones. Neurobiologists can already identify the activity of individual cells within complex neural networks. In parallel, the disciplines that study nature achieved a complete overview of their objects of study. Cartographers had already managed to cover the whole Earth with mathematical rigor in the nineteenth century, although precision was scarce in unpopulated places. In turn, biologists and geologists obtained reasonably complete inventories of plant and animal organisms, and of geological structures, respectively.

The leap in knowledge we are currently experiencing represents a shift from great precision in the measurement of a few objects to great detail in almost all of them.

The leap in knowledge we are currently experiencing represents a shift from great precision in the measurement of a few objects to great detail in almost all of them. This thoroughness implies an enormous amount of information that could not be processed without the advances that have occurred simultaneously in computing. In just a few years we have witnessed the spectacle of any object a few square meters in size, exposed to the open air anywhere in the world, being easily viewed from the sky on a screen. Unlike a single molecule, which escapes the perception of our senses, we all recognize in those zenithal images the habitat in which we take shelter, and that makes the technique feel more tangible.

More examples: researchers at the University of Maryland have developed the Global Forest Change project, which provides a detailed view of the loss or gain of forest area anywhere in the world. We also have techniques such as LIDAR (a kind of radar that uses laser beams) which, among other applications, makes it possible to map vegetation cover and its height above the ground to within centimeters. For now we have to restrict ourselves to small areas, but there is no intellectual obstacle that prevents us from thinking that we could obtain that information for every square centimeter of the Earth's surface. Nor is there any theoretical obstacle to genotyping every organism on Earth. As long as they let themselves be captured, of course, and here lies one of the key questions.
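To make the LIDAR idea more concrete, here is a minimal sketch in Python of how a canopy height model could be derived from an airborne point cloud: vegetation returns are referenced to the ground returns in each grid cell. The synthetic point cloud, the grid resolution and the function name are illustrative assumptions, not part of any specific LIDAR product or of the Global Forest Change pipeline.

```python
import numpy as np

def canopy_height_model(x, y, z, is_ground, cell_size=1.0):
    """Very simplified canopy height model (illustrative sketch).

    x, y, z   : coordinates of LiDAR returns (metres)
    is_ground : boolean array, True for returns classified as ground
    cell_size : grid resolution in metres (illustrative value)

    For each grid cell, canopy height = highest vegetation return
    minus the lowest ground return in that cell.
    """
    # Assign every return to a grid cell
    col = np.floor((x - x.min()) / cell_size).astype(int)
    row = np.floor((y - y.min()) / cell_size).astype(int)
    n_rows, n_cols = row.max() + 1, col.max() + 1

    ground = np.full((n_rows, n_cols), np.nan)   # ground elevation per cell
    top = np.full((n_rows, n_cols), np.nan)      # highest return per cell

    for r, c, height, g in zip(row, col, z, is_ground):
        if g:
            # keep the lowest ground return as the ground reference
            if np.isnan(ground[r, c]) or height < ground[r, c]:
                ground[r, c] = height
        else:
            if np.isnan(top[r, c]) or height > top[r, c]:
                top[r, c] = height

    chm = top - ground                     # canopy height above ground
    return np.where(chm > 0, chm, 0.0)     # negatives and empty cells -> 0

# Tiny synthetic example: a flat 10 m x 10 m plot with vegetation up to ~12 m
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = rng.uniform(0, 10, 500)
is_ground = rng.random(500) < 0.5
z = np.where(is_ground, 0.0, rng.uniform(0.5, 12.0, 500))
print(canopy_height_model(x, y, z, is_ground).max())  # roughly 12 m
```

The point of the sketch is only the bookkeeping: once the returns are classified, canopy height is just the difference between the highest vegetation return and the local ground elevation in each cell.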

What has made this unending flood of data possible? The trivial answer is that the techniques have been perfected. But that would not be enough if those technical solutions were expensive. We can find an analogy in the economic world, where the marginal cost of certain services is approaching zero. This reduction of the marginal cost is what justifies increasing the scale of a business. The same principle explains how the cost of genomic sequencing has fallen at least tenfold in just a few years, and the trend continues and spreads to new applications. The case of territorial imagery is curious because, apparently, it is free for the user. But only apparently, because users also hand over information of their own, which goes on to thicken the 'big data'. In turn, some company finds that information valuable enough to pay for it. An interesting loop, which leads us to ask what the limit is to the use of such accumulating data, once economic incentives arise and there seem to be no insurmountable technical constraints.
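To see why that convergence matters, here is a small, purely illustrative calculation (the fixed and marginal costs are invented numbers, not real sequencing prices) showing how the average cost per sample approaches the marginal cost as the number of samples grows.

```python
# Illustrative only: the fixed and marginal costs below are invented numbers,
# chosen just to show the shape of the curve, not actual sequencing prices.
FIXED_COST = 1_000_000.0   # e.g. buying and maintaining the instrument
MARGINAL_COST = 100.0      # cost of processing one additional sample

def average_cost(n_samples: int) -> float:
    """Average cost per sample: (fixed + marginal * n) / n."""
    return (FIXED_COST + MARGINAL_COST * n_samples) / n_samples

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7} samples -> {average_cost(n):>10.2f} per sample")

# As n grows, the average cost approaches the marginal cost (100.0):
# the incentive to scale up comes precisely from this convergence.
```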

Nor is there any theoretical obstacle to genotyping every organism on Earth. As long as they let themselves be captured, of course.

These limits may be ethical, for example when intrusion into privacy benefits a third party. But the situation is not so simple, since the benefits can be mutual. All that information can be useful to companies, of course, but also to patients, when their doctors look for personalized treatments, or to land managers, when they want to monitor changes in the territory for the collective benefit. When the marginal cost of obtaining the data is not so small, however, complications appear. For example, studies based on previously published data (comparative meta-analyses, network analyses of interactions, global-scale models of environmental parameters, among others) are, unsurprisingly, proliferating at relatively low cost.

That poses problems. As human resources in science are limited, investment in obtaining new, high-quality basic information (field observations or experiments) fades away. The reward for publishing local studies or specific experiments decreases relative to that received for global studies, even though the number of data points may be laughably small and the inference questionable. A colleague recently explained how an article's prospects of publication and recognition grew once a colored world map was drawn from fewer than twenty data points. Attempts are being made to address this situation through admirable citizen-science initiatives in which motivated and trained people help to provide an abundant base of information. But obtaining these data needs to be well designed and coordinated, and the people involved must be sufficiently trained.

The reward for publishing local studies or specific experiments decreases relative to that received for global studies, even though the number of data points may be laughably small and the inference questionable.

However, the greatest limitation to this data inflation probably comes from our intellectual capacity to assimilate very detailed information about everything. If we were capable of it, we would not have invented science, because we would already understand the world intuitively. Analytical alternatives, such as the probabilistic use of information, imply some simplification. An example: in a recent article in Science, Benson and colleagues propose studying complex networks using a handful of modules that describe all possible connections between very few elements. Curiously, we find ourselves on a round trip in which we end up simplifying the huge amount of information collected. Of course, more rigorously and at a relatively affordable cost.
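As an illustration of this kind of simplification (a sketch in the same spirit, not a reproduction of Benson and colleagues' method), the snippet below uses the networkx library to summarize a toy directed network by its census of three-node connection patterns: the 'handful of modules' idea in its simplest form.

```python
import networkx as nx

# Illustrative toy network; any directed ecological or social network would do.
G = nx.DiGraph()
G.add_edges_from([
    ("plant", "herbivore"), ("herbivore", "predator"),
    ("plant", "pollinator"), ("pollinator", "plant"),
    ("predator", "scavenger"), ("herbivore", "scavenger"),
])

# The triad census classifies every group of three nodes into one of the
# 16 possible directed connection patterns, reducing the whole network
# to a short vector of counts over these small modules.
census = nx.triadic_census(G)

for pattern, count in sorted(census.items()):
    if count:
        print(f"{pattern}: {count}")
```

The whole network is thereby reduced to a short list of counts over a few possible patterns: a drastic but measurable simplification, which is exactly the point of the paragraph above.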

The Barcelona Supercomputing Center - National Supercomputing Center (BSC-CNS) houses the MareNostrum supercomputer, which reaches 110,000 billion operations per second. Author: Barcelona Supercomputing Center - National Supercomputing Center (BSC-CNS) (CC BY-ND 2.0)

However, the greatest limitation to this data inflation probably comes from our intellectual capacity to assimilate very detailed information about everything. If we were capable of it, we would not have invented science, because we would already understand the world intuitively.

Faced with the impossibility of explaining the contingency of every detail, and thus of reaching the reductionist panacea, holistic interpretations are reappearing after looking antiquated for decades, at least in ecology. That holism often recalls intuitive interpretations, in which the intellect processes information in a barely conscious way; colloquially, we would say without going into details. To make this holism minimally intelligible, we resort to concepts, such as information itself, that unfortunately cannot be measured directly by our senses. Biological evolution (another complex concept) has not prepared us much for this: it has trained us to perceive the size, weight, texture, color or warmth of objects. The advantage of 'big data' is that intuitive, holistic concepts such as complexity, which many ecologists had set aside, can now be measured in some precise way and thus be put to the test. It has been a long and entertaining journey in which the reductionist and holistic approaches seem able to shake hands, at least in ecology.

Finally, Ramon, I would like to recall J.L. Borges when he wrote about rigor in science:

"In that Empire, the Art of Cartography achieved such Perfection that the Map of a single Province occupied an entire City, and the Map of the Empire, a whole Province. In time, these Boundless Maps did not satisfy and the Cartographers raised a Map of the Empire, which had the Size of the Empire and concurred punctually with it. Less Adicted to the Study of Cartography, the following Generations understood that this extended Map was Useless and not without Impiety they exposed it to the Inclemencies of the Sun and the Winters. In the Western Deserts, Ripped Pieces of the Map remain, inhabited by Animals and Beggars; In the whole country there is no other relic of the Geographical Disciplines

Suárez Miranda: Viajes de varones prudentes, libro cuarto, cap. XLV, Lérida, 1658.”
