Open and synthetic data: New opportunities for industry

Column 03.12.2020

Latvia and Finland both aim to be champions in open data. Indeed, Latvia leads the Baltic region, while Finland excels in the Nordic region.

As the public sector produces large amounts of data from surveys, digital maps, and forestry-related information, and as researchers supplement them with synthetic datasets, new data-driven solutions and business models are expected to emerge.

According to the European Data Maturity Report by the EC, both Latvia and Finland are fast-trackers in the EU in advancing open data and are among the leaders in harvesting local and regional open data portals.

In 2016, Latvia's public sector and ICT sector signed a memorandum on becoming a data-driven nation. An open data portal started operating in 2017, and Latvia has had its own Open Data Strategy since last year. Latvia is also an active partner in the ideation process for the new European directives that will change the landscape in a few years.

But what if the data are not available and not enough datasets can be found to make an impact? There are some solutions available.

One of them is to generate synthetic data, for instance by creating images with software such as 3D game engines. Synthetic data is a hot topic in deep learning, a subfield of artificial intelligence that has produced impressive results in the last decade. Deep neural networks (DNNs) match or even exceed human-level performance on a broad range of tasks that used to be too challenging for computers, from playing chess to driving unmanned vehicles in complex environments.

However, these results come at a cost, as DNNs require prohibitively large amounts of data for learning.

That is why the leading players in the deep learning market, in both applied work and research, are corporations that have access to Big Data, such as Google, Facebook, Amazon, and Baidu.

And that is why smaller companies are at a disadvantage: even if their R&D experts come up with better algorithms, their DNN models are still likely to be outperformed by those of online giants that have access to nearly unlimited amounts of data.

That’s where synthetic data come into play: instead of collecting the dataset for training a DNN model online and then assiduously labelling it, one can just generate it.

Of course, synthetic data will differ from natural data: for instance, the images will look less authentic. However, one can make up for that by generating as many images as needed and modifying them at will. This approach is currently being explored at the Institute of Electronics and Computer Science (EDI), the highest-ranked scientific organization in the field of Engineering and Computer Science in Latvia.

For example, in a recent project, synthetic images of objects were used to train algorithms for the automation of industrial robot arms, which now use cameras to detect, pick, and manipulate randomly piled objects. Taking and annotating pictures of such objects and piles would be a tedious and time-consuming task; the researchers avoided it by generating the images instead.
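To make the idea concrete, here is a minimal sketch of why generated images need no manual annotation: whoever draws the object also knows exactly where it is. This is a toy illustration only, not the EDI pipeline; the function name make_synthetic_scene, the image size, and the use of plain rectangles instead of rendered 3D objects are all assumptions made for the example.

```python
import random


def make_synthetic_scene(width=64, height=64, n_objects=3, seed=None):
    """Render a toy grayscale scene and return (image, boxes).

    image is a list of rows of pixel intensities (0-255).
    Each box is (x0, y0, x1, y1) with x1 and y1 exclusive. The boxes are
    exact because they are recorded at the moment the object is drawn.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    boxes = []
    for _ in range(n_objects):
        w = rng.randint(8, 20)
        h = rng.randint(8, 20)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        shade = rng.randint(80, 255)
        # Later objects overwrite earlier ones, a crude stand-in for a pile.
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                image[y][x] = shade
        boxes.append((x0, y0, x0 + w, y0 + h))
    return image, boxes


# One hundred labelled scenes, generated without any manual annotation.
dataset = [make_synthetic_scene(seed=i) for i in range(100)]
```

A real system would replace the rectangle drawing with a renderer such as a 3D game engine, but the principle is the same: the label is a by-product of generation.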

Another application of synthetic data, recognizing and sorting plastic waste, is being developed as part of the VIZTA (Vision, Identification, with Z-sensing Technology and key Applications) Horizon 2020 project. Instead of taking many pictures of bottles, only a few were taken, and the dataset was then augmented with synthetic images of bottles.

Synthetic data are also being used for the design of self-driving cars in the EU-funded PRYSTINE (Programmable Systems for Intelligence in Automobiles) project.

DNN models are the main component of the navigation system of self-driving cars; to train them, one needs lots of carefully annotated images of street views. Such images are difficult and costly to acquire and annotate, yet much easier to generate. It is expected that training the DNNs of self-driving cars on mixed datasets (that is, datasets containing both authentic and synthetic images) will make them more accurate and therefore more reliable.
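A mixed dataset of this kind can be assembled in a few lines. The sketch below is an assumption-laden illustration rather than the method used in PRYSTINE: the function mix_datasets, the synthetic_fraction parameter, and the file names are hypothetical. It keeps every authentic sample and adds enough synthetic ones to reach a chosen share of the training set.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.5, seed=0):
    """Return a shuffled list of (sample, origin) pairs.

    All real samples are kept. Synthetic samples are added until they make
    up roughly synthetic_fraction of the result, or until they run out.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # n_syn / (n_real + n_syn) = f  implies  n_syn = f * n_real / (1 - f)
    wanted = round(synthetic_fraction * len(real) / (1.0 - synthetic_fraction))
    n_syn = min(wanted, len(synthetic))
    chosen = rng.sample(synthetic, n_syn)
    mixed = [(s, "real") for s in real] + [(s, "synthetic") for s in chosen]
    rng.shuffle(mixed)
    return mixed


# Hypothetical file names standing in for annotated street-view images.
real = [f"real_{i}.png" for i in range(20)]
synthetic = [f"synthetic_{i}.png" for i in range(200)]
train = mix_datasets(real, synthetic, synthetic_fraction=0.75)
```

Keeping the origin tag next to each sample makes it easy to evaluate a trained model on authentic images only, which is where its accuracy ultimately matters.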

We can see that rapid progress is possible when there is a clear vision, determination, and a belief that data is an important enabler for startups and well-established companies alike. Open data is a good tool for improving the efficiency and quality of the public and private sectors and for supporting data-based decision making.

Alise Barvika, Maksims Ivanovs and Roberts Kadikis

Investment and Development Agency of Latvia, Office in Helsinki

Institute of Electronics and Computer Science, Latvia