Crowdsourcing EO training datasets to improve cloud detection

Manual cloud classification app

The International Institute for Applied Systems Analysis (IIASA) has joined Sinergise in an initiative that engages the public in working with ESA’s Sentinel-2 satellite imagery. Together, we would like to improve EO training datasets to achieve better cloud detection and land cover algorithms.

Many algorithms exist for distinguishing clouds in multispectral satellite data. When they are used as part of land cover change detection, simple (but fast) algorithms often fail because they produce many false positives, which can sometimes have a significant impact on the end result. The ability to discriminate cloudy pixels is crucial for any automatic or semi-automatic solution that detects land change. We therefore decided to try to develop algorithms better suited to this purpose. Because we want our services to be available globally, we need a very large database of training samples, so we decided to ask the public for help.

To obtain a large resource of curated cloud classification samples, we used a number of tools developed at IIASA together with Sentinel Hub services, which provide fast access to the entire global archive of Sentinel-2 data.

How does it work?

It's actually really simple. The application shows you an image (e.g. 64x64 pixels) on which you delineate the different types of clouds (opaque, thick, and thin) with a paintbrush-like UI. The rest of the image is implicitly treated as cloud-free.
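
To make the idea of "implicitly cloud-free" concrete, here is a minimal Python sketch of how painted pixels might be turned into a per-pixel label mask. It is an illustration only, not the application's actual code: the numeric class codes, the function name `strokes_to_mask`, and the input format (a dictionary of painted pixel coordinates per cloud type) are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical numeric codes; the real application may encode classes differently.
CLEAR = 0  # every pixel the annotator did not paint
CLASS_IDS = {"opaque": 1, "thick": 2, "thin": 3}

Pixel = Tuple[int, int]  # (row, column)


def strokes_to_mask(strokes: Dict[str, Iterable[Pixel]], size: int = 64) -> List[List[int]]:
    """Build a size x size label mask from painted pixels.

    Pixels that were never painted keep the CLEAR code, which is the
    'implicitly cloud-free' rule. If two cloud types cover the same pixel,
    the type that appears later in `strokes` wins (an assumption of this sketch).
    """
    mask = [[CLEAR] * size for _ in range(size)]
    for label, pixels in strokes.items():
        class_id = CLASS_IDS[label]
        for row, col in pixels:
            # Ignore brush strokes that fall outside the image.
            if 0 <= row < size and 0 <= col < size:
                mask[row][col] = class_id
    return mask


def class_counts(mask: List[List[int]]) -> Dict[int, int]:
    """Count how many pixels carry each class code."""
    counts = {CLEAR: 0, **{class_id: 0 for class_id in CLASS_IDS.values()}}
    for row in mask:
        for value in row:
            counts[value] += 1
    return counts
```

For example, calling `strokes_to_mask({"opaque": [(0, 0), (0, 1)], "thin": [(1, 1)]})` would produce a 64x64 mask in which three pixels carry a cloud code and every other pixel is labelled cloud-free.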

Help us collect the data and start using the application! The resulting data will be made available through the Geopedia portal, both for exploring and downloading.

Collecting other datasets

The approach will also allow us to collect other datasets rapidly and efficiently in the future. For example, with a slightly modified configuration, a similar workflow could be used to obtain a manually curated land cover classification dataset, which could serve as training data for machine learning algorithms.