Vector segmentation masks#

Clouds float by, water flows on; in movement there is no grasping, in Chan there is no settling

For ๐Ÿง‘โ€๐Ÿซ supervised machine learning, labels ๐Ÿท๏ธ are needed in addition to the input image ๐Ÿ–ผ๏ธ. Here, weโ€™ll step through an example workflow on matching vector ๐Ÿš label data (points, lines, polygons) to ๐Ÿ›ฐ๏ธ Earth Observation data inputs. Specifically, this tutorial will cover:

  • Reading shapefiles ๐Ÿ“ directly from the web via pyogrio

  • Rasterizing vector polygons from a geopandas.GeoDataFrame to an xarray.DataArray

  • Pairing ๐Ÿ›ฐ๏ธ satellite images with the rasterized label masks and feeding them into a DataLoader

๐ŸŽ‰ Getting started#

These are the tools ๐Ÿ› ๏ธ youโ€™ll need.

import matplotlib.pyplot as plt
import numpy as np
import planetary_computer
import pyogrio
import pystac
import torch
import torchdata
import xarray as xr
import zen3geo

0๏ธโƒฃ Find cloud-hosted raster and vector data โ›ณ#

In this case study, weโ€™ll look at the flood water extent over Johor, Malaysia ๐Ÿ‡ฒ๐Ÿ‡พ on 15 Dec 2019 that were digitized by ๐Ÿ‡บ๐Ÿ‡ณ UNITAR-UNOSATโ€™s rapid mapping service over Synthetic Aperture Radar (SAR) ๐Ÿ›ฐ๏ธ images. Specifically, weโ€™ll be using the ๐Ÿ‡ช๐Ÿ‡บ Sentinel-1 Ground Range Detected (GRD) productโ€™s VV polarization channel.

๐Ÿ”— Links:

To start, letโ€™s get the ๐Ÿ›ฐ๏ธ satellite scene weโ€™ll be using for this tutorial.

item_url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-1-grd/items/S1A_IW_GRDH_1SDV_20191215T224757_20191215T224822_030365_037955"

# Load the individual item metadata and sign the assets
item = pystac.Item.from_file(item_url)
signed_item = planetary_computer.sign(item)
signed_item

This is how the Sentinel-1 ๐Ÿฉป image looks like over Johor in Peninsular Malaysia on 15 Dec 2019.

Sentinel-1 GRD image over Johor, Malaysia on 20191215

Load and reproject image data ๐Ÿ”„#

To keep things simple, weโ€™ll load just the VV channel into a DataPipe via zen3geo.datapipes.RioXarrayReader (functional name: read_from_rioxarray) ๐Ÿ˜€.

url = signed_item.assets["vv"].href
dp = torchdata.datapipes.iter.IterableWrapper(iterable=[url])
# Reading lower resolution grid using overview_level=3
dp_rioxarray = dp.read_from_rioxarray(overview_level=3)
dp_rioxarray
RioXarrayReaderIterDataPipe

The Sentinel-1 image from Planetary Computer comes in longitude/latitude ๐ŸŒ geographic coordinates by default (OGC:CRS84). To make the pixels more equal ๐Ÿ”ฒ area, we can project it to a ๐ŸŒ local projected coordinate system instead.

def reproject_to_local_utm(dataarray: xr.DataArray, resolution: float=80.0) -> xr.DataArray:
    """
    Reproject an xarray.DataArray grid from OGC:CRS84 to a local UTM coordinate
    reference system.
    """
    # Estimate UTM coordinate reference from a single pixel
    pixel = dataarray.isel(y=slice(0, 1), x=slice(0,1))
    new_crs = dataarray.rio.reproject(dst_crs="OGC:CRS84").rio.estimate_utm_crs()

    return dataarray.rio.reproject(dst_crs=new_crs, resolution=resolution)
dp_reprojected = dp_rioxarray.map(fn=reproject_to_local_utm)

Note

Universal Transverse Mercator (UTM) isnโ€™t actually an equal-area projection system. However, Sentinel-1 ๐Ÿ›ฐ๏ธ satellite scenes from Copernicus are usually distributed in a UTM coordinate reference system, and UTM is typically a close enough ๐Ÿค approximation to the local geographic area, or at least it wonโ€™t matter much when weโ€™re looking at spatial resolutions over several 10s of metres ๐Ÿ™‚.

Hint

For those wondering what OGC:CRS84 is, it is the longitude/latitude version of EPSG:4326 ๐ŸŒ (latitude/longitude). I.e., itโ€™s a matter of axis order, with OGC:CRS84 being x/y and EPSG:4326 being y/x.

๐Ÿ”– References:

Transform and visualize raster data ๐Ÿ”Ž#

Letโ€™s visualize ๐Ÿ‘€ the Sentinel-1 image, but before that, weโ€™ll transform ๐Ÿ”„ the VV data from linear to decibel scale.

def linear_to_decibel(dataarray: xr.DataArray) -> xr.DataArray:
    """
    Transforming the input xarray.DataArray's VV or VH values from linear to
    decibel scale using the formula ``10 * log_10(x)``.
    """
    # Mask out areas with 0 so that np.log10 is not undefined
    da_linear = dataarray.where(cond=dataarray != 0)
    da_decibel = 10 * np.log10(da_linear)
    return da_decibel
dp_decibel = dp_reprojected.map(fn=linear_to_decibel)
dp_decibel
MapperIterDataPipe

As an aside, weโ€™ll be using the Sentinel-1 image datapipe twice later, once as a template to create a blank canvas ๐ŸŽž๏ธ, and another time by itself ๐Ÿชž. This requires forking ๐Ÿด the DataPipe into two branches, which can be achieved using torchdata.datapipes.iter.Forker (functional name: fork).

dp_decibel_canvas, dp_decibel_image = dp_decibel.fork(num_instances=2)
dp_decibel_canvas, dp_decibel_image
(_ChildDataPipe, _ChildDataPipe)

Now to visualize the transformed Sentinel-1 image ๐Ÿ–ผ๏ธ. Letโ€™s zoom in ๐Ÿ”ญ to one of the analysis extent areas weโ€™ll be working on later.

it = iter(dp_decibel_image)
dataarray = next(it)

da_clip = dataarray.rio.clip_box(minx=371483, miny=190459, maxx=409684, maxy=229474)
da_clip.isel(band=0).plot.imshow(figsize=(11.5, 9), cmap="Blues_r", vmin=18, vmax=26)
<matplotlib.image.AxesImage at 0x7f9b7f559f10>
_images/db674069b1dcd8beea4ba5efb98c10b9a8ccc9b24116fb2372a15f32314b73a3.png

Notice how the darker blue areas ๐Ÿ”ต tend to correlate more with water features like the meandering rivers and the ๐Ÿš sea on the NorthEast. This is because the SAR ๐Ÿ›ฐ๏ธ signal which is side looking reflects off flat water bodies like a mirror ๐Ÿชž, with little energy getting reflected ๐Ÿ™… back directly to the sensor (hence why it looks darker โšซ).

Load and visualize cloud-hosted vector files ๐Ÿ’ #

Letโ€™s now load some vector data from the web ๐Ÿ•ธ๏ธ. These are polygons of the segmented ๐ŸŒŠ water extent digitized by UNOSATโ€™s AI Based Rapid Mapping Service. Weโ€™ll be converting these vector polygons to ๐ŸŒˆ raster masks later.

๐Ÿ”— Links:

# https://gdal.org/user/virtual_file_systems.html#vsizip-zip-archives
shape_url = "/vsizip/vsicurl/https://unosat-maps.web.cern.ch/MY/FL20191217MYS/FL20191217MYS_SHP.zip/ST1_20191215_WaterExtent_Johor_AOI2.shp"

This is a shapefile containing ๐Ÿ”ท polygons of the mapped water extent. Letโ€™s put it into a DataPipe called zen3geo.datapipes.PyogrioReader (functional name: read_from_pyogrio).

dp_shapes = torchdata.datapipes.iter.IterableWrapper(iterable=[shape_url])
dp_pyogrio = dp_shapes.read_from_pyogrio()
dp_pyogrio
PyogrioReaderIterDataPipe

This will take care of loading the shapefile into a geopandas.GeoDataFrame object. Letโ€™s take a look at the data table ๐Ÿ“Š to see what attributes are inside.

it = iter(dp_pyogrio)
geodataframe = next(it)
geodataframe.dropna(axis="columns")
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/pyogrio/geopandas.py:49: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  res = pd.to_datetime(ser, **datetime_kwargs)
Water_Clas Sensor_ID Sensor_Dat Area_m2 Area_ha StaffID EventCode geometry
0 Flood Water Sentinel-1 2019-12-15 2.178154e+07 2178.154127 JT FL20191217MYS MULTIPOLYGON (((104.01442 1.72298, 104.01335 1...

Cool, and we can also visualize the polygons ๐Ÿ”ท on a 2D map. To align the coordinates with the ๐Ÿ›ฐ๏ธ Sentinel-1 image above, weโ€™ll first use geopandas.GeoDataFrame.to_crs() to reproject the vector from ๐ŸŒ EPSG:9707 (WGS 84 + EGM96 height, latitude/longitude) to ๐ŸŒ EPSG:32648 (UTM Zone 48N).

print(f"Original bounds in EPSG:9707:\n{geodataframe.bounds}")
gdf = geodataframe.to_crs(crs="EPSG:32648")
print(f"New bounds in EPSG:32648:\n{gdf.bounds}")
Original bounds in EPSG:9707:
         minx      miny        maxx      maxy
0  103.844605  1.722825  104.187997  2.075693
New bounds in EPSG:32648:
            minx           miny           maxx           maxy
0  371483.597954  190459.328366  409683.869979  229473.310892

Plot it with geopandas.GeoDataFrame.plot(). This vector map ๐Ÿ—บ๏ธ should correspond to the zoomed in Sentinel-1 image plotted earlier above.

gdf.plot(figsize=(11.5, 9))
<Axes: >
_images/2d2d3978e1cc405676c7aafdbbf954ca4c104777257f9ee91bf5b458735bf307.png

Tip

Make sure to understand your raster and vector datasets well first! Open the files up in your favourite ๐ŸŒ Geographic Information System (GIS) tool, see how they actually look like spatially. Then youโ€™ll have a better idea to decide on how to create your data pipeline. The zen3geo way puts you as the Master ๐Ÿง™ in control.

1๏ธโƒฃ Create a canvas to paint on ๐ŸŽจ#

In this section, weโ€™ll work on converting the flood water ๐ŸŒŠ polygons above from a ๐Ÿšฉ vector to a ๐ŸŒˆ raster format, i.e. rasterization. This will be done in two steps ๐Ÿ“ถ:

  1. Defining a blank canvas ๐ŸŽž๏ธ

  2. Paint the polygons onto this blank canvas ๐Ÿง‘โ€๐ŸŽจ

For this, weโ€™ll be using tools from zen3geo.datapipes.datashader(). Letโ€™s see how this can be done.

Blank canvas from template raster ๐Ÿ–ผ๏ธ#

A canvas represents a 2D area with a height and a width ๐Ÿ“. For us, weโ€™ll be using a datashader.Canvas, which also defines the range of y-values (ymin to ymax) and x-values (xmin to xmax), essentially coordinates for every unit ๐Ÿ‡พ height and ๐Ÿ‡ฝ width.

Since we already have a Sentinel-1 ๐Ÿ›ฐ๏ธ raster grid with defined height/width and y/x coordinates, letโ€™s use it as a ๐Ÿ“„ template to define our canvas. This is done via zen3geo.datapipes.XarrayCanvas (functional name: canvas_from_xarray).

dp_canvas = dp_decibel_canvas.canvas_from_xarray()
dp_canvas
XarrayCanvasIterDataPipe

Cool, and hereโ€™s a quick inspection ๐Ÿ‘€ of the canvas dimensions and metadata.

it = iter(dp_canvas)
canvas = next(it)
print(f"Canvas height: {canvas.plot_height}, width: {canvas.plot_width}")
print(f"Y-range: {canvas.y_range}")
print(f"X-range: {canvas.x_range}")
print(f"Coordinate reference system: {canvas.crs}")
Canvas height: 2743, width: 3538
Y-range: (95200.47858355992, 314640.4785835599)
X-range: (247773.19092160053, 530813.1909216006)
Coordinate reference system: EPSG:32648

This information should match the template Sentinel-1 dataarray ๐Ÿ.

print(f"Dimensions: {dict(dataarray.sizes)}")
print(f"Affine transform: {dataarray.rio.transform()}")
print(f"Bounding box: {dataarray.rio.bounds()}")
print(f"Coordinate reference system: {dataarray.rio.crs}")
Dimensions: {'band': 1, 'y': 2743, 'x': 3538}
Affine transform: | 80.00, 0.00, 247773.19|
| 0.00,-80.00, 314640.48|
| 0.00, 0.00, 1.00|
Bounding box: (247773.19092160053, 95200.47858355992, 530813.1909216006, 314640.4785835599)
Coordinate reference system: EPSG:32648

Rasterize vector polygons onto canvas ๐Ÿ–Œ๏ธ#

Nowโ€™s the time to paint or rasterize the vector geopandas.GeoDataFrame polygons ๐Ÿ”ท onto the blank datashader.Canvas! This would enable us to have a direct pixel-wise X -> Y mapping โ†”๏ธ between the Sentinel-1 image (X) and target flood label (Y).

The vector polygons can be rasterized or painted ๐Ÿ–Œ๏ธ onto the template canvas using zen3geo.datapipes.DatashaderRasterizer (functional name: rasterize_with_datashader).

dp_datashader = dp_canvas.rasterize_with_datashader(vector_datapipe=dp_pyogrio)
dp_datashader
DatashaderRasterizerIterDataPipe

This will turn the vector geopandas.GeoDataFrame into a raster xarray.DataArray grid, with the spatial coordinates and bounds matching exactly with the template Sentinel-1 image ๐Ÿ˜Ž.

Note

Since we have just one Sentinel-1 ๐Ÿ›ฐ๏ธ image and one raster ๐Ÿ’ง flood mask, we have an easy 1:1 mapping. There are two other scenarios supported by zen3geo.datapipes.DatashaderRasterizer:

  1. N:1 - Many datashader.Canvas objects to one vector geopandas.GeoDataFrame. The single vector geodataframe will be broadcasted to match the length of the canvas list. This is useful for situations when you have a ๐ŸŒ โ€˜globalโ€™ vector database that you want to pair with multiple ๐Ÿ›ฐ๏ธ satellite images.

  2. N:N - Many datashader.Canvas objects to many vector geopandas.GeoDataFrame objects. In this case, the list of grids must โ— have the same length as the list of vector geodataframes. E.g. if you have 5 grids, there must also be 5 vector files. This is so that a 1:1 pairing can be done, useful when each raster tile ๐Ÿ–ฝ has its own associated vector annotation.

See also

For more details on how rasterization of polygons work behind the scenes ๐ŸŽฆ, check out Datashaderโ€™s documentation on:

2๏ธโƒฃ Combine and conquer โš”๏ธ#

So far, weโ€™ve got two datapipes that should be ๐Ÿง‘โ€๐Ÿคโ€๐Ÿง‘ paired up in an X -> Y manner:

  1. The pre-processed Sentinel-1 ๐ŸŒˆ raster image in dp_decibel_image

  2. The rasterized ๐Ÿ’ง flood segmentation masks in dp_datashader

One way to get these two pieces in a Machine Learning ready chip format is via a stack, slice and split โ„ข๏ธ approach. Think of it like a sandwich ๐Ÿฅช, we first stack the bread ๐Ÿž and lettuce ๐Ÿฅฌ, and then slice the pieces ๐Ÿ• through the layers once. Ok, that was a bad analogy, letโ€™s just stick with tensors ๐Ÿคช.

Stacking the raster layers ๐Ÿฅž#

Each of our ๐ŸŒˆ raster inputs are xarray.DataArray objects with the same spatial resolution and extent ๐ŸชŸ, so these can be stacked into an xarray.Dataset with multiple data variables. First, weโ€™ll zip ๐Ÿค the two datapipes together using torchdata.datapipes.iter.Zipper (functional name: zip)

dp_zip = dp_decibel_image.zip(dp_datashader)
dp_zip
ZipperIterDataPipe

This will result in a DataPipe where each item is a tuple of (X, Y) pairs ๐Ÿง‘โ€๐Ÿคโ€๐Ÿง‘. Just to illustrate what weโ€™ve done so far, we can use torchdata.datapipes.utils.to_graph to visualize the data pipeline โ›“๏ธ.

torchdata.datapipes.utils.to_graph(dp=dp_zip)
_images/3dbca6f080f946071988edb188cd1058a1b806bae63606662b8e4cea3c70672d.svg

Next, letโ€™s combine ๐Ÿ–‡๏ธ the two (X, Y) xarray.DataArray objects in the tuple into an xarray.Dataset using torchdata.datapipes.iter.Collator (functional name: collate). Weโ€™ll also โœ‚๏ธ clip the dataset to a bounding box area where the target water mask has no 0 or NaN values.

def xr_collate_fn(image_and_mask: tuple) -> xr.Dataset:
    """
    Combine a pair of xarray.DataArray (image, mask) inputs into an
    xarray.Dataset with two data variables named 'image' and 'mask'.
    """
    # Turn 2 xr.DataArray objects into 1 xr.Dataset with multiple data vars
    image, mask = image_and_mask
    dataset: xr.Dataset = xr.merge(
        objects=[image.isel(band=0).rename("image"), mask.rename("mask")],
        join="override",
    )

    # Clip dataset to bounding box extent of where labels are
    mask_extent: tuple = mask.where(cond=mask == 1, drop=True).rio.bounds()
    clipped_dataset: xr.Dataset = dataset.rio.clip_box(*mask_extent)

    return clipped_dataset
dp_dataset = dp_zip.collate(collate_fn=xr_collate_fn)
dp_dataset
CollatorIterDataPipe

Double check to see that resulting xarray.Datasetโ€™s image and mask looks ok ๐Ÿ™†โ€โ™‚๏ธ.

it = iter(dp_dataset)
dataset = next(it)

# Create subplot with VV image on the left and Water mask on the right
fig, axs = plt.subplots(ncols=2, figsize=(11.5, 4.5), sharey=True)
dataset.image.plot.imshow(ax=axs[0], cmap="Blues_r")
axs[0].set_title("Sentinel-1 VV channel")
dataset.mask.plot.imshow(ax=axs[1], cmap="Blues")
axs[1].set_title("Water mask")
plt.show()
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/combining.py:333: UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
  warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/pyogrio/geopandas.py:49: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  res = pd.to_datetime(ser, **datetime_kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/pyogrio/geopandas.py:49: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  res = pd.to_datetime(ser, **datetime_kwargs)
_images/8992c9e2269903610294c7187f427ef224fda9c26a69a527d64cbfc837574d7a.png

Slice into chips and turn into tensors ๐Ÿ—ก๏ธ#

To cut ๐Ÿ”ช the xarray.Dataset into 128x128 sized chips, weโ€™ll use zen3geo.datapipes.XbatcherSlicer (functional name: slice_with_xbatcher). Refer to Chipping and batching data if you need a ๐Ÿง‘โ€๐ŸŽ“ refresher.

dp_xbatcher = dp_dataset.slice_with_xbatcher(input_dims={"y": 128, "x": 128})
dp_xbatcher
XbatcherSlicerIterDataPipe

Next step is to convert the 128x128 chips into a torch.Tensor via torchdata.datapipes.iter.Mapper (functional name: map). The ๐Ÿ›ฐ๏ธ Sentinel-1 image and ๐Ÿ’ง water mask will be split out at this point too.

def dataset_to_tensors(chip: xr.Dataset) -> (torch.Tensor, torch.Tensor):
    """
    Converts an xarray.Dataset into to two torch.Tensor objects, the first one
    being the satellite image, and the second one being the target mask.
    """
    image: torch.Tensor = torch.as_tensor(chip.image.data)
    mask: torch.Tensor = torch.as_tensor(chip.mask.data.astype("uint8"))

    return image, mask
dp_map = dp_xbatcher.map(fn=dataset_to_tensors)
dp_map
MapperIterDataPipe

At this point, we could do some batching and collating, but weโ€™ll point you again to Chipping and batching data to figure it out ๐Ÿ˜. Letโ€™s take a look at a graph of the complete data pipeline.

torchdata.datapipes.utils.to_graph(dp=dp_map)
_images/497f2dd3cf01a62e1fb7aa648a0498c0f6ec88e4ba22b52ddc28c73ad02f1c4d.svg

Sweet, time for the final step โฉ.

Into a DataLoader ๐Ÿ‹๏ธ#

Pass the DataPipe into torch.utils.data.DataLoader ๐Ÿคพ!

dataloader = torch.utils.data.DataLoader(dataset=dp_map)
for i, batch in enumerate(dataloader):
    image, mask = batch
    print(f"Batch {i} - image: {image.shape}, mask: {mask.shape}")
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/combining.py:333: UserWarning: Some child DataPipes are not exhausted when __iter__ is called. We are resetting the buffer and each child DataPipe will read from the start again.
  warnings.warn("Some child DataPipes are not exhausted when __iter__ is called. We are resetting "
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/pyogrio/geopandas.py:49: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  res = pd.to_datetime(ser, **datetime_kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/latest/lib/python3.11/site-packages/pyogrio/geopandas.py:49: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  res = pd.to_datetime(ser, **datetime_kwargs)
Batch 0 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 1 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 2 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 3 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 4 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 5 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 6 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 7 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])
Batch 8 - image: torch.Size([1, 128, 128]), mask: torch.Size([1, 128, 128])

Now go train some flood water detection models ๐ŸŒŠ๐ŸŒŠ๐ŸŒŠ