Walkthrough#

To get it, you first see it, and then let it go

In this tutorial 🧑‍🏫, we’ll step through an Earth Observation 🛰️ data pipeline that uses torchdata. By the end of this lesson, you should be able to:

  • Find Cloud-Optimized GeoTIFFs (COGs) from STAC catalogs 🥞

  • Construct a DataPipe that iteratively reads several COGs in a stream 🌊

  • Loop through batches of images in a DataPipe with a DataLoader 🏋️

🎉 Getting started#

These are the tools 🛠️ you’ll need.

# Geospatial libraries
import pystac
import planetary_computer
import rioxarray
# Deep Learning libraries
import torch
import torchdata
import zen3geo

Just to make sure we’re on the same page 📃, let’s check that we’ve got compatible versions installed.

print(f"pystac version: {pystac.__version__}")
print(f"planetary-computer version: {planetary_computer.__version__}")
print(f"torch version: {torch.__version__}")

print(f"torchdata version: {torchdata.__version__}")
print(f"zen3geo version: {zen3geo.__version__}")
rioxarray.show_versions()
pystac version: 1.6.1
planetary-computer version: 0.4.7
torch version: 1.12.1+cu102
torchdata version: 0.4.1
zen3geo version: 0.4.0
rioxarray (0.12.0) deps:
  rasterio: 1.3.2
    xarray: 2022.6.0
      GDAL: 3.5.1
      GEOS: 3.10.2
      PROJ: 9.0.1
 PROJ DATA: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.4.0/lib/python3.10/site-packages/rasterio/proj_data
 GDAL DATA: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.4.0/lib/python3.10/site-packages/rasterio/gdal_data

Other python deps:
     scipy: 1.9.1
    pyproj: 3.3.1

System:
    python: 3.10.4 (main, Jun  1 2022, 20:56:54) [GCC 11.2.0]
executable: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.4.0/bin/python
   machine: Linux-5.15.0-1004-aws-x86_64-with-glibc2.35

0️⃣ Find Cloud-Optimized GeoTIFFs 🗺️#

Let’s get some optical satellite data using STAC! How about Sentinel-2 L2A data over Singapore 🇸🇬?

item_url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20220115T032101_R118_T48NUG_20220115T170435"

# Load the individual item metadata and sign the assets
item = pystac.Item.from_file(item_url)
signed_item = planetary_computer.sign(item)
signed_item

Inspect one of the data assets 🍱#

The Sentinel-2 STAC item contains several assets. These include different 🌈 bands (e.g. ‘B02’, ‘B03’, ‘B04’). Let’s just use the ‘visual’ asset for now, which contains the true-colour RGB bands.

url: str = signed_item.assets["visual"].href
da = rioxarray.open_rasterio(filename=url)
da
<xarray.DataArray (band: 3, y: 10980, x: 10980)>
[361681200 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05 4.098e+05
  * y            (y) float64 2e+05 2e+05 2e+05 ... 9.026e+04 9.026e+04 9.024e+04
    spatial_ref  int64 0
Attributes:
    _FillValue:    0.0
    scale_factor:  1.0
    add_offset:    0.0

This is how the Sentinel-2 image looks over Singapore on 15 Jan 2022.

Sentinel-2 image over Singapore on 20220115

1️⃣ Construct DataPipe 📡#

A torch DataPipe is a way of composing data-loading operations (composition rather than inheritance). If that sounds abstract, the hands-on examples below should make it clearer.

Create an Iterable 📏#

Start by wrapping a list of URLs to the Cloud-Optimized GeoTIFF files. We only have 1 item so we’ll use [url], but if you have more, you can do [url1, url2, url3], etc. Pass this iterable list into torchdata.datapipes.iter.IterableWrapper:

dp = torchdata.datapipes.iter.IterableWrapper(iterable=[url])
dp
IterableWrapperIterDataPipe

The dp variable is the DataPipe! Now to apply some more transformations/functions on it.
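Under the hood, an `IterableWrapper` simply holds on to the iterable and yields its elements lazily when iterated. Here’s a toy pure-Python sketch of that behaviour (an illustration only, not torchdata’s actual implementation):

```python
class SimpleIterableWrapper:
    """Toy stand-in for torchdata's IterableWrapper (illustration only)."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Yield each element lazily, the way an IterDataPipe streams data
        for element in self.iterable:
            yield element


dp_toy = SimpleIterableWrapper(iterable=["url1", "url2"])
print(list(dp_toy))  # ['url1', 'url2']
```

The real `IterableWrapper` adds extras like deep-copying the input to avoid accidental mutation, but the streaming idea is the same.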

Read using RioXarrayReader 🌐#

This is where ☯ zen3geo comes in. We’ll be using the zen3geo.datapipes.rioxarray.RioXarrayReaderIterDataPipe class, or rather, the short alias zen3geo.datapipes.RioXarrayReader.

There are two ways of applying RioXarrayReader: a class-based form and a functional form.

# Using class constructors
dp_rioxarray = zen3geo.datapipes.RioXarrayReader(source_datapipe=dp)
dp_rioxarray
RioXarrayReaderIterDataPipe
# Using functional form (recommended)
dp_rioxarray = dp.read_from_rioxarray()
dp_rioxarray
RioXarrayReaderIterDataPipe

Both forms are equivalent (they produce the same IterDataPipe output), but the functional form is preferred; see also https://pytorch.org/data/0.4/tutorial.html#registering-datapipes-with-the-functional-api
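Why does `dp.read_from_rioxarray()` work when `IterableWrapper` never defined that method? torchdata registers each DataPipe class under a functional name via a decorator. Here’s a simplified pure-Python sketch of the idea (toy names, not torchdata’s actual code):

```python
class IterDataPipe:
    """Toy base class; the real one lives in torch.utils.data."""

    _functional = {}  # registry mapping functional names to DataPipe classes

    def __getattr__(self, name):
        # Resolve a registered functional name into a chainable constructor
        if name in self._functional:
            cls = self._functional[name]
            return lambda *args, **kwargs: cls(self, *args, **kwargs)
        raise AttributeError(name)


def functional_datapipe(name):
    """Register a DataPipe class under a chainable functional name."""
    def decorator(cls):
        IterDataPipe._functional[name] = cls
        return cls
    return decorator


@functional_datapipe("read_from_rioxarray")
class ToyReader(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs):
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs


# Both forms construct the same kind of pipe
pipe = IterDataPipe()
reader = pipe.read_from_rioxarray(overview_level=3)
print(type(reader).__name__, reader.kwargs)  # ToyReader {'overview_level': 3}
```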

What if you don’t want the whole Sentinel-2 scene at the full 10m resolution? Since we’re using Cloud-Optimized GeoTIFFs, you could set an overview_level (following https://corteva.github.io/rioxarray/stable/examples/COG.html).

dp_rioxarray_zoom3 = dp.read_from_rioxarray(overview_level=3)
dp_rioxarray_zoom3
RioXarrayReaderIterDataPipe

Extra keyword arguments will be handled by rioxarray.open_rasterio() or rasterio.open().

Note

Other DataPipe classes/functions can be stacked or joined to this basic GeoTIFF reader. For example, clipping by bounding box or reprojecting to a certain Coordinate Reference System. If you would like to implement this, check out the Contributing Guidelines to get started!

2️⃣ Loop through DataPipe ⚙️#

A DataPipe describes a flow of information. Through a series of steps it goes, as one piece comes in, another might follow.

Basic iteration ♻️#

At the most basic level, you could iterate through the DataPipe like so:

it = iter(dp_rioxarray_zoom3)
dataarray = next(it)
dataarray
<xarray.DataArray (band: 3, y: 687, x: 687)>
[1415907 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 3.001e+05 3.002e+05 ... 4.096e+05 4.097e+05
  * y            (y) float64 2e+05 1.998e+05 1.996e+05 ... 9.048e+04 9.032e+04
    spatial_ref  int64 0
Attributes:
    _FillValue:    0.0
    scale_factor:  1.0
    add_offset:    0.0

Or if you’re more familiar with a for-loop, here it is:

for dataarray in dp_rioxarray_zoom3:
    print(dataarray)
    # Run model on this data batch
StreamWrapper<<xarray.DataArray (band: 3, y: 687, x: 687)>
[1415907 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 3.001e+05 3.002e+05 ... 4.096e+05 4.097e+05
  * y            (y) float64 2e+05 1.998e+05 1.996e+05 ... 9.048e+04 9.032e+04
    spatial_ref  int64 0
Attributes:
    _FillValue:    0.0
    scale_factor:  1.0
    add_offset:    0.0>

Into a DataLoader 🏋️#

For the deep learning folks, there’s one extra step: the xarray.DataArray needs to be converted to a tensor. In the PyTorch world, that can happen via torch.as_tensor().

def fn(da):
    # Pull out the DataArray's underlying numpy array and convert it to a tensor
    return torch.as_tensor(da.data)
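As a quick sanity check of what `torch.as_tensor()` does, here’s a small example using a dummy numpy array as a stand-in for `da.data` (the DataArray’s underlying numpy array converts the same way):

```python
import numpy as np
import torch

# Dummy stand-in for da.data: 3 bands of 2x2 pixels
arr = np.arange(12, dtype=np.uint8).reshape(3, 2, 2)
tensor = torch.as_tensor(arr)
print(tensor.dtype, tuple(tensor.shape))  # dtype and shape are preserved
```

Note that `torch.as_tensor()` shares memory with the numpy array where possible, so no copy is made here.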

Using torchdata.datapipes.iter.Mapper (functional name: map), we’ll apply the tensor conversion function to each dataarray in the DataPipe.

dp_tensor = dp_rioxarray_zoom3.map(fn=fn)
dp_tensor
MapperIterDataPipe
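Conceptually, `Mapper` wraps the upstream DataPipe and applies the function to each element as it streams through, rather than transforming everything eagerly up front. A toy pure-Python sketch of that behaviour (illustration only):

```python
class SimpleMapper:
    """Toy stand-in for torchdata's Mapper: applies fn to each element lazily."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        # Transform each element only when it is actually requested
        for element in self.source:
            yield self.fn(element)


doubled = SimpleMapper(source=[1, 2, 3], fn=lambda x: x * 2)
print(list(doubled))  # [2, 4, 6]
```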

Finally, let’s put our DataPipe into a torch.utils.data.DataLoader!

dataloader = torch.utils.data.DataLoader(dataset=dp_tensor)
for batch in dataloader:
    tensor = batch
    print(tensor)
tensor([[[[ 46,  29,  34,  ..., 241, 246, 255],
          [ 83,  73,  66,  ..., 248, 255, 251],
          [ 53,  43,  55,  ..., 246, 247, 243],
          ...,
          [101, 101, 104,  ...,  78, 179,  83],
          [ 99, 103, 105,  ...,  68,  60,  45],
          [ 95, 103, 102,  ...,  50,  34,  42]],

         [[ 58,  24,  44,  ..., 255, 255, 255],
          [ 60,  51,  57,  ..., 255, 255, 254],
          [ 47,  22,  47,  ..., 255, 255, 255],
          ...,
          [110, 111, 114,  ...,  95, 189,  87],
          [110, 113, 113,  ...,  85,  62,  48],
          [108, 112, 112,  ...,  62,  60,  62]],

         [[ 42,  22,  29,  ..., 255, 255, 255],
          [ 43,  41,  39,  ..., 255, 255, 254],
          [ 35,  30,  37,  ..., 255, 255, 255],
          ...,
          [ 82,  82,  83,  ...,  74, 174,  57],
          [ 82,  84,  84,  ...,  57,  46,  28],
          [ 80,  83,  82,  ...,  37,  31,  31]]]], dtype=torch.uint8)
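Notice the extra outermost dimension in the printed tensor: with the default `batch_size=1`, the DataLoader’s collate step stacks each element into a batch, prepending a batch dimension. A minimal sketch with a dummy tensor (assuming only `torch` is installed):

```python
import torch

# A dataset with one dummy "image" tensor (3 bands, 2x2 pixels)
loader = torch.utils.data.DataLoader(dataset=[torch.zeros(3, 2, 2)])
batch = next(iter(loader))
print(tuple(batch.shape))  # default batch_size=1 prepends a batch dimension
```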

And so it begins 🌄


That’s all 🎉! For more information on how to use DataPipes, check out the torchdata documentation at https://pytorch.org/data.

If you have any questions 🙋, feel free to ask us anything at https://github.com/weiji14/zen3geo/discussions or visit the PyTorch forums at https://discuss.pytorch.org/c/data/37.

Cheers!