Walkthrough
Contents
Walkthrough#
To get it, you first see it, and then let it go
In this tutorial 🧑🏫, we’ll step through an Earth Observation 🛰️ data pipeline
using torchdata
and by the end of this lesson, you should be able to:
Find Cloud-Optimized GeoTIFFs (COGs) from STAC catalogs 🥞
Construct a DataPipe that iteratively reads several COGs in a stream 🌊
Loop through batches of images in a DataPipe with a DataLoader 🏋️
🎉 Getting started#
These are the tools 🛠️ you’ll need.
# Geospatial libraries
import pystac
import planetary_computer
import rioxarray
# Deep Learning libraries
import torch
import torchdata
import zen3geo
Just to make sure we’re on the same page 📃, let’s check that we’ve got compatible versions installed.
print(f"pystac version: {pystac.__version__}")
print(f"planetary-computer version: {planetary_computer.__version__}")
print(f"torch version: {torch.__version__}")
print(f"torchdata version: {torchdata.__version__}")
print(f"zen3geo version: {zen3geo.__version__}")
rioxarray.show_versions()
pystac version: 1.4.0
planetary-computer version: 0.4.6
torch version: 1.12.0+cu102
torchdata version: 0.4.0
zen3geo version: 0.2.0+aa8b472
rioxarray (0.11.1) deps:
rasterio: 1.3.0
xarray: 2022.3.0
GDAL: 3.5.0
GEOS: 3.10.2
PROJ: 9.0.0
PROJ DATA: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.2.0/lib/python3.10/site-packages/rasterio/proj_data
GDAL DATA: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.2.0/lib/python3.10/site-packages/rasterio/gdal_data
Other python deps:
scipy: None
pyproj: 3.3.1
System:
python: 3.10.4 (main, Jun 1 2022, 20:56:54) [GCC 11.2.0]
executable: /home/docs/checkouts/readthedocs.org/user_builds/zen3geo/envs/v0.2.0/bin/python
machine: Linux-5.15.0-1004-aws-x86_64-with-glibc2.35
0️⃣ Find Cloud-Optimized GeoTIFFs 🗺️#
Let’s get some optical satellite data using STAC! How about Sentinel-2 L2A data over Singapore 🇸🇬?
🔗 Links:
item_url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20220115T032101_R118_T48NUG_20220115T170435"
# Load the individual item metadata and sign the assets
item = pystac.Item.from_file(item_url)
signed_item = planetary_computer.sign(item)
signed_item
<Item id=S2A_MSIL2A_20220115T032101_R118_T48NUG_20220115T170435>
Inspect one of the data assets 🍱#
The Sentinel-2 STAC item contains several assets. These include different 🌈 bands (e.g. ‘B02’, ‘B03’, ‘B04’). Let’s just use the ‘visual’ product for now which includes the RGB bands.
url: str = signed_item.assets["visual"].href
da = rioxarray.open_rasterio(filename=url)
da
<xarray.DataArray (band: 3, y: 10980, x: 10980)> [361681200 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * x (x) float64 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05 4.098e+05 * y (y) float64 2e+05 2e+05 2e+05 ... 9.026e+04 9.026e+04 9.024e+04 spatial_ref int64 0 Attributes: _FillValue: 0.0 scale_factor: 1.0 add_offset: 0.0
This is how the Sentinel-2 image looks like over Singapore on 15 Jan 2022.
1️⃣ Construct DataPipe 📡#
A torch DataPipe
is a way of composing data (rather than inheriting data).
Yes, I don’t know what it really means either, so here’s some extra reading.
🔖 References:
https://pytorch.org/blog/pytorch-1.11-released/#introducing-torchdata
https://github.com/pytorch/data/tree/v0.3.0#what-are-datapipes
Create an Iterable 📏#
Start by wrapping a list of URLs to the Cloud-Optimized GeoTIFF files.
We only have 1 item so we’ll use [url]
, but if you have more, you can do
[url1, url2, url3]
, etc. Pass this iterable list into
torchdata.datapipes.iter.IterableWrapper
:
dp = torchdata.datapipes.iter.IterableWrapper(iterable=[url])
dp
IterableWrapperIterDataPipe
The dp
variable is the DataPipe!
Now to apply some more transformations/functions on it.
Read using RioXarrayReader 🌐#
This is where ☯ zen3geo
comes in. We’ll be using the
zen3geo.datapipes.rioxarray.RioXarrayReaderIterDataPipe
class, or
rather, the short alias zen3geo.datapipes.RioXarrayReader
.
Confusingly, there are two ways or forms of applying RioXarrayReader
,
a class-based method and a functional method.
# Using class constructors
dp_rioxarray = zen3geo.datapipes.RioXarrayReader(source_datapipe=dp)
dp_rioxarray
RioXarrayReaderIterDataPipe
# Using functional form (recommended)
dp_rioxarray = dp.read_from_rioxarray()
dp_rioxarray
RioXarrayReaderIterDataPipe
Note that both ways are equivalent (they produce the same IterDataPipe output), but the latter (functional) form is preferred, see also https://pytorch.org/data/0.4.0/tutorial.html#registering-datapipes-with-the-functional-api
What if you don’t want the whole Sentinel-2 scene at the full 10m resolution?
Since we’re using Cloud-Optimized GeoTIFFs, you could set an overview_level
(following https://corteva.github.io/rioxarray/stable/examples/COG.html).
dp_rioxarray_zoom3 = dp.read_from_rioxarray(overview_level=3)
dp_rioxarray_zoom3
RioXarrayReaderIterDataPipe
Extra keyword arguments will be handled by rioxarray.open_rasterio()
or rasterio.open()
.
Note
Other DataPipe classes/functions can be stacked or joined to this basic GeoTIFF reader. For example, clipping by bounding box or reprojecting to a certain Coordinate Reference System. If you would like to implement this, check out the Contributing Guidelines to get started!
2️⃣ Loop through DataPipe ⚙️#
A DataPipe describes a flow of information. Through a series of steps it goes, as one piece comes in, another might follow.
Basic iteration ♻️#
At the most basic level, you could iterate through the DataPipe like so:
it = iter(dp_rioxarray_zoom3)
dataarray = next(it)
dataarray
<xarray.DataArray (band: 3, y: 687, x: 687)> [1415907 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * x (x) float64 3.001e+05 3.002e+05 ... 4.096e+05 4.097e+05 * y (y) float64 2e+05 1.998e+05 1.996e+05 ... 9.048e+04 9.032e+04 spatial_ref int64 0 Attributes: _FillValue: 0.0 scale_factor: 1.0 add_offset: 0.0
Or if you’re more familiar with a for-loop, here it is:
for dataarray in dp_rioxarray_zoom3:
print(dataarray)
# Run model on this data batch
StreamWrapper<<xarray.DataArray (band: 3, y: 687, x: 687)>
[1415907 values with dtype=uint8]
Coordinates:
* band (band) int64 1 2 3
* x (x) float64 3.001e+05 3.002e+05 ... 4.096e+05 4.097e+05
* y (y) float64 2e+05 1.998e+05 1.996e+05 ... 9.048e+04 9.032e+04
spatial_ref int64 0
Attributes:
_FillValue: 0.0
scale_factor: 1.0
add_offset: 0.0>
Into a DataLoader 🏋️#
For the deep learning folks, you might need one extra step.
The xarray.DataArray
needs to be converted to a tensor.
In the Pytorch world, that can happen via torch.as_tensor()
.
def fn(da):
return torch.as_tensor(da.data)
Using torchdata.datapipes.iter.Mapper
,
we’ll apply the tensor conversion function to each dataarray in the DataPipe.
dp_tensor = dp_rioxarray_zoom3.map(fn=fn)
dp_tensor
MapperIterDataPipe
Finally, let’s put our DataPipe into a torch.utils.data.DataLoader
!
dataloader = torch.utils.data.DataLoader(dataset=dp_tensor)
for batch in dataloader:
tensor = batch
print(tensor)
tensor([[[[ 46, 29, 34, ..., 241, 246, 255],
[ 83, 73, 66, ..., 248, 255, 251],
[ 53, 43, 55, ..., 246, 247, 243],
...,
[101, 101, 104, ..., 78, 179, 83],
[ 99, 103, 105, ..., 68, 60, 45],
[ 95, 103, 102, ..., 50, 34, 42]],
[[ 58, 24, 44, ..., 255, 255, 255],
[ 60, 51, 57, ..., 255, 255, 254],
[ 47, 22, 47, ..., 255, 255, 255],
...,
[110, 111, 114, ..., 95, 189, 87],
[110, 113, 113, ..., 85, 62, 48],
[108, 112, 112, ..., 62, 60, 62]],
[[ 42, 22, 29, ..., 255, 255, 255],
[ 43, 41, 39, ..., 255, 255, 254],
[ 35, 30, 37, ..., 255, 255, 255],
...,
[ 82, 82, 83, ..., 74, 174, 57],
[ 82, 84, 84, ..., 57, 46, 28],
[ 80, 83, 82, ..., 37, 31, 31]]]], dtype=torch.uint8)
And so it begins 🌄
That’s all 🎉! For more information on how to use DataPipes, check out:
Tutorial at https://pytorch.org/data/0.4.0/tutorial.html
Usage examples at https://pytorch.org/data/0.4.0/examples.html
If you have any questions 🙋, feel free to ask us anything at https://github.com/weiji14/zen3geo/discussions or visit the Pytorch forums at https://discuss.pytorch.org/c/data/37.
Cheers!