A PyTorch script for training a pool-detection model

Pool model trainer

Fine-tune a computer vision model to detect swimming pools in residential areas using overhead aerial or satellite photography.


Quick start

# 1. Enter the Nix dev shell (pulls PyTorch, OpenCV, rasterio, etc.)
nix develop

# 2. Install the one pip-only dependency
pip install segmentation-models-pytorch

# 3. Prepare your data  (see "Data format" below)
#    → images in  data/images/
#    → masks in    data/masks/

# 4. Train
python scripts/train.py --config configs/default.yaml

# 5. Watch progress
tensorboard --logdir runs/

# 6. Run on a new image
python -m src.inference.predict \
    --checkpoint runs/<run-name>/checkpoints/best.pth \
    --image path/to/new_tile.png

Setup without Nix (bare metal / virtual env)

If you don't use Nix — for example on a GPU server where CUDA drivers and toolkits are pre-installed — you can create a standard Python virtual environment using the requirements/ files.

1. System dependencies

Install the native libraries needed by OpenCV, PyTorch, and compiled Python packages:

# Debian / Ubuntu
# (on releases that no longer ship libgl1-mesa-glx, such as Ubuntu 24.04, install libgl1 instead)
sudo apt-get update && sudo apt-get install -y \
    libgl1-mesa-glx libglib2.0-0 libgomp1 gcc g++ make

# RHEL / Fedora / CentOS
sudo dnf install -y \
    mesa-libGL glib2 libgomp gcc gcc-c++ make

2. Create a virtual environment

python3.11 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

3. Install PyTorch

Pick the index URL that matches your hardware:

# CPU only
pip install -r requirements/torch.txt --index-url https://download.pytorch.org/whl/cpu

# CUDA 11.8
pip install -r requirements/torch.txt --index-url https://download.pytorch.org/whl/cu118

# CUDA 12.4
pip install -r requirements/torch.txt --index-url https://download.pytorch.org/whl/cu124

4. Install project dependencies

# Core training dependencies (image/CV, data wrangling, training utils)
pip install -r requirements/base.txt

# Optional dev tooling (Jupyter, Label Studio SDK, etc.)
pip install -r requirements/dev.txt

5. Verify

python -c "import torch; print(torch.cuda.is_available())"
python -c "import segmentation_models_pytorch as smp; print(smp.__version__)"

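Then continue with the data preparation, training, and inference steps from the Quick start above (steps 3 to 6). The Training and Inference sections below cover them in more detail.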


Data format

The training pipeline expects this directory layout:

data/
├── images/          # aerial / satellite tiles
│   ├── tile_001.png
│   ├── tile_002.png
│   ├── ...
│   └── tile_999.png
├── masks/           # label masks — one per image, same stem name
│   ├── tile_001.png
│   ├── tile_002.png
│   └── ...
├── train.txt        # (optional) newline-separated stem names for training
└── val.txt          # (optional) newline-separated stem names for validation
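
A quick way to check a dataset against this layout is to pair files by stem. The sketch below is illustrative only: find_pairs is a hypothetical helper, not part of this repository, and it assumes masks are stored as PNG files.

```python
from pathlib import Path

# Extensions the README lists as supported image formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_pairs(data_dir):
    """Return (paired, missing_masks, orphan_masks) as sorted lists of stems."""
    root = Path(data_dir)
    images = {p.stem for p in (root / "images").iterdir()
              if p.suffix.lower() in IMAGE_EXTS}
    masks = {p.stem for p in (root / "masks").iterdir()
             if p.suffix.lower() == ".png"}
    return sorted(images & masks), sorted(images - masks), sorted(masks - images)
```

Images listed in the second return value have no mask yet, and stems in the third are masks without a matching image; both are worth resolving before training.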

Image files

  • Format: PNG is preferred. JPEG and GeoTIFF (.tif / .tiff) also work.
  • Size: The model resizes everything to image_size × image_size (default 256×256) during loading, so source images can be any resolution. A larger image_size preserves more detail per tile at the expense of GPU memory.
  • Channels: RGB (3 channels). If your imagery has a near-infrared band, set model.in_channels: 4 in the config.

Mask files

  • Format: Single-channel (grayscale) PNG.
  • Pixel values: Integer class labels.
    Value   Meaning
    0       Background (not a pool)
    1       Pool
  • Same dimensions as the image. Masks are resized together with their images, using nearest-neighbor interpolation on the mask so that pixel values stay valid class labels and spatial alignment is preserved.
  • Filenames must match the image they annotate. If tile_001.png is the image, the mask must be named tile_001.png as well (just in a different folder).
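
To see why nearest-neighbor interpolation matters for masks, here is a minimal pure-Python sketch. It is not the project's resize code (which library the pipeline uses is not stated here); resize_mask_nearest is a hypothetical helper that only illustrates the idea that every output pixel copies an existing label, so no in-between values such as 0.5 can appear.

```python
def resize_mask_nearest(mask, out_h, out_w):
    """Resize a 2-D list of integer class labels with nearest-neighbor sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        # Each output pixel copies the label of the source pixel it maps to.
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
small = resize_mask_nearest(mask, 2, 2)
print(small)  # [[0, 1], [0, 0]]
```

A bilinear resize of the same mask would blend neighbouring labels into fractional values that no longer correspond to a class.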

Split files (train.txt / val.txt)

Plain text files listing which samples go into each split, one per line, without the file extension:

# data/train.txt
tile_001
tile_003
tile_007
...

If you omit these files, the pipeline splits the data automatically using data.val_fraction (default 15%).
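
The fallback behaviour can be sketched as follows. This is an assumption-laden illustration, not the pipeline's actual code: load_or_make_split is a hypothetical name, and the seeded shuffle and rounding rule are guesses at a reasonable implementation.

```python
import random
from pathlib import Path


def load_or_make_split(data_dir, stems, val_fraction=0.15, seed=42):
    """Use train.txt / val.txt when both exist, otherwise split stems randomly."""
    root = Path(data_dir)
    train_file, val_file = root / "train.txt", root / "val.txt"
    if train_file.exists() and val_file.exists():
        def read(path):
            return [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
        return read(train_file), read(val_file)

    # Sort first so the result depends only on the seed, not on listing order.
    shuffled = sorted(stems)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return sorted(shuffled[n_val:]), sorted(shuffled[:n_val])
```

With 20 stems and the default fraction, this yields 17 training and 3 validation samples.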


Exporting from Label Studio

Label Studio is a widely used open-source tool for image annotation. This section covers how to get your annotations out of Label Studio and into the format above.

There are two workflows:

  1. SDK-based (recommended) — export directly via the Label Studio API, then convert. No manual download step.
  2. Manual JSON export — download the JSON from the Label Studio UI, then convert locally.

Step 1 — Set up your Label Studio project

  1. Create a new project with Labeling Setup → Computer Vision → Semantic Segmentation.
  2. Under Labeling Interface, add a Brush with nested Labels tag:
<View>
  <Image name="image" value="$image"/>
  <Brush name="pool" toName="image">
    <Labels name="labels" toName="image">
      <Label value="Swimming Pool" />
    </Labels>
  </Brush>
</View>
  3. Import your aerial/satellite tiles through the Label Studio UI.
  4. Annotate pools by painting over them with the brush tool.
    • Use a brush size appropriate to your image resolution.
    • Be consistent — label pool water only (or define a convention like "pool water + visible coping" and stick to it across all annotators).

Step 2 — Export from Label Studio

Option A — SDK export (recommended)

Use the Label Studio SDK to export directly from the API. This avoids manual download steps and can export + convert in a single command.

# Set credentials
export LABEL_STUDIO_URL=https://labelstudio.example.com
export LABEL_STUDIO_API_KEY=your_access_token_here

# Export only (saves JSON for inspection / reuse)
python scripts/export_label_studio_sdk.py \
    --project-id 1 \
    --output project-export.json

# Export + convert to training format in one step
python scripts/export_label_studio_sdk.py \
    --project-id 1 \
    --translate \
    --image-root /path/to/original/images/ \
    --output-dir data/

Credentials can be passed via --url / --api-key CLI flags or the LABEL_STUDIO_URL / LABEL_STUDIO_API_KEY environment variables.
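
One common way to implement this kind of fallback is to use the environment variables as argparse defaults. The sketch below is illustrative: resolve_credentials is a hypothetical helper, and the rule that a CLI flag wins over the environment variable is an assumption, since the text above does not state the precedence.

```python
import argparse
import os


def resolve_credentials(argv=None, env=None):
    """Return (url, api_key); CLI flags override environment variables (assumed)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Environment values act as defaults, so an explicit flag replaces them.
    parser.add_argument("--url", default=env.get("LABEL_STUDIO_URL"))
    parser.add_argument("--api-key", default=env.get("LABEL_STUDIO_API_KEY"))
    args, _ = parser.parse_known_args(argv)
    if not args.url or not args.api_key:
        raise SystemExit(
            "Set --url/--api-key or LABEL_STUDIO_URL/LABEL_STUDIO_API_KEY"
        )
    return args.url, args.api_key
```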

Install the SDK:

pip install label-studio-sdk

Option B — Manual JSON export

  1. Go to your project page and click Export.
  2. Choose JSON as the export format.
  3. Click Export to download a file like project-1-at-2026-06-19.json.

Step 2a — Inspect export statistics

Before converting, verify your export looks right:

python scripts/stats_label_studio.py project-1-at-2026-06-19.json

Outputs:

  • Total tasks in the export
  • Completed vs incomplete tasks
  • How many tasks show pool areas vs no pools
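
For a sense of how such counts can be derived, here is a rough sketch. It assumes the standard Label Studio JSON export shape (a list of tasks, each with an annotations list whose entries carry a result list and a was_cancelled flag); export_stats is a hypothetical function and may count differently from scripts/stats_label_studio.py.

```python
def export_stats(tasks):
    """Summarise a Label Studio JSON export (a list of task dicts)."""
    completed = with_pools = 0
    for task in tasks:
        # Ignore annotations the annotator skipped or cancelled.
        annotations = [a for a in task.get("annotations", [])
                       if not a.get("was_cancelled", False)]
        if not annotations:
            continue
        completed += 1
        # A non-empty result list means at least one region was drawn.
        if any(a.get("result") for a in annotations):
            with_pools += 1
    total = len(tasks)
    return {
        "total": total,
        "completed": completed,
        "incomplete": total - completed,
        "with_pools": with_pools,
        "no_pools": completed - with_pools,
    }
```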

Step 3 — Convert to training format

Run the conversion script. Point it at either a local image directory or an S3-compatible bucket (AWS S3, Garage, MinIO, etc.).

Local images:

python scripts/convert_label_studio.py \
    --input project-1-at-2026-06-19.json \
    --image-root /path/to/original/images/ \
    --output-dir data/

S3 images (e.g. self-hosted Garage):

python scripts/convert_label_studio.py \
    --input project-1-at-2026-06-19.json \
    --s3-bucket pool-tiles \
    --s3-prefix tiles/ \
    --s3-endpoint https://garage.example.com \
    --s3-access-key GK... \
    --s3-secret-key ... \
    --output-dir data/

Credentials can also be set via the standard environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY instead of CLI flags.

What this does:

  • Downloads (or copies/symlinks) each annotated image into data/images/
  • Decodes the RLE masks (or rasterizes polygons) into data/masks/ — uses the annotation canvas dimensions (original_width / original_height) from the Label Studio export so masks are correctly aligned with the image
  • Images annotated as "no pools" get an all-zeros mask (treated as background)
  • Writes data/train.txt and data/val.txt with an 85/15 split
  • Skips images that have never been annotated (truly unlabeled)
  • Skips duplicate stems (same image appearing in multiple tasks)
  • Prints a summary of what was exported
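
The skip rules in the list above can be sketched as a small planning pass over the export. This is an illustration under assumptions, not the converter's code: plan_conversion is a hypothetical helper, it assumes each task stores its image reference in data.image, and it leaves mask decoding out entirely.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def plan_conversion(tasks):
    """Return ([(stem, has_pool), ...], skipped_counts) for a list of tasks."""
    seen = set()
    plan = []
    skipped = {"unlabeled": 0, "duplicate": 0}
    for task in tasks:
        annotations = task.get("annotations") or []
        if not annotations:
            skipped["unlabeled"] += 1  # never annotated: leave it out
            continue
        # Works for plain paths and URLs such as s3://bucket/tiles/tile_001.png.
        stem = PurePosixPath(urlparse(task["data"]["image"]).path).stem
        if stem in seen:
            skipped["duplicate"] += 1  # same image in more than one task
            continue
        seen.add(stem)
        # Annotated but empty means "no pools": it gets an all-zeros mask.
        has_pool = any(a.get("result") for a in annotations)
        plan.append((stem, has_pool))
    return plan, skipped
```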

Full usage:

# Image source: pass --image-root for local images, OR the S3 flags instead:
#   --s3-bucket my-bucket                       S3 bucket
#   --s3-endpoint https://s3.example.com        S3 endpoint (for non-AWS)
#   --s3-prefix tiles/                          optional key prefix
#   --s3-access-key ... --s3-secret-key ...     S3 credentials
# --image-format accepts png, jpg, or tif.
# --copy copies files instead of symlinking (use --symlink for the opposite).
python scripts/convert_label_studio.py \
    --input export.json \
    --image-root /original/tiles/ \
    --output-dir data/ \
    --val-fraction 0.15 \
    --image-format png \
    --mask-format png \
    --copy \
    --seed 42

Note: S3 support requires the boto3 package. RLE decoding requires label-studio-converter. Install both with:

pip install boto3 label-studio-converter

Step 4 — Verify the conversion

Spot-check samples with the built-in web viewer:

python scripts/visualize_data.py --data-dir data/ --port 8080

This starts a local web server showing each sample's source image, mask, and a red-tinted overlay side by side. Navigate with the on-screen buttons or the left/right arrow keys.

If you're running on a remote server, create an SSH tunnel:

ssh -L 8080:localhost:8080 your-server

Then open http://localhost:8080 in your browser.


Training

Before you start

The pretrained encoder weights are downloaded from Hugging Face Hub. To avoid rate-limiting and enable faster downloads, set a Hugging Face token:

export HF_TOKEN=hf_your_token_here

You can get a free token at https://huggingface.co/settings/tokens. Without one, training still works, but downloads may be slower.

First run

python scripts/train.py --config configs/default.yaml

This will:

  1. Read images + masks from data/
  2. Build a UNet with a ResNet-34 encoder (pretrained on ImageNet)
  3. Train for 50 epochs, validating after each epoch
  4. Log loss, IoU, and Dice to TensorBoard
  5. Save the best checkpoint (by validation IoU) to runs/<timestamp>/checkpoints/best.pth

Override config from the command line

python scripts/train.py --config configs/default.yaml \
    training.num_epochs=100 \
    training.batch_size=4 \
    model.encoder_name=resnet50
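
Dotted key=value overrides like these are typically applied by walking a nested config structure. The sketch below shows the idea on plain dictionaries; apply_overrides and parse_value are hypothetical names, and the project's typed configuration dataclass may parse and validate values differently.

```python
def parse_value(text):
    """Convert an override string to bool, int, or float where possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Apply 'section.key=value' strings to a nested dict (modified in place)."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config
```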

Resume from a checkpoint

python scripts/train.py --config configs/default.yaml \
    training.resume_from=runs/20260619_120000/checkpoints/best.pth

Monitor with TensorBoard

tensorboard --logdir runs/ --port 6006
# Open http://localhost:6006 in a browser

Key metrics to watch:

  • val/iou_mean — your primary metric. Above 0.7 is good, above 0.85 is excellent.
  • train/loss — should decrease smoothly. Spikes may mean your learning rate is too high.
  • val/dice_class_1 — Dice score for the pool class only (ignores background).
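
For reference, per-class IoU and Dice can be computed from pixel counts as in the sketch below. It uses the standard definitions, IoU = TP / (TP + FP + FN) and Dice = 2·TP / (2·TP + FP + FN); iou_and_dice is a hypothetical helper, and both the handling of an absent class and the reading of val/iou_mean as the average over classes are assumptions rather than a description of src/training/metrics.py.

```python
def iou_and_dice(pred, target, cls=1):
    """Return (IoU, Dice) for one class over two 2-D label grids."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    if tp + fp + fn == 0:
        return 1.0, 1.0  # class absent from both grids: treated as a perfect match
    return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)


pred = [[1, 1, 0], [0, 0, 0]]
target = [[1, 0, 0], [1, 0, 0]]
pool_iou, pool_dice = iou_and_dice(pred, target, cls=1)  # IoU = 1/3, Dice = 0.5
background_iou, _ = iou_and_dice(pred, target, cls=0)    # IoU = 3/5
mean_iou = (pool_iou + background_iou) / 2
```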

Training outputs

Each training run creates a timestamped directory under runs/. Here's what ends up on disk:

runs/
└── 20260621_143052/              # auto-generated experiment name (timestamp)
    ├── tensorboard/              # TensorBoard event files
    │   └── events.out.tfevents...
    └── checkpoints/
        ├── best.pth              # checkpoint with highest val/iou_mean
        ├── epoch_0005.pth        # periodic snapshot (every save_every epochs)
        ├── epoch_0010.pth
        ├── ...
        └── last.pth              # checkpoint from the final epoch

Checkpoint contents — each .pth file is a standard PyTorch checkpoint dictionary:

Key                     Contents
epoch                   Integer — which epoch this was saved from
model_state_dict        Model weights (loadable with load_state_dict)
optimizer_state_dict    Optimizer state (for resuming training)
metrics                 Dict with val/loss, val/iou_mean, etc.

Which checkpoint should I use?

  • best.pth — use this for inference / predictions. It's the model with the highest validation IoU across all epochs.
  • last.pth — the final model state. Useful if you want to resume training later.
  • epoch_NNNN.pth — periodic snapshots. Handy if you notice the model started overfitting and want to pick an earlier epoch.
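
The three kinds of checkpoint follow a simple per-epoch policy, sketched below. This is a guess at the logic, not the trainer's code: checkpoints_to_write is a hypothetical function, epochs are assumed to be 1-indexed, and last.pth is assumed to be written only at the final epoch (the trainer may instead overwrite it every epoch).

```python
def checkpoints_to_write(epoch, val_iou, best_iou, save_every, num_epochs):
    """Return (filenames, new_best_iou) for one finished epoch."""
    names = []
    if val_iou > best_iou:
        names.append("best.pth")  # new highest validation IoU
        best_iou = val_iou
    if save_every and epoch % save_every == 0:
        names.append(f"epoch_{epoch:04d}.pth")  # periodic snapshot
    if epoch == num_epochs:
        names.append("last.pth")  # final model state
    return names, best_iou
```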

Inference

# Single image
python -m src.inference.predict \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --image data/images/tile_042.png \
    --output predictions/

# All images in a directory
python -m src.inference.predict \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --dir data/images/ \
    --output predictions/

# On GPU
python -m src.inference.predict \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --image tile.png --device cuda

Output masks are written as grayscale PNGs where white (255) = predicted pool.


Using the model with Label Studio (pre-annotation)

Once you have a trained model, you can use it to pre-annotate tasks in Label Studio. The model takes a first pass at detecting pools on every image; annotators then review and fix the predictions instead of drawing every pool from scratch. This can dramatically speed up labeling throughput.

How it works

The script scripts/predict_label_studio.py:

  1. Connects to your Label Studio instance via the SDK.
  2. Lists all tasks in the target project.
  3. Downloads each image, runs the model, and converts the output mask into Label Studio's RLE brush-label format.
  4. Pushes the mask as a prediction on the task. When an annotator opens the task, the prediction appears as a pre-filled brush region — they can accept it, adjust the boundaries, or erase it entirely.

Quick start

# Install the SDK (one-time)
pip install label-studio-sdk

# Set credentials
export LABEL_STUDIO_URL=https://labelstudio.example.com
export LABEL_STUDIO_API_KEY=your_access_token_here

# Push predictions for all tasks in project 1
python scripts/predict_label_studio.py \
    --checkpoint runs/20260621_143052/checkpoints/best.pth \
    --project-id 1

Common workflows

Dry-run first — simulate without uploading to confirm the model produces sane output on your imagery:

python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --dry-run

Target specific tasks — only pre-annotate tasks 42, 43, and 44:

python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --task-ids 42 43 44

Skip already-predicted tasks — safe to re-run after adding new images to the project; tasks that already have predictions are left alone:

python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --skip-existing

Use GPU for faster inference on projects with many images:

python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --device cuda

Tune batch processing for higher throughput on large projects:

# Increase GPU batch size (default 8 — try 16 or 24 on GPUs with ≥16 GB VRAM)
python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --device cuda \
    --inference-batch-size 16

# Speed up task listing with page size (default 100)
# (Not exposed as a flag — edit scripts/predict_label_studio.py if needed)

The pipeline processes images in batches for maximum GPU utilisation:

  1. Downloads a batch of images from S3 in parallel (up to 8 concurrent requests)
  2. Runs GPU inference on the entire batch at once
  3. Uploads predictions to Label Studio in parallel (up to 8 concurrent requests)

This keeps the GPU busy and minimises idle time waiting for I/O. Progress is reported every 10 seconds with tasks/second, success/error counts, and ETA.
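
The three stages can be sketched with a thread pool for the I/O-bound steps. This is a simplified illustration rather than the script's implementation: run_pipeline is a hypothetical function, the download, infer_batch, and upload callables are stand-ins supplied by the caller, and unlike a fully pipelined design this version does not overlap the I/O of one batch with the inference of the next.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_pipeline(tasks, download, infer_batch, upload, batch_size=8, io_workers=8):
    """Download, infer, and upload in batches; return the number of tasks done."""
    done = 0
    with ThreadPoolExecutor(max_workers=io_workers) as pool:
        for batch in batched(tasks, batch_size):
            images = list(pool.map(download, batch))  # parallel downloads
            masks = infer_batch(images)               # one forward pass per batch
            list(pool.map(upload, batch, masks))      # parallel uploads
            done += len(batch)
    return done
```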

Task list caching

The task list is cached locally to avoid re-downloading it on every run (default TTL: 60 minutes). The cache lives at .cache/tasks_<project_id>.json.

Important: If predictions are created (by any process) while a cached task list is still valid, a subsequent run reuses the stale cache and its anti-duplicate check can miss them, potentially creating duplicate predictions on tasks that received predictions after the cache was written. To force a fresh task list:

# Delete the cache file before running
rm .cache/tasks_1.json

# Or disable the cache entirely (always re-fetches)
python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --device cuda \
    --cache-ttl-min 0

# Or extend TTL for long-running production runs
python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --device cuda \
    --cache-ttl-min 1440  # 24 hours

Image downloads from S3 are never cached — only the task metadata list is.
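
A TTL check of this kind can be sketched in a few lines. The version below is an assumption about how such a cache might work, not the script's code: load_cached_tasks is a hypothetical helper, and it judges freshness by the file's modification time, whereas the real cache may store a timestamp inside the JSON.

```python
import json
import time
from pathlib import Path


def load_cached_tasks(cache_path, ttl_min, now=None):
    """Return the cached task list, or None if the cache is disabled, missing, or stale."""
    path = Path(cache_path)
    if ttl_min <= 0 or not path.exists():
        return None  # a TTL of 0 disables the cache
    now = time.time() if now is None else now
    age_min = (now - path.stat().st_mtime) / 60
    if age_min > ttl_min:
        return None  # older than the TTL: the caller should re-fetch
    return json.loads(path.read_text())
```

A caller that gets None back would fetch the task list from Label Studio and rewrite the cache file.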

Use a different label name if your labeling config doesn't use "Swimming Pool":

python scripts/predict_label_studio.py \
    --checkpoint runs/run_name/checkpoints/best.pth \
    --project-id 1 --label-name "Pool" \
    --from-name "labels" --to-name "image"

What annotators see

When a task has a prediction, Label Studio displays the pre-filled mask alongside the image. Annotators can:

  • Press Ctrl+Enter to accept the prediction as-is (it becomes the annotation).
  • Use the brush/eraser tools to refine the mask.
  • Delete the prediction and draw from scratch if the model was wrong.

Iterative improvement (active learning)

This workflow works particularly well in a loop:

  1. Label a small initial set of images in Label Studio (50 to 100 tiles).
  2. Export the annotations and train a first model (following the sections above).
  3. Pre-annotate the remaining unlabeled tiles with the model.
  4. Review and correct the predictions — much faster than labeling from scratch.
  5. Retrain with the expanded dataset (original labels + corrected predictions).
  6. Repeat until the model is good enough that you only need to spot-check.

Each iteration should improve the model, which in turn produces better pre-annotations, which makes each labeling pass faster.

Labeling config requirements

Your Label Studio project must use a Brush with nested Labels tag (semantic segmentation with a brush tool). The default tag names match the setup described in Exporting from Label Studio:

<View>
  <Image name="image" value="$image"/>
  <Brush name="pool" toName="image">
    <Labels name="labels" toName="image">
      <Label value="Swimming Pool" />
    </Labels>
  </Brush>
</View>

If your config uses different tag names, pass --from-name, --to-name, and --label-name to match.


Multi-model training

The project supports multiple model architectures registered through a ModelSpec abstraction. Two families are included:

Model Type      Architecture                        Task                    Output
segmentation    SMP UNet / DeepLabV3+ / FPN / MANet Semantic segmentation   Per-pixel class mask
detection       Faster R-CNN (ResNet-50 FPN)        Object detection        Bounding boxes + scores

Training different model types

# Train a segmentation model (default)
python scripts/train.py --config configs/default.yaml

# Train a detection model (bboxes derived from masks automatically)
python scripts/train.py --config configs/default.yaml \
    model.model_type=detection \
    training.batch_size=4 \
    training.learning_rate=0.0005

Both models use the same data/ directory layout (images + masks). The detection model derives bounding boxes from masks via connected-component labeling — no separate annotation format needed.
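
To illustrate the idea, the sketch below labels connected components with a breadth-first search and returns one box per component. It is a pure-Python illustration, not the code in src/data/dataset.py: boxes_from_mask is a hypothetical helper, 4-connectivity is an assumption, and the (x_min, y_min, x_max, y_max) format with exclusive maxima is chosen here for illustration.

```python
from collections import deque


def boxes_from_mask(mask, cls=1):
    """Return one (x_min, y_min, x_max, y_max) box per 4-connected region of cls."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if mask[y0][x0] != cls or seen[y0][x0]:
                continue
            # Flood-fill this component while tracking its extent.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] == cls and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max + 1, y_max + 1))
    return boxes


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
print(boxes_from_mask(mask))  # [(0, 0, 2, 2), (3, 2, 5, 4)]
```

Two pools that touch in the mask end up in a single box with this approach, which is one reason consistent annotation boundaries matter for the detection model.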

Analysing specific Label Studio tasks

Download images from specific LS tasks, run inference with multiple models, and save predictions for visual comparison:

# Credentials (all support environment variables)
export LABEL_STUDIO_URL=https://labelstudio.example.com
export LABEL_STUDIO_API_KEY=your_token
export S3_ENDPOINT=https://s3.example.com
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export S3_REGION=garage       # or us-east-1 for AWS

python scripts/analyze_tasks.py \
    --project-id 8 \
    --task-ids 122497 122498 122499 \
    --checkpoints runs/seg/best.pth runs/det/best.pth \
    --labels "Segmentation" "Detection" \
    --output analysis/run1 \
    --include-ground-truth \
    --device cuda

The output directory layout:

analysis/run1/
├── images/               # source images downloaded from S3
│   ├── task_122497.png
│   └── ...
├── masks/                # ground truth masks (if --include-ground-truth)
│   ├── task_122497.png
│   └── ...
├── Segmentation/         # predictions from model A
│   ├── task_122497_mask.png
│   └── ...
├── Detection/            # predictions from model B
│   ├── task_122497_mask.png
│   └── ...
└── models.json           # metadata about the models used

Interactive comparison viewer

python scripts/visualize_data.py --comparison-dir analysis/run1/
# Open http://localhost:8080 in a browser
# Remote server: ssh -L 8080:localhost:8080 your-server

The comparison viewer shows the source image, ground truth mask, and one colour-coded prediction panel per model. Use the left/right arrow keys or the on-screen buttons to navigate.

Static comparison report

For batch metrics across an entire data directory:

python scripts/compare_models.py \
    --checkpoints runs/seg/best.pth runs/det/best.pth \
    --labels "UNet" "Faster-RCNN" \
    --data-dir data/ \
    --output comparisons/run1 \
    --max-samples 50 \
    --device cuda

Generates per-image visualizations in comparisons/run1/visualizations/ and a metrics summary JSON at comparisons/run1/metrics.json.


Archiving and reusing models

Once you have a well-performing model, you can package it into a portable archive for transfer between inference systems or long-term storage.

Archive a run

# Package a training run (best + last checkpoints, tensorboard, summary)
python scripts/archive_run.py runs/stadia_seg_v1

# Custom output path
python scripts/archive_run.py runs/stadia_seg_v1 -o models/pool-detector-v2.tar.gz

The archive contains only the essential outputs — no input data or intermediate epoch snapshots:

stadia_seg_v1/
├── checkpoints/
│   ├── best.pth          # best validation checkpoint
│   └── last.pth          # final epoch (resume-capable)
├── tensorboard/          # metric history
│   └── events.out.tfevents...
└── summary.json          # metrics, model config, parameter count
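
The selection of files can be sketched with the standard tarfile module. This is a simplified illustration, not scripts/archive_run.py: archive_run and KEEP are hypothetical names, the default output path next to the run directory is an assumption, and summary.json is only included if it already exists (the real script may generate it).

```python
import tarfile
from pathlib import Path

# Paths (relative to the run directory) that go into the archive.
KEEP = ("checkpoints/best.pth", "checkpoints/last.pth", "tensorboard", "summary.json")


def archive_run(run_dir, output=None):
    """Pack the essential outputs of a run into a .tar.gz and return its path."""
    run = Path(run_dir)
    output = Path(output) if output else run.parent / f"{run.name}.tar.gz"
    with tarfile.open(output, "w:gz") as tar:
        for rel in KEEP:
            src = run / rel
            if src.exists():
                # Directories such as tensorboard/ are added recursively.
                tar.add(src, arcname=f"{run.name}/{rel}")
    return output
```

Because only the listed paths are added, periodic epoch_NNNN.pth snapshots never reach the archive.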

Inspect an archive

python scripts/archive_run.py --list models/stadia_seg_v1.tar.gz

Prints model type, best metrics, parameter count, and file listing.

Transfer and reuse

Copy the archive to another system and extract:

scp models/stadia_seg_v1.tar.gz gpu-server:/home/user/models/
ssh gpu-server
tar xzf models/stadia_seg_v1.tar.gz

The extracted directory works directly with inference and Label Studio scripts:

# Single-image inference
python -m src.inference.predict \
    --checkpoint stadia_seg_v1/checkpoints/best.pth \
    --image tile.png --device cuda

# Label Studio pre-annotation
python scripts/predict_label_studio.py \
    --checkpoint stadia_seg_v1/checkpoints/best.pth \
    --project-id 1 --device cuda

# Resume training from the last checkpoint
python scripts/train.py --config configs/default.yaml \
    training.resume_from=stadia_seg_v1/checkpoints/last.pth

The checkpoint format is self-contained — it embeds all model type and architecture metadata, so no config file is needed to reconstruct the model for inference.


Project structure

src/
├── data/
│   ├── dataset.py            # PoolDataset + DataLoader builder + bbox utilities
│   └── augmentations.py      # Shared augmentation pipelines
├── models/
│   ├── spec.py               # ModelSpec base class / protocol
│   ├── registry.py           # Global model registry (lazy discovery)
│   ├── segmentation.py       # SegmentationSpec (SMP UNet/DeepLabV3+/FPN/MANet)
│   ├── detection.py          # DetectionSpec (Faster R-CNN)
│   └── factory.py            # Low-level create_model() + count_parameters()
├── training/
│   ├── trainer.py            # Model-type-agnostic training loop
│   ├── metrics.py            # IoU, Dice, pixel accuracy, box IoU/F1
│   └── losses.py             # DiceLoss, CombinedLoss (CE + Dice)
├── inference/
│   └── predict.py            # CLI for running a trained model on new images
└── utils/
    └── config.py             # Typed configuration dataclass

configs/
└── default.yaml              # All hyperparameters in one place

scripts/
├── train.py                  # Training entry point
├── archive_run.py            # Package training runs into portable archives
├── export_label_studio_sdk.py   # Export Label Studio project via SDK
├── fast_export_ls.py            # Fast export: fetch annotated tasks via API
├── convert_label_studio.py      # Label Studio JSON → training data
├── stats_label_studio.py        # Print high-level stats from an export
├── predict_label_studio.py      # Push model predictions onto Label Studio tasks
├── analyze_tasks.py             # Download LS tasks & run inference with multiple models
├── compare_models.py            # Multi-model comparison: metrics + visualizations
└── visualize_data.py            # Web server for visually inspecting training data & predictions

notebooks/
└── 01_explore_data.ipynb     # Interactive data exploration

tests/
└── test_metrics.py           # Unit tests for metrics and losses

Adding a GPU

When you have access to an NVIDIA GPU:

  1. In flake.nix, change the torch line:

    # From:
    torch = pythonPackages.torch;
    # To:
    pythonPackages = pkgs.python3Packages;  # or pkgs.cudaPackages.python3Packages
    torch = pythonPackages.torchWithCuda;
    

    Then run nix develop again.

  2. In configs/default.yaml, set:

    training:
      use_amp: true
    device: auto
    
  3. Bump up the batch size — the GPU can handle larger batches.


Common issues

"Floating point exception" or "Illegal instruction" when importing torchvision/rasterio → Your CPU lacks AVX2 instructions. The pre-built Nix binaries are compiled for newer CPUs. Rebuild locally with nix develop --option builders '' (slow, but works) or move to a GPU server where the binaries match the hardware.

ModuleNotFoundError: segmentation_models_pytorch → Run pip install segmentation-models-pytorch inside the nix shell. It installs to .venv/ which is auto-added to PYTHONPATH.

CUDA out of memory → Reduce training.batch_size or data.image_size. Start with batch_size: 4, image_size: 224.

"unable to allocate shared memory" when running in Docker / Podman → PyTorch DataLoader workers use /dev/shm for inter-process communication, which defaults to 64 MB inside containers. Add --shm-size=2g to your docker run / podman run command, or set training.num_workers: 0 in the config.