Parallelize convolve_2d numpy kernel (#3615) by brendancol · Pull Request #3616 · xarray-contrib/xarray-spatial

brendancol · 2026-07-02T15:48:19Z

Problem

_convolve_2d_numpy iterates its two outer loops with numba.prange but was decorated @jit(nopython=True, nogil=True) without parallel=True. Numba only parallelizes prange when parallel=True is set, so as shipped the loops ran serially on a single core. This is the main CPU convolution path, used directly by the numpy backend and once per chunk by the dask+numpy backend.

Fix

Add parallel=True to the decorator.
Serialize the kernel launch behind a module-level threading.Lock. Dask calls the kernel once per chunk under its threaded scheduler, and numba's default workqueue threading layer is not threadsafe across host threads (SIGABRT on macOS). This is the same hazard and fix already applied to terrain.py (#3141, "Streaming reproject thread pool aborts the process when numba parallel kernels run concurrently"). A single numpy call takes the lock uncontended and still runs across all cores; concurrent dask chunk calls run one at a time, each internally parallel.

The cupy and dask+cupy paths use a separate CUDA kernel and are untouched.
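
The two changes above can be sketched as follows. This is a minimal illustration, not the actual xrspatial code: the names _KERNEL_LOCK, _convolve_sketch and convolve_sketch are hypothetical, the kernel is assumed to have odd dimensions, and border cells are simply left as NaN rather than handled by the library's boundary modes. The try/except fallback exists only so the sketch still runs (serially) where numba is not installed.

```python
import threading

import numpy as np

try:
    from numba import jit, prange
except ImportError:
    # Fallback for environments without numba: plain Python, serial.
    prange = range

    def jit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap

# Hypothetical module-level lock: only one host thread launches the
# parallel kernel at a time.
_KERNEL_LOCK = threading.Lock()


@jit(nopython=True, nogil=True, parallel=True)  # parallel=True activates prange
def _convolve_sketch(data, kernel, out):
    rows, cols = data.shape
    krows, kcols = kernel.shape
    half_r, half_c = krows // 2, kcols // 2
    # Rows are distributed across threads; each output cell is written once.
    for y in prange(half_r, rows - half_r):
        for x in range(half_c, cols - half_c):
            acc = 0.0
            for ky in range(krows):
                for kx in range(kcols):
                    acc += data[y + ky - half_r, x + kx - half_c] * kernel[ky, kx]
            out[y, x] = acc
    return out


def convolve_sketch(data, kernel):
    # Border cells stay NaN in this simplified version.
    out = np.full(data.shape, np.nan, dtype=np.float64)
    with _KERNEL_LOCK:  # uncontended for a single caller
        return _convolve_sketch(data, kernel, out)
```

A single caller acquires the lock immediately and the kernel uses every available core; several threads calling convolve_sketch at once queue on the lock, so no two parallel regions are ever active at the same time.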

Measurement

20-core host, 2000x2000 float64 raster, 15x15 kernel, end-to-end through convolve_2d:

before: ~356 ms
after: ~53 ms

Results are identical (verified np.allclose(..., equal_nan=True)).
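
The equal_nan=True flag matters whenever both outputs contain NaN in the same cells, which presumably happens here at raster edges under the nan boundary mode. A small illustration with made-up arrays (not the PR's data):

```python
import numpy as np

# Two identical arrays that both contain a NaN in the same position.
a = np.array([[1.0, np.nan], [3.0, 4.0]])
b = a.copy()

# By default NaN is not considered equal to NaN, so the comparison fails.
strict = np.allclose(a, b)

# With equal_nan=True, NaNs in matching positions count as equal.
nan_aware = np.allclose(a, b, equal_nan=True)

print(strict, nan_aware)
```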

Verification

xrspatial/tests/test_convolution.py: 6 passed.
numpy vs dask+numpy parity across all four boundary modes (nan, nearest, reflect, wrap): match.
cupy vs numpy parity: match.
Concurrent dask threaded scheduler (8 workers): runs clean, no crash or deadlock.

Benchmark

Adds benchmarks/benchmarks/convolution.py with Convolve2d (numpy/cupy/dask across sizes and kernel sizes) and CircleKernel, so the serial regression cannot silently return.

Performance context

OOM verdict: SAFE. Peak dask memory scales with chunk size plus overlap depth, not the full array (~20 tasks per chunk).
Bottleneck: compute-bound.
Affected backends: numpy (direct speedup), dask+numpy (per-chunk speedup, serialized behind the lock).

@jit

_convolve_2d_numpy iterated with numba.prange but was decorated @jit(nopython=True, nogil=True) with no parallel=True, so prange degraded to a serial range. On a 20-core host a 2000x2000 float64 raster with a 15x15 kernel dropped from ~356 ms to ~53 ms end-to-end once parallel=True was enabled, with identical results. Add parallel=True and serialize the kernel launch behind a module-level threading.Lock, since dask calls it per chunk under a threaded scheduler and numba's workqueue layer is not threadsafe across host threads (SIGABRT on macOS, same hazard as #3141 fixed in terrain.py). A single numpy call takes the lock uncontended and still runs across all cores. Add benchmarks/benchmarks/convolution.py (Convolve2d, CircleKernel) so the regression cannot silently return.

github-actions bot added the "performance" label (PR touches performance-sensitive code) on Jul 2, 2026

brendancol wants to merge 1 commit into main from deep-sweep-performance-convolution-2026-07-02
