Parallelize convolve_2d numpy kernel (#3615) #3616

Open
brendancol wants to merge 1 commit into main from deep-sweep-performance-convolution-2026-07-02

Conversation

@brendancol

Contributor

Fixes #3615.

Problem

_convolve_2d_numpy iterates its two outer loops with numba.prange but was decorated @jit(nopython=True, nogil=True) without parallel=True. Numba only parallelizes prange when parallel=True is set; otherwise prange behaves like a plain range, so as shipped the loop ran serially on a single core. This is the main CPU convolution path, used by the numpy backend and, per chunk, by the dask+numpy backend.

Fix

  • Add parallel=True to the decorator.
  • Serialize the kernel launch behind a module-level threading.Lock. Dask calls the kernel once per chunk under its threaded scheduler, and numba's default workqueue threading layer is not thread-safe when parallel kernels are launched concurrently from multiple host threads (SIGABRT on macOS). This is the same hazard, and the same fix, already applied to terrain.py (#3141, "Streaming reproject thread pool aborts the process when numba parallel kernels run concurrently"). A single numpy call takes the lock uncontended and still runs across all cores; concurrent dask chunk calls run one at a time, each internally parallel.
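The locking pattern described above can be sketched with the standard library alone. Everything named here (`_KERNEL_LOCK`, `_kernel_stand_in`, `run_kernel`, `process_chunks`, `max_concurrent_launches`) is hypothetical, and a thread pool stands in for dask's threaded scheduler; the counters exist only to show that launches never overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Module-level lock: one per process, shared by every caller of the kernel.
_KERNEL_LOCK = threading.Lock()

# Bookkeeping used only to demonstrate the serialization in this sketch.
_counter_lock = threading.Lock()
_active = 0
_max_active = 0


def _kernel_stand_in(chunk):
    """Hypothetical stand-in for the numba parallel kernel."""
    global _active, _max_active
    with _counter_lock:
        _active += 1
        _max_active = max(_max_active, _active)
    try:
        return [2 * value for value in chunk]
    finally:
        with _counter_lock:
            _active -= 1


def run_kernel(chunk):
    # Only one host thread may launch the kernel at a time.
    with _KERNEL_LOCK:
        return _kernel_stand_in(chunk)


def process_chunks(chunks, workers=8):
    # Mimics a threaded scheduler calling the kernel once per chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_kernel, chunks))


def max_concurrent_launches():
    # Highest number of kernel launches observed running at once.
    return _max_active
```

The trade-off is that chunk-level concurrency is given up in exchange for safety: each launch is expected to use all cores internally, so running launches one at a time should cost little throughput.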

The cupy and dask+cupy paths use a separate CUDA kernel and are untouched.

Measurement

20-core host, 2000x2000 float64 raster, 15x15 kernel, end-to-end through convolve_2d:

  • before: ~356 ms
  • after: ~53 ms

Results are identical (verified np.allclose(..., equal_nan=True)).
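A small illustration of why the parity check passes equal_nan=True, assuming the outputs can contain NaN (for example at the edges under the nan boundary mode). The helper name `results_match` and the sample arrays are hypothetical.

```python
import numpy as np


def results_match(a, b):
    """Return True when a and b agree, treating NaN in the same cell as equal."""
    return bool(np.allclose(a, b, equal_nan=True))


# Two identical outputs that both contain a NaN cell.
before = np.array([[np.nan, 1.0], [2.0, 3.0]])
after = before.copy()

# The default comparison treats NaN as unequal to NaN, so it reports a mismatch.
strict = bool(np.allclose(before, after))

# With equal_nan=True, matching NaN positions count as agreement.
lenient = results_match(before, after)
```

Without equal_nan=True, any output containing NaN would fail the comparison even when the two runs are bit-for-bit identical.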

Verification

  • xrspatial/tests/test_convolution.py: 6 passed.
  • numpy vs dask+numpy parity across all four boundary modes (nan, nearest, reflect, wrap): match.
  • cupy vs numpy parity: match.
  • Concurrent dask threaded scheduler (8 workers): runs clean, no crash or deadlock.

Benchmark

Adds benchmarks/benchmarks/convolution.py with Convolve2d (numpy/cupy/dask across sizes and kernel sizes) and CircleKernel, so the serial regression cannot silently return.

Performance context

  • OOM verdict: SAFE. Peak dask memory scales with chunk size plus overlap depth, not the full array (~20 tasks per chunk).
  • Bottleneck: compute-bound.
  • Affected backends: numpy (direct speedup), dask+numpy (per-chunk speedup, serialized behind the lock).
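As a rough worked example of the memory claim above, the sketch below compares the size of one padded chunk with the size of the full raster. It assumes the overlap depth is half the kernel size on each side and uses a hypothetical 500x500 chunk; neither value is taken from the PR, and the function name `padded_chunk_bytes` is illustrative. It counts only a single input block, not every intermediate dask holds.

```python
def padded_chunk_bytes(chunk_rows, chunk_cols, kernel_rows, kernel_cols, itemsize=8):
    """Bytes in one chunk after padding with the overlap on every side."""
    # Assumption: overlap depth equals half the kernel extent per axis.
    depth_rows = kernel_rows // 2
    depth_cols = kernel_cols // 2
    padded_rows = chunk_rows + 2 * depth_rows
    padded_cols = chunk_cols + 2 * depth_cols
    return padded_rows * padded_cols * itemsize


# Full 2000x2000 float64 raster from the measurement above.
full_raster_bytes = 2000 * 2000 * 8

# Hypothetical 500x500 chunk with the 15x15 kernel (depth 7 per side).
one_chunk_bytes = padded_chunk_bytes(500, 500, 15, 15)
```

Under these assumptions one padded chunk is about 2.1 MB against 32 MB for the full raster, which is consistent with memory tracking chunk size plus overlap rather than array size.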

_convolve_2d_numpy iterated with numba.prange but was decorated
@jit(nopython=True, nogil=True) with no parallel=True, so prange
degraded to a serial range. On a 20-core host a 2000x2000 float64
raster with a 15x15 kernel dropped from ~356 ms to ~53 ms end-to-end
once parallel=True was enabled, with identical results.

Add parallel=True and serialize the kernel launch behind a
module-level threading.Lock, since dask calls it per chunk under a
threaded scheduler and numba's workqueue layer is not threadsafe
across host threads (SIGABRT on macOS, same hazard as #3141 fixed in
terrain.py). A single numpy call takes the lock uncontended and still
runs across all cores.

Add benchmarks/benchmarks/convolution.py (Convolve2d, CircleKernel)
so the regression cannot silently return.
@github-actions (bot) added the performance label (PR touches performance-sensitive code) on Jul 2, 2026

Labels

performance: PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

convolve_2d numpy kernel uses prange without parallel=True (runs serial, ~7-10x slower)
