From: MoonShadow Date: Sun, 15 Mar 2026 16:23:58 +0000 (+0800) Subject: ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain... X-Git-Tag: upstream/1.8.4~18 X-Git-Url: https://git.djapps.eu/?a=commitdiff_plain;h=b327a321a21899e17c8923671500241ed9f93670;p=pkg%2Fggml%2Fsources%2Fwhisper.cpp ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (llama/20536) * ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because the hint is not applicable to UMA systems. The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on APU systems such as AMD Strix Halo (gfx1151). Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it without error checking and clear any resulting error with hipGetLastError(). Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory issues on APU systems, and store totalGlobalMem in device info. Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits hipMallocManaged to ~64GB regardless of available system RAM. A fix has been submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077 Co-Authored-By: Claude Sonnet 4.6 * ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix --------- Co-authored-by: moonshadow-25 Co-authored-by: Claude Sonnet 4.6 --- diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu index ce7a80ac..3886290c 100644 --- a/ggml/src/ggml-cuda/ggml-cuda.cu +++ b/ggml/src/ggml-cuda/ggml-cuda.cu @@ -124,7 +124,10 @@ static cudaError_t ggml_cuda_device_malloc(void ** ptr, size_t size, int device) err = cudaMallocManaged(ptr, size); #if defined(GGML_USE_HIP) if (err == hipSuccess) { - CUDA_CHECK(cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device)); + // hipMemAdviseSetCoarseGrain is an optional performance hint; + // ignore errors (e.g. hipErrorInvalidValue on some APU/iGPU configs). + cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device); + (void)hipGetLastError(); // clear any error } // fall back to cudaMalloc if not supported (e.g. on Windows)