vulkan: add GATED_DELTA_NET op support (llama/20334)
* vulkan: add GATED_DELTA_NET op support
Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.
Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <redacted>
* vulkan: optimize GATED_DELTA_NET shader (Phase 1)
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops
KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).
Co-Authored-By: Claude Opus 4.6 <redacted>
* vulkan: address review feedback for GATED_DELTA_NET