ggml-webgpu: add vectorized flash attention (#20709)
* naive vectorized version
* add vectorized flash attention
* update vec version
* remove unused path and shader
* remove unused helper functions
* add comments
* remove pad path
* ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization
* change back to vec4
* enable multi split
* enable vec path when:
- Q->ne[1] < 20
- Q->ne[0] % 32 == 0
- V->ne[0] % 4 == 0
- K->type == f16
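The gating conditions above can be sketched as a host-side predicate. This is a hypothetical illustration, not the actual ggml-webgpu code: the struct and enum below are simplified stand-ins for ggml's tensor type, keeping only the fields the conditions use (`ne[0]` is the head dimension, `ne[1]` the number of query rows).

```cpp
#include <cstdint>

// Simplified stand-ins for ggml's tensor representation (illustration only).
enum tensor_type { TYPE_F16, TYPE_F32, TYPE_Q4, TYPE_Q8 };

struct tensor {
    int64_t     ne[4]; // dimensions; ne[0] is fastest-varying
    tensor_type type;  // element type
};

// Returns true when the vectorized flash-attention path may be taken,
// per the conditions listed in the commit message above.
static bool use_flash_attn_vec(const tensor & Q, const tensor & V, const tensor & K) {
    return Q.ne[1] < 20        // few query rows (decode-like workloads)
        && Q.ne[0] % 32 == 0   // head dim divisible by 32
        && V.ne[0] % 4 == 0    // V head dim divisible by 4 (vec4 loads)
        && K.type == TYPE_F16; // f16 K cache
}
```

A later commit in this PR ("enable vec path for q4 and q8") relaxes the last condition to also accept quantized K.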
* update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select
* enable vec path for q4 and q8
* flash-attn vec nwg=1 fast path (skip tmp/reduce staging)
* use packed f16 K loads in flash-attn vec split
* use packed f16 K loads in flash-attn vec split on host side
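The idea behind the packed-load commits can be illustrated on the CPU side. This is a hedged sketch, not the shader or host code from the PR: two half-precision K values are fetched as one 32-bit word and then unpacked, halving the number of memory transactions; `unpack2x16float` mirrors the WGSL built-in of the same name, and the f16 conversion handles normal numbers only, for brevity.

```cpp
#include <cstdint>
#include <cstring>

// Convert one IEEE 754 binary16 value to float.
// Assumes a normal number (exponent not 0 and not 31), for brevity.
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits = sign | ((exp + 112) << 23) | (man << 13); // rebias 15 -> 127
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Unpack one 32-bit word holding two f16 values (low half first),
// mirroring WGSL's unpack2x16float built-in.
static void unpack2x16float(uint32_t packed, float out[2]) {
    out[0] = half_to_float((uint16_t)(packed & 0xFFFF));
    out[1] = half_to_float((uint16_t)(packed >> 16));
}
```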
* tune flash-attn vec f16 VEC_NE by head dim
* cleanup
* cleanup
* keep host side clean
* cleanup host side
* change back to original host wait/submit behavior
* formatting
* revert param-buffer pool refactor
* add helper functions
* ggml-webgpu: move flash-attn vec pipeline caching back into shader lib
* ggml-webgpu: remove duplicate functions
* ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation
* ggml-webgpu: revert unrelated change
* ggml-webgpu: revert deleted comment
* disable uniformity check
* remove unnecessary change
* Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
* Update ggml/src/ggml-webgpu/ggml-webgpu.cpp
---------
Co-authored-by: Reese Levine <redacted>