llama : add high-throughput mode (#14363)
author    Georgi Gerganov <redacted>
          Wed, 16 Jul 2025 13:35:42 +0000 (16:35 +0300)
committer GitHub <redacted>
          Wed, 16 Jul 2025 13:35:42 +0000 (16:35 +0300)
commit 225e7a1438f4ea85eaa7b5ef3ab3b266ee4d9c06
tree   e06dd666187699032cb1e5940a8bce0031d98b11
parent ab140198211385b85eeeb0abd549a4bbe259e10d
llama : add high-throughput mode (#14363)

* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci
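
A minimal sketch of the "stream" idea, assuming illustrative names that do not match the real llama.cpp internals: instead of one unified K/V buffer shared by all sequences, the cache keeps one independent K/V stream per sequence, so each sequence indexes its own buffer.

```cpp
// Hedged sketch only, not the actual llama.cpp data structures.
#include <cstdint>
#include <vector>

struct kv_stream {
    std::vector<uint8_t> k; // K buffer for this stream
    std::vector<uint8_t> v; // V buffer for this stream
    uint32_t head = 0;      // next free cell in this stream
};

struct kv_cache_split {
    std::vector<kv_stream> streams; // one entry per sequence (n_seq_max total)

    // with split streams the mapping is trivial: sequence i owns stream i
    kv_stream & stream_for_seq(int32_t seq_id) {
        return streams[seq_id];
    }
};
```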

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci
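
A hedged sketch of enabling the mode from the C API. Per the rename later in this PR, the backing parameter ends up as `kv_unified`; treat the exact field name as an assumption and check llama.h.

```cpp
// Sketch, assuming the context-params field is named kv_unified after the
// rename later in this PR.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_seq_max  = 8;     // number of parallel sequences
    cparams.kv_unified = false; // false -> one KV stream per sequence (high-throughput mode)

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        llama_model_free(model);
        return 1;
    }

    // ... decode batches that span multiple sequences ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```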

* kv-cache : fix handling when find_slot fails

ggml-ci
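
An illustrative sketch of the failure path being fixed, with names that do not match the real implementation: find_slot may not find room for the batch, and the caller must back out cleanly instead of writing into an invalid slot.

```cpp
// Hedged sketch only; the real find_slot is more involved.
#include <cstdint>
#include <optional>

struct slot_info {
    uint32_t head;    // first cell of the slot
    uint32_t n_cells; // number of cells reserved
};

// returns std::nullopt when the cache has no room for n_tokens
std::optional<slot_info> find_slot(uint32_t n_tokens, uint32_t cache_used, uint32_t cache_size) {
    if (cache_used + n_tokens > cache_size) {
        return std::nullopt;
    }
    return slot_info{cache_used, n_tokens};
}

int decode_step(uint32_t n_tokens, uint32_t & cache_used, uint32_t cache_size) {
    auto slot = find_slot(n_tokens, cache_used, cache_size);
    if (!slot) {
        return 1; // report "could not find a KV slot"; cache state unchanged
    }
    cache_used += slot->n_cells;
    // ... write K/V for the batch into the reserved cells ...
    return 0;
}
```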

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci
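
A sketch of the kind of check described above, assuming illustrative names: sequence ids arrive in user-supplied batches, so they must be validated against n_seq_max before being used to index per-stream buffers.

```cpp
// Hedged sketch; validation in llama.cpp happens in the batch/cache code paths.
#include <cstdint>
#include <cstdio>

typedef int32_t llama_seq_id;

static bool seq_id_in_bounds(llama_seq_id seq_id, uint32_t n_seq_max) {
    if (seq_id < 0 || (uint32_t) seq_id >= n_seq_max) {
        fprintf(stderr, "invalid seq_id = %d >= n_seq_max = %u\n", seq_id, n_seq_max);
        return false;
    }
    return true;
}
```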

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci
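
A minimal sketch of the lazy-copy pattern, not the actual implementation: a sequence copy between streams is recorded as a pending op and only materialized once the context synchronizes, so it does not stall in-flight work.

```cpp
// Hedged sketch; names are illustrative.
#include <cstdint>
#include <vector>

struct stream_copy {
    uint32_t src;
    uint32_t dst;
};

struct kv_cache_streams {
    std::vector<stream_copy> pending;

    // called by a seq_cp-style operation: cheap, nothing is moved yet
    void copy_stream(uint32_t src, uint32_t dst) {
        pending.push_back({src, dst});
    }

    // called once after the context synchronizes
    void apply_pending_copies() {
        for (const auto & cp : pending) {
            // ... copy the K/V buffers of stream cp.src into stream cp.dst ...
            (void) cp;
        }
        pending.clear();
    }
};
```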

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci
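
A sketch of the standard technique for a C++ implementation behind a C API: catch everything at the extern "C" boundary and translate it into an error code, since an exception unwinding into C callers is undefined behavior. The function name below is illustrative, not a real llama.cpp symbol.

```cpp
// Hedged sketch of the catch-at-the-boundary pattern.
#include <cstdio>
#include <exception>

static int do_state_load_impl() {
    // ... internal C++ code that may throw (allocations, format checks, ...) ...
    return 0;
}

extern "C" int llama_example_state_load(void) {
    try {
        return do_state_load_impl();
    } catch (const std::exception & err) {
        fprintf(stderr, "%s: error: %s\n", __func__, err.what());
        return -1;
    } catch (...) {
        fprintf(stderr, "%s: unknown error\n", __func__);
        return -1;
    }
}
```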

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel
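
With a per-sequence stream dimension the attention operands become 4D (e.g. something like [head_dim, n_kv, n_head, n_stream]), which is why the CUDA FlashAttention kernels needed 4D support. An illustrative offset helper in the ggml [ne0, ne1, ne2, ne3] dimension convention, not taken from the kernels themselves:

```cpp
// Hedged sketch: row-major offset into a 4D tensor with extents ne0..ne2
// (ne3 is implied by the index range of i3).
#include <cstddef>

static size_t tensor_offset_4d(size_t ne0, size_t ne1, size_t ne2,
                               size_t i0,  size_t i1,  size_t i2, size_t i3) {
    return ((i3 * ne2 + i2) * ne1 + i1) * ne0 + i0;
}
```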

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <redacted>
30 files changed:
common/arg.cpp
common/common.cpp
common/common.h
examples/embedding/embedding.cpp
examples/parallel/parallel.cpp
ggml/src/ggml-cuda/fattn-common.cuh
ggml/src/ggml-cuda/fattn-mma-f16.cuh
ggml/src/ggml-cuda/fattn-tile-f16.cu
ggml/src/ggml-cuda/fattn-tile-f32.cu
ggml/src/ggml-cuda/fattn-vec-f16.cuh
ggml/src/ggml-cuda/fattn-vec-f32.cuh
ggml/src/ggml-cuda/fattn-wmma-f16.cu
ggml/src/ggml-cuda/ggml-cuda.cu
include/llama.h
src/llama-batch.cpp
src/llama-batch.h
src/llama-context.cpp
src/llama-cparams.h
src/llama-graph.cpp
src/llama-graph.h
src/llama-hparams.cpp
src/llama-hparams.h
src/llama-kv-cache-unified-iswa.cpp
src/llama-kv-cache-unified-iswa.h
src/llama-kv-cache-unified.cpp
src/llama-kv-cache-unified.h
src/llama-memory-hybrid.cpp
src/llama-model.cpp
tests/test-backend-ops.cpp
tools/batched-bench/batched-bench.cpp