vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.
The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.
When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.
We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).
compilade [Tue, 1 Oct 2024 06:31:36 +0000 (02:31 -0400)]
convert : refactor rope_freqs generation (#9396)
* convert : refactor rope_freqs generation
This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion
MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.
gguf-py : add long and short RoPE factors to tensor mappings
Empty, but the key names are used to populate the mappings.
vulkan : argsort barriers must be under uniform control flow (ggml/951)
a return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on some others (i.e. "smaller" GPUs).
BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.
common : ensure llama_batch size does not exceed max size (#9668)
A crash was observed when the number of tokens added to a batch exceeds
llama_batch size. An assertion in llama_batch_add was added to protect
against llama_batch size overflow.
Gabe Goodhart [Wed, 25 Sep 2024 07:06:52 +0000 (01:06 -0600)]
llama : add IBM Granite MoE architecture (#9438)
* feat(gguf-py): Add granitemoe architecture
This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add GraniteMoeModel
GraniteMoe has the same configuration deltas as Granite
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <redacted>
* fix(granitemoe convert): Split the double-sized input layer into gate and up
After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.
Branch: GraniteMoE
Co-Authored-By: alex.brooks@ibm.com Signed-off-by: Gabe Goodhart <redacted>
* feat(granitemoe): Implement granitemoe
GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <redacted>
* Typo fix in docstring
Co-Authored-By: ggerganov@gmail.com Co-authored-by: Georgi Gerganov <redacted> Signed-off-by: Gabe Goodhart <redacted>
* fix(conversion): Simplify tensor name mapping in conversion
Branch: GraniteMoE
Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Remove unused tensor name mappings
Branch: GraniteMoE
Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <redacted>
* fix(convert): Sanity check on merged FFN tensor sizes
Branch: GraniteMoE
Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <redacted>
* fix: Allow "output" layer in granite moe architecture (convert and cpp)
Branch: GraniteMoE
Co-Authored-By: git@compilade.net Signed-off-by: Gabe Goodhart <redacted>
* fix(granite): Add missing 'output' tensor for Granite
This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture
Branch: GraniteMoE
Signed-off-by: Gabe Goodhart <redacted>
---------
Signed-off-by: Gabe Goodhart <redacted> Co-authored-by: Georgi Gerganov <redacted>
Max Krasnyansky [Mon, 23 Sep 2024 18:42:43 +0000 (11:42 -0700)]
threads: improve ggml_barrier scaling with large number of threads (#9598)
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.
Resurect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write
in the normal case and just use thread-fence as originally intended.
---
Here is the original description and suggestions from Willy Tarreau :
There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.
Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on a 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.
Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.
Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.
It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.
Daniel Bevenius [Wed, 18 Sep 2024 11:42:36 +0000 (13:42 +0200)]
llama : use reserve/emplace_back in sampler_sample (#9534)
This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.
The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
Max Krasnyansky [Tue, 17 Sep 2024 08:19:46 +0000 (01:19 -0700)]
threadpool : skip polling for unused threads (#9461)
* threadpool: skip polling for unused threads
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).
n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written
from one thread and read from other threads (not a race conditions).
* threadpool: further simplify and improve ggml_barrier
Avoid using strict memory order while polling, yet make sure that all threads go through
full memory barrier (memory fence) on ggml_barrier entrace and exit.
* threads: add simple barrier test
This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
* threadpool: improve thread sync for new-graphs
Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.
* threadpool: improve abort handling
Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.
Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.
While at it add an explicit atomic_load for n_threads_cur for consistency.
* test-barrier: release threadpool before releasing the context
fixes use-after-free detected by gcc thread-sanitizer on x86-64
for some reason llvm sanitizer is not detecting this issue.
Gabe Goodhart [Tue, 17 Sep 2024 06:44:58 +0000 (00:44 -0600)]
llama : support IBM Granite architecture (#9412)
* feat(gguf-py): Add Granite model and params to gguf-py
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* feat(convert_hf_to_gguf): Add registration and param setup for Granite
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): Add config parsing for Granite multiplier params
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* feat(llama.cpp): First pass at full port of granite deviations from llama
Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Determine granite language 3b instruct by vocab size
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel
The defaults in LlamaModel are needed for Granite as well
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Switch Granite param names to use _scale for consistency
Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale
The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.
Branch: GraniteLM
Signed-off-by: Gabe Goodhart <redacted>
* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams