git.djapps.eu Git - pkg/ggml/sources/llama.cpp/log
Brian [Mon, 5 Aug 2024 11:15:28 +0000 (21:15 +1000)]
py: Add more authorship metadata from model card (#8810)
* py: add more authorship metadata from model card
* fixup! py: add more authorship metadata from model card
fairydreaming [Mon, 5 Aug 2024 07:38:01 +0000 (09:38 +0200)]
Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support (#8858)
* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token
* llama : find Llama-3.1 <|eom_id|> token id during vocab loading
* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation
---------
Co-authored-by: Stanisław Szymczyk <redacted>
stduhpf [Mon, 5 Aug 2024 06:18:27 +0000 (08:18 +0200)]
cmake: fix paths for vulkan shaders compilation on Windows (#8573)
* Vulkan-shaders: attempt fix compilation on windows
* fix mismatched parenthesis
BarfingLemurs [Mon, 5 Aug 2024 05:54:10 +0000 (01:54 -0400)]
readme : update model list (#8851)
Georgi Gerganov [Mon, 5 Aug 2024 05:53:39 +0000 (08:53 +0300)]
llama : better replace_all (#8852)
0cc4m [Mon, 5 Aug 2024 05:52:55 +0000 (07:52 +0200)]
vulkan : fix Quantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855)
* Fix Vulkan mul mat vec invalid results when ncols < warp size
* Only run backend ops mul mat vec block size test if block size not already covered
Georgi Gerganov [Sun, 4 Aug 2024 16:13:25 +0000 (19:13 +0300)]
sync : ggml
ggml-ci
0cc4m [Sun, 4 Aug 2024 15:28:08 +0000 (17:28 +0200)]
vulkan : implement Stable Diffusion operators (ggml/904)
* Fix Vulkan repeat op
* Implement Vulkan concat op
* Delete old Vulkan shader generator
* Implement Vulkan im2col op
* Implement Vulkan unary gelu_quick op
* Implement Vulkan group_norm op
* Implement Vulkan timestep_embedding op
* Implement Vulkan upscale op
* Fix Vulkan vk_context tensor extra index issue
* Fix Vulkan matmul shader parameter bug
* Properly fix Vulkan matmul shader parameter bug
* Add Vulkan ADD f16 + f32 -> f16 operator support
* Implement Vulkan tanh op
* Fix Vulkan group count too large Validation error on non-Nvidia GPUs
* Throw error when too much memory is requested
* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs
* Fix matmul MMQ condition
* Implement Vulkan pad op
* Fix Vulkan crash when tensor is used multiple times in a compute graph
* Add Vulkan CONCAT f16 + f16 -> f16 op
* Add Vulkan LEAKY_RELU op
Daniel Bevenius [Mon, 29 Jul 2024 13:06:06 +0000 (15:06 +0200)]
ggml : move c parameter comment to ggml_rope_ext (ggml/901)
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).
Signed-off-by: Daniel Bevenius <redacted>
wangshuai09 [Mon, 5 Aug 2024 04:22:30 +0000 (12:22 +0800)]
cann: support q4_0 model (#8822)
Brandon Squizzato [Sun, 4 Aug 2024 18:17:16 +0000 (14:17 -0400)]
Install curl in runtime layer (#8693)
ardfork [Sun, 4 Aug 2024 18:16:23 +0000 (18:16 +0000)]
Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params
* Add fallback for max_tokens
Brian Cunnie [Sun, 4 Aug 2024 10:55:03 +0000 (03:55 -0700)]
batched-bench : handle empty `-npl` (#8839)
* [example] batched-bench "segmentation fault"
When `llama-batched-bench` is invoked _without_ setting `-npl`, the "number
of parallel prompts", it segfaults.
The segfault is caused by invoking `max_element()` on the zero-length
vector `n_pl`.
This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size to 1;
otherwise it is set to the result of `max_element()`, as before.
Fixes the following crash, observed when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:
```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
69 llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
70
71 // ensure enough sequences are available
-> 72 ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```
* Update examples/batched-bench/batched-bench.cpp
Co-authored-by: compilade <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: compilade <redacted>
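The guard described in this entry can be sketched as a Python analog (the actual fix is in C++ in batched-bench.cpp; `max_seq` is a hypothetical helper name for illustration):

```python
def max_seq(n_pl: list[int]) -> int:
    # analog of the C++ fix: calling max_element() on an empty vector is
    # undefined behavior, so fall back to 1 sequence when no parallel-prompt
    # counts were given
    return max(n_pl) if n_pl else 1
```

With the guard, an empty `-npl` yields a sequence size of 1 instead of a crash.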
Daniel Bevenius [Sat, 3 Aug 2024 13:07:47 +0000 (15:07 +0200)]
baby-llama : remove duplicate vector include
Georgi Gerganov [Sun, 4 Aug 2024 02:53:20 +0000 (05:53 +0300)]
flake.lock: Update (#8847)
jdomke [Sat, 3 Aug 2024 16:34:41 +0000 (01:34 +0900)]
ggml : reading the runtime sve config of the cpu (#8709)
* ggml : reading the runtime sve config of the cpu
* change to one time init to prevent performance drop
* prefix variable to avoid possible conflicts
* revert xxhash fix and add brackets
---------
Co-authored-by: domke <redacted>
Sigbjørn Skjæret [Fri, 2 Aug 2024 19:11:39 +0000 (21:11 +0200)]
Fix conversion of unnormalized BF16->BF16 weights (#7843)
* add truncate_bf16
* truncate intermediate fp32 if converting bf16 to bf16
* fix masking in __compute_fp32_to_bf16
* np.int16 no longer used
* missing cast and additional numpy 2.x fix
* ggml-impl : do not flush bf16 subnormals to zero
* ggml : add reference fp32 to bf16 conversion
The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.
* gguf-py : remove flush to zero for bf16 subnormals
* gguf-py : remove float32 truncation to bf16
Rounding achieves the same thing in the cases where this was used.
* missed prototype update in merge
* merge cleanup
---------
Co-authored-by: Francis Couture-Harpin <redacted>
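The rounding referred to in the last bullet can be sketched in pure Python (a simplified sketch of round-to-nearest-even fp32-to-bf16 conversion, ignoring NaN handling; not the exact gguf-py code):

```python
import struct

def fp32_to_bf16(x: float) -> int:
    # reinterpret the fp32 bits, then round-to-nearest-even to the upper
    # 16 bits; subnormals pass through without being flushed to zero,
    # matching the reference conversion behavior
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u += 0x7FFF + ((u >> 16) & 1)
    return (u >> 16) & 0xFFFF
```

For example, 1.0 (fp32 bits 0x3F800000) maps to the bf16 encoding 0x3F80.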
Mengqing Cao [Fri, 2 Aug 2024 08:50:53 +0000 (16:50 +0800)]
cann: Fix ggml_cann_im2col for 1D im2col (#8819)
* fix ggml_cann_im2col for 1D im2col
* fix build warning
Ouadie EL FAROUKI [Fri, 2 Aug 2024 00:55:17 +0000 (01:55 +0100)]
[SYCL] Fixing wrong VDR iq4nl value (#8812)
matteo [Thu, 1 Aug 2024 21:28:28 +0000 (23:28 +0200)]
ggml-cuda: Adding support for unified memory (#8035)
* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
Co-authored-by: Johannes Gäßler <redacted>
* adding one more case where the PR should not be enabled
---------
Co-authored-by: matteo serva <redacted>
Co-authored-by: Johannes Gäßler <redacted>
Alex O'Connell [Thu, 1 Aug 2024 16:53:46 +0000 (12:53 -0400)]
Build: Only include execinfo.h on linux systems that support it (#8783)
* Only enable backtrace on GLIBC linux systems
* fix missing file from copy
* use glibc macro instead of defining a custom one
slaren [Thu, 1 Aug 2024 13:26:22 +0000 (15:26 +0200)]
cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (#8800)
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X
* update asserts
* only use dmmv for supported types
* add test
wangshuai09 [Thu, 1 Aug 2024 02:39:05 +0000 (10:39 +0800)]
cann: support q8_0 for Ascend backend (#8805)
Igor Okulist [Wed, 31 Jul 2024 23:59:09 +0000 (18:59 -0500)]
server : update llama-server embedding flag documentation (#8779)
Fixes #8763
Clint Herron [Wed, 31 Jul 2024 19:51:06 +0000 (15:51 -0400)]
Build: Fix potential race condition (#8781)
* Fix potential race condition as pointed out by @fairydreaming in #8776
* Reference the .o rather than rebuilding every time.
* Adding in CXXFLAGS and LDFLAGS
* Removing unnecessary linker flags.
pculliton [Wed, 31 Jul 2024 15:12:10 +0000 (11:12 -0400)]
Adding Gemma 2 2B configs (#8784)
* Adding Gemma 2 2B configs
Updates to Q scaling and Gemma 2 model sizes to match v2 2B model.
* Update src/llama.cpp
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Borislav Stanimirov [Wed, 31 Jul 2024 13:40:08 +0000 (16:40 +0300)]
cmake : fix use of external ggml (#8787)
Someone [Tue, 30 Jul 2024 20:35:30 +0000 (23:35 +0300)]
nix: cuda: rely on propagatedBuildInputs (#8772)
Listing individual outputs is no longer necessary to reduce the runtime closure size after https://github.com/NixOS/nixpkgs/pull/323056.
Brian [Tue, 30 Jul 2024 14:57:03 +0000 (00:57 +1000)]
py: add_array() will not add to kv store if value is an empty array (#8774)
* gguf_writer.py: add_array() should not add to kv store if empty
* Apply suggestions from code review
I was wondering if there was a specific reason for `if val`, but good to hear we can safely use `len(val) == 0`
Co-authored-by: compilade <redacted>
---------
Co-authored-by: compilade <redacted>
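The guard discussed in this entry can be sketched with a minimal hypothetical writer (not the actual GGUFWriter API; `KVStore` is an illustrative stand-in):

```python
class KVStore:
    # minimal sketch of the add_array() behavior: empty arrays are skipped
    def __init__(self):
        self.kv = {}

    def add_array(self, key, val):
        # len(val) == 0 is preferred over `if not val`: it raises loudly
        # if a non-sequence is passed by mistake
        if len(val) == 0:
            return
        self.kv[key] = list(val)
```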
l3utterfly [Tue, 30 Jul 2024 14:40:18 +0000 (23:40 +0900)]
added android implementation of ggml_print_backtrace_symbols (#8751)
* added android implementation of ggml_print_backtrace_symbols
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
* Update ggml/src/ggml.c
Co-authored-by: slaren <redacted>
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Tue, 30 Jul 2024 12:58:57 +0000 (15:58 +0300)]
flake.lock: Update (#8729)
wangshuai09 [Tue, 30 Jul 2024 10:37:35 +0000 (18:37 +0800)]
cann: update cmake (#8765)
zhentaoyu [Tue, 30 Jul 2024 06:56:51 +0000 (14:56 +0800)]
[SYCL] Add `TIMESTEP_EMBEDDING` OP (#8707)
Signed-off-by: zhentaoyu <redacted>
CarterLi999 [Mon, 29 Jul 2024 16:38:34 +0000 (00:38 +0800)]
ggml: bugfix: inactive elements must not be treated as agnostic for RISC-V vector (#8748)
In this code, we want inactive elements to retain the value they
previously held when mask[i] is false, so we should use the undisturbed
policy. With the default agnostic policy of the RVV intrinsics, these
values may either be kept or be overwritten with 1s.
Co-authored-by: carter.li <redacted>
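The undisturbed-vs-agnostic distinction above can be illustrated with a conceptual Python analog (not the RVV intrinsic code; `masked_update` is an illustrative name):

```python
def masked_update(old, new, mask):
    # "undisturbed" semantics: where mask[i] is false, the destination
    # keeps its previous value; under the agnostic policy those lanes
    # would be unspecified (kept, or filled with all-ones)
    return [n if m else o for o, n, m in zip(old, new, mask)]
```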
R0CKSTAR [Mon, 29 Jul 2024 12:56:12 +0000 (20:56 +0800)]
cuda : organize vendor-specific headers into vendors directory (#8746)
Signed-off-by: Xiaodong Ye <redacted>
Meng, Hengyu [Mon, 29 Jul 2024 02:50:27 +0000 (10:50 +0800)]
[SYCL] add conv support (#8688)
Johannes Gäßler [Sun, 28 Jul 2024 20:32:44 +0000 (22:32 +0200)]
cmake: use 1 more thread for non-ggml in CI (#8740)
Austin [Sun, 28 Jul 2024 07:52:42 +0000 (03:52 -0400)]
chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477)
* chore: Fix compiler warnings, add help text, improve CLI options
* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions
* chore : Add ignore rule for vulkan shader generator
Signed-off-by: teleprint-me <redacted>
* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
Co-authored-by: 0cc4m <redacted>
* chore : Remove void and apply C++ style empty parameters
* chore : Remove void and apply C++ style empty parameters
---------
Signed-off-by: teleprint-me <redacted>
Co-authored-by: 0cc4m <redacted>
compilade [Sun, 28 Jul 2024 04:42:05 +0000 (00:42 -0400)]
llama : refactor session file management (#8699)
* llama : refactor session file management
* llama : saving and restoring state checks for overflow
The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.
* llama : stream from session file instead of copying into a big buffer
Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.
* llama : fix uint64_t format type
* llama : various integer type cast and format string fixes
Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.
* llama : remove _context suffix for llama_data_context
* llama : fix session file loading
llama_state_get_size cannot be used to get the max size anymore.
* llama : more graceful error handling of invalid session files
* llama : remove LLAMA_MAX_RNG_STATE
It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.
* llama : cast seq_id in comparison with unsigned n_seq_max
R0CKSTAR [Sat, 27 Jul 2024 23:41:25 +0000 (07:41 +0800)]
feat: Support Moore Threads GPU (#8383)
* Update doc for MUSA
Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in Makefile
Signed-off-by: Xiaodong Ye <redacted>
* Add GGML_MUSA in CMake
Signed-off-by: Xiaodong Ye <redacted>
* CUDA => MUSA
Signed-off-by: Xiaodong Ye <redacted>
* MUSA adds support for __vsubss4
Signed-off-by: Xiaodong Ye <redacted>
* Fix CI build failure
Signed-off-by: Xiaodong Ye <redacted>
---------
Signed-off-by: Xiaodong Ye <redacted>
Georgi Gerganov [Sat, 27 Jul 2024 15:08:31 +0000 (18:08 +0300)]
scripts : sync vulkan-shaders (#0)
Georgi Gerganov [Sat, 27 Jul 2024 14:19:35 +0000 (17:19 +0300)]
scripts : sync ggml-aarch64 sources
Georgi Gerganov [Sat, 27 Jul 2024 12:57:09 +0000 (15:57 +0300)]
ggml : add missing semicolon (#0)
ggml-ci
Georgi Gerganov [Sat, 27 Jul 2024 12:53:48 +0000 (15:53 +0300)]
sync : ggml
ggml-ci
Mahesh Madhav [Thu, 25 Jul 2024 07:54:08 +0000 (00:54 -0700)]
ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
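The tiling technique described above can be sketched in Python for a dot product (a conceptual sketch of the C scalar-path change; `dot_tiled` is an illustrative name):

```python
def dot_tiled(a, b, tile=4):
    # accumulate into `tile` independent partial sums so that a compiler
    # can keep them in separate registers and overlap the multiplies;
    # a scalar tail loop handles the remainder
    n = len(a)
    sums = [0.0] * tile
    for i in range(0, n - n % tile, tile):
        for t in range(tile):
            sums[t] += a[i + t] * b[i + t]
    s = sum(sums)
    for i in range(n - n % tile, n):
        s += a[i] * b[i]
    return s
```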
Ivan Filipov [Mon, 22 Jul 2024 11:32:02 +0000 (14:32 +0300)]
ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations
* Add support for float16 input tensors in 2d pooling operations
* code cleanup
remove unnecessary casting during srow ptr initialization
---------
Co-authored-by: vanaka11 <redacted>
Tony Wasserka [Sat, 20 Jul 2024 18:49:44 +0000 (20:49 +0200)]
vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.
Co-authored-by: Tony Wasserka <redacted>
Borislav Stanimirov [Fri, 12 Jul 2024 14:24:20 +0000 (17:24 +0300)]
cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)
Daniel Bevenius [Mon, 8 Jul 2024 10:03:42 +0000 (12:03 +0200)]
ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.
Signed-off-by: Daniel Bevenius <redacted>
Jeffrey Morgan [Sat, 27 Jul 2024 12:03:45 +0000 (05:03 -0700)]
llama : add support for llama 3.1 rope scaling factors (#8676)
* Add llama 3.1 rope scaling factors to llama conversion and inference
This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <redacted>
* address comments
* address comments
* Update src/llama.cpp
Co-authored-by: compilade <redacted>
* Update convert_hf_to_gguf.py
Co-authored-by: compilade <redacted>
---------
Co-authored-by: compilade <redacted>
Georgi Gerganov [Sat, 27 Jul 2024 11:59:29 +0000 (14:59 +0300)]
llama : add function for model-based max number of graph nodes (#8622)
* llama : model-based max number of graph nodes
ggml-ci
* llama : disable 405B max_nodes path due to lack of complaints
ggml-ci
Daniel Bevenius [Sat, 27 Jul 2024 10:45:02 +0000 (12:45 +0200)]
common : add --no-warmup option for main/llama-cli (#8712)
This commit adds a --no-warmup option for llama-cli.
The motivation for this is that it can be convenient to skip the
warmup llama_decode call when debugging.
Signed-off-by: Daniel Bevenius <redacted>
wangshuai09 [Sat, 27 Jul 2024 08:36:44 +0000 (16:36 +0800)]
cann: Fix Multi-NPU execution error (#8710)
* cann: fix multi-npu exec error
* cann: update comment for ggml_backend_cann_supports_buft
slaren [Sat, 27 Jul 2024 02:41:55 +0000 (04:41 +0200)]
ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost
* fix unreachable code warnings after GGML_ASSERT(false)
* GGML_ASSERT(false) -> GGML_ABORT("fatal error")
* GGML_ABORT use format string
Judd [Fri, 26 Jul 2024 08:38:12 +0000 (16:38 +0800)]
llama : fix order of parameters (#8706)
usage of `aclrtGetMemInfo` is correct:
https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html
Co-authored-by: Judd <redacted>
Yaiko [Thu, 25 Jul 2024 22:10:16 +0000 (18:10 -0400)]
server : add Speech Recognition & Synthesis to UI (#8679)
* server : add Speech Recognition & Synthesis to UI
* server : add Speech Recognition & Synthesis to UI (fixes)
Xuan Son Nguyen [Thu, 25 Jul 2024 21:49:39 +0000 (23:49 +0200)]
examples : export-lora : fix issue with quantized base models (#8687)
DavidKorczynski [Thu, 25 Jul 2024 21:23:05 +0000 (22:23 +0100)]
ggml: handle ggml_init failure to fix NULL pointer deref (#8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer dereference will happen later in the code during a call to `ggml_set_no_alloc`.
This fixes it by bailing out if no context is found.
Georgi Gerganov [Thu, 25 Jul 2024 16:57:31 +0000 (19:57 +0300)]
llama : fix build + fix fabs compile warnings (#8683)
ggml-ci
Andreas (Andi) Kunar [Thu, 25 Jul 2024 16:01:00 +0000 (18:01 +0200)]
ggml : fix build on Windows with Snapdragon X (#8531)
* Improvements for Windows with Snapdragon X
* Revert "Improvements for Windows with Snapdragon X"
This reverts commit
bf21397ae5ea7c73d3494db3b91505599909227d .
* Improvements for Windows with Snapdragon X
* WOA build clarifications
* Windows on ARM build clarifications
* cmake build for Windows clarifications
* Update docs/build.md
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Thu, 25 Jul 2024 15:57:44 +0000 (18:57 +0300)]
tests : fix printfs (#8068)
Chen Xi [Thu, 25 Jul 2024 11:45:18 +0000 (11:45 +0000)]
[SYCL] fix multi-gpu issue on sycl (#8554)
---------
Signed-off-by: Chen Xi <redacted>
Co-authored-by: Meng, Hengyu <redacted>
Georgi Gerganov [Thu, 25 Jul 2024 09:37:42 +0000 (12:37 +0300)]
ggml : add and use ggml_cpu_has_llamafile() (#8664)
Xuan Son Nguyen [Thu, 25 Jul 2024 08:39:04 +0000 (10:39 +0200)]
examples : remove `finetune` and `train-text-from-scratch` (#8669)
* examples : remove finetune and train-text-from-scratch
* fix build
* update help message
* fix small typo for export-lora
Ujjawal Panchal [Thu, 25 Jul 2024 08:13:27 +0000 (13:43 +0530)]
docs : Quantum -> Quantized (#8666)
* docfix: imatrix readme, quantum models -> quantized models.
* docfix: server readme: quantum models -> quantized models.
Fan Shupei [Thu, 25 Jul 2024 07:21:09 +0000 (15:21 +0800)]
llama: use sliding window for phi3 (#8627)
* use sliding window for phi3
* fix typo, "data_swa" -> "data"
* [convert_hf_to_gguf.py] add phi3 sliding window
MorganRO8 [Wed, 24 Jul 2024 16:48:00 +0000 (12:48 -0400)]
readme : update games list (#8673)
Added link to game I made that depends on llama
Joe Todd [Wed, 24 Jul 2024 13:36:00 +0000 (14:36 +0100)]
Build Llama SYCL Intel with static libs (#8668)
Ensure SYCL CI builds both static & dynamic libs for testing purposes
Signed-off-by: Joe Todd <redacted>
Thorsten Sommer [Wed, 24 Jul 2024 12:52:30 +0000 (14:52 +0200)]
readme : update UI list [no ci] (#8505)
Xuan Son Nguyen [Wed, 24 Jul 2024 11:48:46 +0000 (13:48 +0200)]
llama : fix `llama_chat_format_single` for mistral (#8657)
* fix `llama_chat_format_single` for mistral
* fix typo
* use printf
Joe Todd [Wed, 24 Jul 2024 10:55:26 +0000 (11:55 +0100)]
Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (#8667)
Xuan Son Nguyen [Wed, 24 Jul 2024 09:25:19 +0000 (11:25 +0200)]
add llama_lora_adapter_clear (#8653)
Xuan Son Nguyen [Tue, 23 Jul 2024 21:48:37 +0000 (23:48 +0200)]
examples : Fix `llama-export-lora` example (#8607)
* fix export-lora example
* add more logging
* reject merging subset
* better check
* typo
Vali Malinoiu [Tue, 23 Jul 2024 14:37:42 +0000 (17:37 +0300)]
server : fix URL.parse in the UI (#8646)
Joe Todd [Tue, 23 Jul 2024 13:58:37 +0000 (14:58 +0100)]
sycl : Add support for non-release DPC++ & oneMKL (#8644)
* Update cmake to support nvidia hardware & open-source compiler
---------
Signed-off-by: Joe Todd <redacted>
Georgi Gerganov [Tue, 23 Jul 2024 10:10:17 +0000 (13:10 +0300)]
llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling
ggml-ci
* llama : move grammar code into llama-grammar
ggml-ci
* cont
ggml-ci
* cont : pre-fetch rules
* cont
ggml-ci
* llama : deprecate llama_sample_grammar
* llama : move tokenizers into llama-vocab
ggml-ci
* make : update llama.cpp deps [no ci]
* llama : redirect external API to internal APIs
ggml-ci
* llama : suffix the internal APIs with "_impl"
ggml-ci
* llama : clean-up
0cc4m [Tue, 23 Jul 2024 08:56:49 +0000 (10:56 +0200)]
Vulkan IQ4_NL Support (#8613)
* Fix Vulkan matmul tests compile errors
* Add Vulkan IQ4_NL support
* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
Jeroen Mostert [Tue, 23 Jul 2024 08:50:40 +0000 (10:50 +0200)]
Allow all RDNA2 archs to use sdot4 intrinsic (#8629)
The check gating the use of `__builtin_amdgcn_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, so let's use it.
Georgi Gerganov [Tue, 23 Jul 2024 08:28:38 +0000 (11:28 +0300)]
contrib : clarify PR squashing + module names (#8630)
* contrib : clarify PR squashing
* contrib : fix typo + add list of modules
luoyu-intel [Tue, 23 Jul 2024 07:43:28 +0000 (07:43 +0000)]
[SYCL] fix scratch size of softmax (#8642)
Keke Han [Mon, 22 Jul 2024 16:43:43 +0000 (00:43 +0800)]
llama : fix codeshell support (#8599)
* llama : fix codeshell support
* llama : move codeshell after smollm below to respect the enum order
Jason Stillerman [Mon, 22 Jul 2024 14:43:01 +0000 (10:43 -0400)]
llama : add support for SmolLm pre-tokenizer (#8609)
* Adding SmolLM Pre Tokenizer
* Update convert_hf_to_gguf_update.py
Co-authored-by: compilade <redacted>
* Update src/llama.cpp
Co-authored-by: compilade <redacted>
* handle regex
* removed .inp and .out ggufs
---------
Co-authored-by: compilade <redacted>
Jiří Podivín [Mon, 22 Jul 2024 13:44:53 +0000 (15:44 +0200)]
*.py: Stylistic adjustments for python (#8233)
* Superfluous parens in conditionals were removed.
* Unused args in function were removed.
* Replaced unused `idx` var with `_`
* Initializing file_format and format_version attributes
* Renaming constant to capitals
* Preventing redefinition of the `f` var
Signed-off-by: Jiri Podivin <redacted>
Georgi Gerganov [Mon, 22 Jul 2024 10:33:22 +0000 (13:33 +0300)]
llama : allow overrides for tokenizer flags (#8614)
ggml-ci
Georgi Gerganov [Mon, 22 Jul 2024 10:32:49 +0000 (13:32 +0300)]
tests : re-enable tokenizer tests (#8611)
* models : remove duplicated gpt-2 vocab
* models : remove old stablelm vocab
* tests : re-enable MPT tokenizer tests
* tests : re-enable DeepSeek tokenizer tests
* cmake : sort
ggml-ci
Douglas Hanley [Mon, 22 Jul 2024 08:06:17 +0000 (03:06 -0500)]
llama : add Mistral Nemo inference support (#8604)
Jan Boon [Mon, 22 Jul 2024 08:02:09 +0000 (16:02 +0800)]
server : update doc to clarify n_keep when there is bos token (#8619)
Mark Zhuang [Mon, 22 Jul 2024 07:56:45 +0000 (15:56 +0800)]
ggml: fix compile error for RISC-V (#8623)
devojony [Mon, 22 Jul 2024 06:54:42 +0000 (14:54 +0800)]
examples: fix android example cannot be generated continuously (#8621)
When generation ends, `completion_loop()` should return NULL, not an empty string
Georgi Gerganov [Sun, 21 Jul 2024 13:45:10 +0000 (16:45 +0300)]
flake.lock: Update (#8610)
M-A [Sun, 21 Jul 2024 02:09:17 +0000 (22:09 -0400)]
examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:
- Move each example into its own function. This makes the code much
easier to read and understand.
- Make the program easy to only run one test by commenting out function
calls in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool,
instead of hardcoding the returned values. This makes the code more
copy-pastable.
- Add error checking, so that the program exits with 1 if the LLM didn't
return the expected values. It's very useful for checking correctness.
Testing:
- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
- I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
- Llama-3 failed about a third of the time in example_concurrent: it
only returned one call instead of 3. Even for F16.
Potential follow ups:
- Do not fix the prompt encoding yet. Surprisingly it mostly works even
if the prompt encoding is not model optimized.
- Add chained answer and response.
Test only change.
compilade [Sun, 21 Jul 2024 01:58:49 +0000 (21:58 -0400)]
gguf-py : fix some metadata name extraction edge cases (#8591)
* gguf-py : fix some metadata name extraction edge cases
* convert_lora : use the lora dir for the model card path
* gguf-py : more metadata edge cases fixes
Multiple finetune versions are now joined together,
and the removal of the basename annotation on trailing versions
is more robust.
* gguf-py : add more name metadata extraction tests
* convert_lora : fix default filename
The default filename was previously hardcoded.
* convert_hf : Model.fname_out can no longer be None
* gguf-py : do not use title case for naming convention
Some models use acronyms in lowercase,
which can't be title-cased like other words,
so it's best to simply use the same case
as in the original model name.
Note that the size label still has an uppercased suffix
to make it distinguishable from the context size of a finetune.
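The acronym problem motivating the last point is easy to demonstrate (a minimal illustration, not gguf-py code):

```python
# title-casing destroys acronym capitalization, which is why the naming
# convention keeps the original case from the model name instead
assert "LoRA".title() == "Lora"            # acronym case lost
assert "DeepSeek-V2".title() == "Deepseek-V2"
```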
compilade [Sun, 21 Jul 2024 01:53:01 +0000 (21:53 -0400)]
convert_hf : fix Gemma v1 conversion (#8597)
* convert_hf : fix Gemma v1 conversion
* convert_hf : allow renaming tokens, but with a warning
* convert_hf : fix Gemma v1 not setting BOS and EOS tokens
Johannes Gäßler [Sat, 20 Jul 2024 20:25:26 +0000 (22:25 +0200)]
CUDA: MMQ code deduplication + iquant support (#8495)
* CUDA: MMQ code deduplication + iquant support
* 1 less parallel job for CI build
Georgi Gerganov [Sat, 20 Jul 2024 14:15:42 +0000 (17:15 +0300)]
gguf : handle null name during init (#8587)
Michael Coppola [Sat, 20 Jul 2024 13:43:51 +0000 (09:43 -0400)]
llama : add support for Tekken pre-tokenizer (#8579)
* llama : Added support for Tekken pre-tokenizer (#8577)
Removed unneeded `vocab.tokenizer_clean_spaces` assignment
* llama : fix order of pre-tokenizers
* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer
---------
Co-authored-by: Georgi Gerganov <redacted>
Huifeng Ou [Sat, 20 Jul 2024 13:09:37 +0000 (09:09 -0400)]
llama.swiftui: fix end of generation bug (#8268)
* fix continuing generating blank lines after getting EOT token or EOS token from LLM
* change variable name to is_done (variable name suggested by ggerganov)
* minor : fix trailing whitespace
* minor : add space
---------
Co-authored-by: Georgi Gerganov <redacted>
Brian [Sat, 20 Jul 2024 07:35:25 +0000 (17:35 +1000)]
gguf_dump.py: fix markdown kv array print (#8588)
* gguf_dump.py: fix markdown kv array print
* Update gguf-py/scripts/gguf_dump.py
Co-authored-by: compilade <redacted>
* gguf_dump.py: refactor kv array string handling
* gguf_dump.py: escape backticks inside of strings
* gguf_dump.py: inline code markdown escape handler added
>>> escape_markdown_inline_code("hello world")
'`hello world`'
>>> escape_markdown_inline_code("hello ` world")
'``hello ` world``'
* gguf_dump.py: handle edge case about backticks on start or end of a string
---------
Co-authored-by: compilade <redacted>
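One possible implementation consistent with the doctest shown above (a sketch; the actual helper in gguf_dump.py may differ in details):

```python
def escape_markdown_inline_code(value_string: str) -> str:
    # inline code containing a backtick needs a double-backtick fence,
    # and a leading/trailing backtick additionally needs space padding
    if "`" not in value_string:
        return f"`{value_string}`"
    if value_string.startswith("`") or value_string.endswith("`"):
        value_string = f" {value_string} "
    return f"``{value_string}``"
```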
slaren [Fri, 19 Jul 2024 15:17:27 +0000 (17:17 +0200)]
ggml : fix quant dot product with odd number of blocks (#8549)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix odd blocks for ARM_NEON (#8556)
* ggml : fix iq4_nl dot product with odd number of blocks
* ggml : fix q4_1
* ggml : fix q5_0
* ggml : fix q5_1
* ggml : fix iq4_nl metal
ggml-ci
* ggml : fix q4_0
* ggml : fix q8_0
ggml-ci
* ggml : remove special Q4_0 code for first 2 blocks
* ggml : fix sumf redefinition
---------
Co-authored-by: slaren <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Brian [Fri, 19 Jul 2024 14:04:38 +0000 (00:04 +1000)]
convert-*.py: remove add_name from ChatGLMModel class (#8590)