Shelby Jenkins [Tue, 4 Feb 2025 11:20:55 +0000 (05:20 -0600)]
readme : add llm_client Rust crate to readme bindings (#11628)
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite a while, so I figured it is fair to add it now.
It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform, with the goal of providing the easiest possible user experience.
It also integrates model presets and chooses the largest quant that fits the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from Hugging Face.
So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
Christian Kastner [Mon, 3 Feb 2025 23:17:15 +0000 (00:17 +0100)]
cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.
This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem to be much value in factoring it
out, or even requiring it.
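For example, when building from a tarball without git, the build number can be passed directly on the cmake command line (the value here is arbitrary):
```console
$ cmake -B build -DGGML_BUILD_NUMBER=1234
$ cmake --build build
```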
Daniel Bevenius [Mon, 3 Feb 2025 15:45:38 +0000 (16:45 +0100)]
server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.
The motivation for this is that when using a debug build, the server
would crash and terminate when an exception was thrown, as the exception
was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set, cpp_httplib will
not call the exception handler, which would normally return a 500 error
to the client. This caused tests to fail when using a debug build.
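As a rough sketch of the behavior in question (not the server's actual code, and the exact `set_exception_handler` signature varies across cpp-httplib versions): with exceptions enabled, an uncaught exception from a request handler is routed into the exception handler, which can produce a 500 response instead of terminating the process.
```cpp
#include <exception>
#include <stdexcept>
#include "httplib.h"

int main() {
    httplib::Server svr;

    svr.Get("/boom", [](const httplib::Request &, httplib::Response &) {
        // With CPPHTTPLIB_NO_EXCEPTIONS defined this would propagate and,
        // in a debug build, terminate the server process.
        throw std::runtime_error("something went wrong");
    });

    // With exceptions enabled, this handler runs instead and the client
    // receives a 500 error.
    svr.set_exception_handler([](const httplib::Request &, httplib::Response & res, std::exception_ptr ep) {
        res.status = 500;
        try {
            if (ep) std::rethrow_exception(ep);
        } catch (const std::exception & e) {
            res.set_content(e.what(), "text/plain");
        }
    });

    svr.listen("127.0.0.1", 8080);
}
```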
Steve Grubb [Fri, 31 Jan 2025 05:58:55 +0000 (00:58 -0500)]
common: Add missing va_end (#11529)
The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementations, not calling va_end
has no consequences. For others it could leak memory.
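A minimal sketch of the pattern being fixed (not the exact function from common): when a va_list is copied with va_copy so the arguments can be traversed twice, the copy needs its own va_end.
```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

static std::string fmt_string(const char * fmt, ...) {
    va_list args;
    va_start(args, fmt);
    va_list args_copy;
    va_copy(args_copy, args);                          // the first vsnprintf consumes args
    const int len = vsnprintf(nullptr, 0, fmt, args);  // measure the required length
    std::string buf(len, '\0');
    vsnprintf(buf.data(), len + 1, fmt, args_copy);    // format using the copy
    va_end(args_copy);                                 // the call that was missing
    va_end(args);
    return buf;
}

int main() {
    printf("%s\n", fmt_string("%d + %d = %d", 1, 2, 3).c_str());
}
```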
Daniel Bevenius [Fri, 31 Jan 2025 05:04:53 +0000 (06:04 +0100)]
server : update help metrics processing/deferred (#11512)
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.
Currently the returned metrics look like this:
```console
# HELP llamacpp:requests_processing Number of request processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of request deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
With this commit, the metrics will look like this:
```console
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
Daniel Bevenius [Thu, 30 Jan 2025 10:05:00 +0000 (11:05 +0100)]
server : use lambda instead of std::bind (#11507)
This commit replaces the two usages of `std::bind` with lambdas for the
`callback_new_task` and `callback_update_slots` callback functions.
The motivation for this change is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers).
Lambdas are also more readable (though perhaps that is subjective), and
they are recommended over `std::bind` in modern C++.
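For illustration, a simplified before/after (the real callbacks in server.cpp operate on server tasks; the types here are stand-ins):
```cpp
#include <cstdio>
#include <functional>

struct server_context {
    void process_single_task(int id) { printf("processing task %d\n", id); }
    void update_slots() { printf("updating slots\n"); }
};

int main() {
    server_context ctx_server;

    // Before: std::bind with a placeholder for the task argument.
    std::function<void(int)> callback_new_task =
        std::bind(&server_context::process_single_task, &ctx_server, std::placeholders::_1);

    // After: an equivalent lambda, consistent with the other handlers.
    callback_new_task = [&ctx_server](int id) { ctx_server.process_single_task(id); };

    std::function<void()> callback_update_slots =
        [&ctx_server]() { ctx_server.update_slots(); };

    callback_new_task(42);
    callback_update_slots();
}
```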
Daniel Bevenius [Thu, 30 Jan 2025 04:48:14 +0000 (05:48 +0100)]
server : update json snippets in README.md [no ci] (#11492)
This commit updates some of the JSON snippets in the README.md file and
removes the `json` language tag from the code blocks.
The motivation for this change is that invalid JSON in a code snippet is
highlighted in red, which can make it somewhat difficult to read and can
be a little distracting.
Daniel Bevenius [Wed, 29 Jan 2025 15:34:18 +0000 (16:34 +0100)]
server : update auto gen files comments [no ci] (#11484)
* server : update auto gen files comments
This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.
The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269bca75b2d08119c653512cd20b4ea2ba ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").
* squash! server : update auto gen files comments [no ci]
Move comments about file generation to README.md.
* squash! server : update auto gen files comments [no ci]
Remove the comments in server.cpp that mention that information
can be found in the README.md file.
peidaqi [Tue, 28 Jan 2025 23:03:42 +0000 (16:03 -0700)]
server : Fixed wrong function name in llamacpp server unit test (#11473)
The test_completion_stream_with_openai_library() function was actually running with stream=False by default, and test_completion_with_openai_library() with stream=True, so the function names were swapped to match their behavior.
uvos [Tue, 28 Jan 2025 22:06:32 +0000 (23:06 +0100)]
HIP: Suppress transformation warning in softmax.cu
Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM cannot unroll the loops here.
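An illustrative device kernel (not the actual softmax.cu code) showing why the warning fires:
```cpp
// ncols_template == 0 means the column count is only known at runtime, so
// the loop's trip count is not a compile-time constant and "#pragma unroll"
// cannot be honored, which is what LLVM warns about.
template <int ncols_template>
__global__ void scale_row(const float * x, float * dst, const int ncols_param) {
    const int ncols = ncols_template == 0 ? ncols_param : ncols_template;

#pragma unroll
    for (int col = 0; col < ncols; ++col) {
        dst[col] = x[col] * 2.0f;
    }
}
```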
Nikita Sarychev [Tue, 28 Jan 2025 15:42:20 +0000 (07:42 -0800)]
HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (#11080)
This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.
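A hypothetical sketch of such a gate (the actual check in the HIP backend may differ, e.g. it could rely on rocBLAS version macros at compile time):
```cpp
#include <cstdlib>
#include <string>
#include <rocblas/rocblas.h>

static void rocblas_init_if_needed() {
    size_t len = 0;
    rocblas_get_version_string_size(&len);
    std::string ver(len, '\0');
    rocblas_get_version_string(ver.data(), len);

    // Only versions before 4.0.0 have the multiple instantiation bug, so
    // newer versions skip the eager rocblas_initialize() call and avoid the
    // startup cost and VRAM spent loading all tensile objects.
    if (std::atoi(ver.c_str()) < 4) {
        rocblas_initialize();
    }
}
```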
Akarshan Biswas [Tue, 28 Jan 2025 09:56:58 +0000 (15:26 +0530)]
SYCL : SOFTMAX F16 mask support and other fixes (#11261)
Implemented ggml_sycl_op_soft_max() F16 src1 (mask) support, for which a pragma deprecation warning was added during #5021.
To do this, it had to be decoupled from ggml_sycl_op_flatten, which always considered src1 to be of fp32 type (many OP functions depend on this).
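The core idea, sketched as a plain C++ template (the real SYCL kernel is considerably more involved, and T_mask would be float or sycl::half):
```cpp
#include <algorithm>
#include <cmath>

// Making the kernel generic over the mask type removes the assumption
// that src1 is always fp32.
template <typename T_mask>
void soft_max_masked(const float * x, const T_mask * mask, float * dst, int n) {
    float max_val = -INFINITY;
    for (int i = 0; i < n; ++i) {
        max_val = std::max(max_val, x[i] + float(mask[i]));
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        dst[i] = std::exp(x[i] + float(mask[i]) - max_val);
        sum += dst[i];
    }
    for (int i = 0; i < n; ++i) {
        dst[i] /= sum;
    }
}
```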
Michael Engel [Tue, 28 Jan 2025 08:32:40 +0000 (09:32 +0100)]
Handle missing model in CLI parameters for llama-run (#11399)
The HTTP client in llama-run only prints an error when the download of
a resource fails. If the model name in the CLI parameter list is missing,
this causes the application to crash.
In order to prevent this, a check for the required model parameter has been
added and errors for resource downloads get propagated to the caller.
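A hedged sketch of the two changes (the names here are illustrative, not llama-run's actual functions):
```cpp
#include <cstdio>
#include <string>

// Returns non-zero on failure so the caller can react, instead of only
// printing an error inside the HTTP client.
static int download_resource(const std::string & model) {
    // ... perform the HTTP download ...
    return 0;
}

int main(int argc, char ** argv) {
    // Check the required model parameter up front instead of crashing later.
    if (argc < 2 || std::string(argv[1]).empty()) {
        fprintf(stderr, "error: missing required model parameter\n");
        return 1;
    }
    if (download_resource(argv[1]) != 0) {
        fprintf(stderr, "error: failed to download '%s'\n", argv[1]);
        return 1;
    }
    return 0;
}
```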
Ihar Hrachyshka [Mon, 27 Jan 2025 07:41:59 +0000 (02:41 -0500)]
metal: Handle null returned from MTLCreateSystemDefaultDevice() (#11441)
This fixes a segmentation fault when running tests on systems where no
Metal devices are available (for example, when not linked with the Core
Graphics framework, or otherwise).