Xuan Son Nguyen [Sat, 15 Jun 2024 16:53:40 +0000 (18:53 +0200)]
Add `cvector-generator` example (#7514)
* add control-vector-generator
* calc diff
* add comments
* proof-of-concept stdlib implementation
Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.
* param parsing, refactor, comments
Added basic command-line parameters for outfile and one each positive/negative prompt.
Refactored some messy code in PCA computation and GGUF exporting.
Left a bunch of comments regarding further work needed.
* example template completions
Implements an example template set built from the positive/negative prompts like the control vector Python implementation.
* add multi prompts, multi-thread for PCA
* fix mem error
* add debugs
* fix matrix transpose multiplication
you have got to be kidding me
* preliminary template/multiprompt support
model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish
* fix zero output & param parsing, functional templating
fixed a bug where the output file had no tensor data/was all zero
fixed a bug where single hyphen flags were not being correctly parsed
implements creation of templated prompts from input (still need to adapt based on model)
* fix square_diff matmul index range and CRLF->LF line endings
fixed a logic error where square_diff would not multiply all rows
fixed a formatting error where the provided completions.txt had CRLF line endings
* add command-line args for num threads, num completions file lines, always reload model
refactored a few things and did what the commit message says on the tin
* code aestheticization
* fix compiler warnings
* in-series multithreading for prompt embedding?
added commented-out code to attempt to start implementing mutlithreading for embedding in main
* remove unnecessary multithreading
* interim fix memory leak
* translated everything but PCA (I think)
* tentatively translate the rest
* fix ggml errors and make new ones
at least it compiles and runs
* fix cb_eval
* temporary commit while I move dev environments
it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent
* update debug statements
* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped
* update comments
* (wip) refactor
* clean up PCA ggml implementation
* fix shape of v_diff_original
* add n_batch for pca
* working version
* remember to copy back the last_eigenvector
* fix n_completions
* bring back n_completions
* default n_pca_batch to 20
* fix macos build
* add to makefile all targets
* use ggml_format_name
* add readme
* fix .editorconfig
* use ggml_backend_tensor_copy
* attemp to fix compile problem on mac
* fix compile warn
* reuse allocr
* move param parser to common
* better error handling
* clean up a bit
* add print_usage
* shorten help msg
* beautify help msg
* escape prompt by default
* change compile target to llama-cvector-generator
slaren [Thu, 13 Jun 2024 01:11:35 +0000 (03:11 +0200)]
move BLAS to a separate backend (#6210)
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type if used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
Nicolás Pérez [Sun, 9 Jun 2024 15:24:29 +0000 (11:24 -0400)]
docs: Added initial PR template with directions for doc only changes and squash merges [no ci] (#7700)
This commit adds pull_request_template.md and CONTRIBUTING.md . It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci] and how to format PR title and descriptions.
Co-authored-by: Brian <redacted> Co-authored-by: compilade <redacted>
compilade [Sun, 9 Jun 2024 02:47:25 +0000 (22:47 -0400)]
convert-hf : match model part name prefix and suffix (#7687)
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part we simply used the same logic as the part count to get the part names.
But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.
This commit matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models. But according to report by @teleprint-me there is still some
persistent problem, but shall do in the meantime.
agray3 [Tue, 4 Jun 2024 20:06:49 +0000 (21:06 +0100)]
Allow number of nodes in CUDA graph to change (#7738)
Previously the code would have failed to cope in the case that the
number of nodes changes in an existing CUDA graph. This fixes the
issue by removing an unnecessary conditional.
jaime-m-p [Tue, 4 Jun 2024 07:17:17 +0000 (09:17 +0200)]
Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
Georgi Gerganov [Tue, 4 Jun 2024 07:01:09 +0000 (10:01 +0300)]
ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the operating
compiling mode is not in finite maths mode. This is because during rewriting of
silu and softmax for cpu #7154 there emerged an issue where the result that was
observed when >1 slot was nondeterministic as found by @JohannesGaessler.
@LostRuins narrowed the problem down to -ffinite-math-only which was theorised
to be due to SiLU, instead of flushing small values to 0, returns NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix
* SimpleChat:DU:BringIn local helper js modules using importmap
Use it to bring in a simple trim garbage at end logic, which is
used to trim received response.
Also given that importmap assumes esm / standard js modules, so
also global variables arent implicitly available outside the
modules. So add it has a member of document for now
* SimpleChat:DU: Add trim garbage at end in loop helper
* SimpleChat:DU:TrimGarbage if unable try skip char and retry
* SimpleChat:DU: Try trim using histogram based info
TODO: May have to add max number of uniq chars in histogram at
end of learning phase.
* SimpleChat:DU: Switch trim garbage hist based to maxUniq simple
Instead of blindly building histogram for specified substring
length, and then checking if any new char within specified min
garbage length limit, NOW exit learn state when specified maxUniq
chars are found. Inturn there should be no new chars with in
the specified min garbage length required limit.
TODO: Need to track char classes like alphabets, numerals and
special/other chars.
* SimpleChat:DU: Bring in maxType to the mix along with maxUniq
Allow for more uniq chars, but then ensure that a given type of
char ie numerals or alphabets or other types dont cross the
specified maxType limit. This allows intermixed text garbage
to be identified and trimmed.
* SimpleChat:DU: Cleanup debug log messages
* SimpleChat:UI: Move html ui base helpers into its own module
Some models like llama3 found to try to be over intelligent by
repeating garbage still, but by tweaking the garbage a bit so that
it is not exactly same. So avoid setting these penalties and let
the model's default behaviour work out, as is.
Also the simple minded histogram based garbage trimming from end,
works to an extent, when the garbage is more predictable and
repeatative.
* SimpleChat:UI: Add and use a para-create-append helper
Also update the config params dump to indicate that now one needs
to use document to get hold of gMe global object, this is bcas of
moving to module type js.
Also add ui.mjs to importmap
* SimpleChat:UI: Helper to create bool button and use it wrt settings
* SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt
* SimpleChat:UI:Select: dict-name-value, value wrt default, change
Take a dict/object of name-value pairs instead of just names.
Inturn specify the actual value wrt default, rather than the
string representing that value.
Trap the needed change event rather than click wrt select.
* SimpleChat:UI: Add Div wrapped label+element helpers
Move settings related elements to use the new div wrapped ones.
* SimpleChat:UI:Add settings button and bring in settings ui
* SimpleChat:UI:Settings make boolean button text show meaning
* SimpleChat: Update a bit wrt readme and notes in du
* SimpleChat: GarbageTrim enable/disable, show trimmed part ifany
Make it easy for end user to identified the trimmed text.
Make garbage trimming logic, consider a longer repeat garbage
substring.
* SimpleChat: Cleanup a bit wrt Api end point related flow
Consolidate many of the Api end point related basic meta data into
ApiEP class.
Remove the hardcoded ApiEP/Mode settings from html+js, instead use
the generic select helper logic, inturn in the settings block.
Move helper to generate the appropriate request json string based
on ApiEP into SimpleChat class itself.
* SimpleChat:Move extracting assistant response to SimpleChat class
so also the trimming of garbage.
* SimpleChat:DU: Bring in both trim garbage logics to try trim
* SimpleChat: Cleanup readme a bit, add one more chathistory length
* SimpleChat:Stream:Initial handshake skeleton
Parse the got stream responses and try extract the data from it.
It allows for a part read to get a single data line or multiple
data line. Inturn extract the json body and inturn the delta
content/message in it.
* SimpleChat: Move handling oneshot mode server response
Move handling of the oneshot mode server response into SimpleChat.
Also add plumbing for moving multipart server response into same.
* SimpleChat: Move multi part server response handling in
* SimpleChat: Add MultiPart Response handling, common trimming
Add logic to call into multipart/stream server response handling.
Move trimming of garbage at the end into the common handle_response
helper.
Add new global flag to control between oneshot and multipart/stream
mode of fetching response. Allow same to be controlled by user.
If in multipart/stream mode, send the stream flag to the server.
* SimpleChat: show streamed generative text as it becomes available
Now that the extracting of streamed generated text is implemented,
add logic to show the same on the screen.
* SimpleChat:DU: Add NewLines helper class
To work with an array of new lines. Allow adding, appending,
shifting, ...
* SimpleChat:DU: Make NewLines shift more robust and flexible
* SimpleChat:HandleResponseMultiPart using NewLines helper
Make handle_response_multipart logic better and cleaner. Now it
allows for working with the situation, where the delta data line
got from server in stream mode, could be split up when recving,
but still the logic will handle it appropriately.
ALERT: Rather except (for now) for last data line wrt a request's
response.
* SimpleChat: Disable console debug by default by making it dummy
Parallely save a reference to the original func.
* SimpleChat:MultiPart/Stream flow cleanup
Dont try utf8-decode and newlines-add_append if no data to work on.
If there is no more data to get (ie done is set), then let NewLines
instance return line without newline at end, So that we dont miss
out on any last-data-line without newline kind of scenario.
Pass stream flag wrt utf-8 decode, so that if any multi-byte char
is only partly present in the passed buffer, it can be accounted
for along with subsequent buffer. At sametime, bcas of utf-8's
characteristics there shouldnt be any unaccounted bytes at end,
for valid block of utf8 data split across chunks, so not bothering
calling with stream set to false at end. LATER: Look at TextDecoder's
implementation, for any over intelligence, it may be doing..
If needed, one can use done flag to account wrt both cases.
* SimpleChat: Move baseUrl to Me and inturn gMe
This should allow easy updating of the base url at runtime by the
end user.
* SimpleChat:UI: Add input element helper
* SimpleChat: Add support for changing the base url
This ensures that if the user is running the server with a
different port or wants to try connect to server on a different
machine, then this can be used.
* SimpleChat: Move request headers into Me and gMe
Inturn allow Authorization to be sent, if not empty.
* SimpleChat: Rather need to use append to insert headers
* SimpleChat: Allow Authorization header to be set by end user
* SimpleChat:UI+: Return div and element wrt creatediv helpers
use it to set placeholder wrt Authorization header.
This can help ensure that data fetched till that point, can be
made use of, rather than losing it.
On some platforms, the time taken wrt generating a long response,
may lead to the network connection being broken when it enters
some user-no-interaction related power saving mode.
* SimpleChat:theResp-origMsg: Undo a prev change to fix non trim
When the response handling was moved into SimpleChat, I had changed
a flow bit unnecessarily and carelessly, which resulted in the non
trim flow, missing out on retaining the ai assistant response.
This has been fixed now.
* SimpleChat: Save message internally in handle_response itself
This ensures that throwing the caught exception again for higher
up logic, doesnt lose the response collated till that time.
Go through theResp.assistant in catch block, just to keep simple
consistency wrt backtracing just in case.