git.djapps.eu Git - pkg/ggml/sources/ggml/log
Kawrakow [Sun, 14 Jan 2024 07:45:56 +0000 (09:45 +0200)]
2-bit quantizations (llama/4897)
* imatrix: load
* imatrix: WIP
* imatrix: Add Q2_K quantization
* imatrix: also guard against Q2_K_S quantization without importance matrix
* imatrix: guard even more against low-bit quantization misuse (see the sketch after this entry)
---------
Co-authored-by: Iwan Kawrakow <redacted>
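A minimal sketch of the guard described in the bullets above, with hypothetical names (the real check lives in llama.cpp's quantization path): sub-3-bit quant types are refused when no importance matrix is available.

```cpp
#include <stdexcept>

enum class QuantType { Q2_K, Q2_K_S, Q4_K /* ... */ };

// hypothetical guard: 2-bit quantization without an imatrix loses too much
// accuracy, so fail loudly instead of silently producing a bad model
void check_imatrix_present(QuantType type, const float * imatrix_data) {
    const bool low_bit = type == QuantType::Q2_K || type == QuantType::Q2_K_S;
    if (low_bit && imatrix_data == nullptr) {
        throw std::runtime_error(
            "2-bit quantization requires an importance matrix; "
            "pass an imatrix file or choose a larger quant type");
    }
}
```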
Georgi Gerganov [Sat, 13 Jan 2024 22:14:03 +0000 (00:14 +0200)]
sync : whisper.cpp
Georgi Gerganov [Sat, 13 Jan 2024 15:47:40 +0000 (17:47 +0200)]
whisper : load the model into multiple buffers of max size 1GB (whisper/1763)
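The idea, as a minimal sketch with a hypothetical helper (the actual loading code lives in whisper.cpp): split the total weight size into chunks capped at 1 GiB and allocate one backend buffer per chunk, which sidesteps per-buffer size limits on some backends.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// hypothetical helper: compute the sizes of the backend buffers needed to
// hold total_size bytes of weights, each buffer at most 1 GiB
std::vector<size_t> split_into_buffers(size_t total_size) {
    const size_t max_buf = 1024ull * 1024ull * 1024ull; // 1 GiB cap per buffer
    std::vector<size_t> sizes;
    for (size_t off = 0; off < total_size; off += max_buf) {
        sizes.push_back(std::min(max_buf, total_size - off));
    }
    return sizes; // the loader then allocates one buffer per entry
}
```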
Georgi Gerganov [Sat, 13 Jan 2024 22:09:33 +0000 (00:09 +0200)]
sync : llama.cpp
Georgi Gerganov [Sat, 13 Jan 2024 22:09:26 +0000 (00:09 +0200)]
examples : adapt to metal API
Johannes Gäßler [Sat, 13 Jan 2024 20:41:37 +0000 (21:41 +0100)]
ggml: cache sin/cos for RoPE (llama/4908)
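The optimization, sketched under assumptions (hypothetical struct and layout; see the PR for the real code): RoPE angles depend only on the position and the dimension pair, so the sin/cos values can be precomputed once and reused across every row and head instead of being recomputed in the innermost loop.

```cpp
#include <cmath>
#include <vector>

// hypothetical cache: one sin/cos pair per (position, dimension pair)
struct RopeCache {
    std::vector<float> sin_v, cos_v; // n_pos * (n_dims/2) entries each
};

RopeCache build_rope_cache(int n_pos, int n_dims, float freq_base = 10000.0f) {
    RopeCache c;
    c.sin_v.resize((size_t) n_pos * (n_dims/2));
    c.cos_v.resize((size_t) n_pos * (n_dims/2));
    for (int p = 0; p < n_pos; ++p) {
        for (int i = 0; i < n_dims/2; ++i) {
            // standard RoPE angle: theta = p * base^(-2i/n_dims)
            const float theta = p * std::pow(freq_base, -2.0f*i/n_dims);
            c.sin_v[(size_t) p*(n_dims/2) + i] = std::sin(theta);
            c.cos_v[(size_t) p*(n_dims/2) + i] = std::cos(theta);
        }
    }
    return c; // the rotation kernel then only does multiplies and adds
}
```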
Georgi Gerganov [Sat, 13 Jan 2024 18:45:45 +0000 (20:45 +0200)]
metal : remove old API (llama/4919)
ggml-ci
Georgi Gerganov [Sat, 13 Jan 2024 16:46:37 +0000 (18:46 +0200)]
metal : disable log for loaded kernels (llama/4794)
texmex76 [Sat, 13 Jan 2024 16:06:20 +0000 (17:06 +0100)]
gguf : fix potential infinite for-loop (llama/4600)
Co-authored-by: Bernhard Gstrein <redacted>
Georgi Gerganov [Sat, 13 Jan 2024 16:03:45 +0000 (18:03 +0200)]
metal : refactor kernel loading code (llama/4794)
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls
ggml-ci
* metal : fix Metal3 family check
ggml-ci
* metal : check for simdgroup matrix mul. feature
ggml-ci
Johannes Gäßler [Fri, 12 Jan 2024 19:38:54 +0000 (20:38 +0100)]
CUDA: faster q8_0 -> f16 dequantization (llama/4895)
Georgi Gerganov [Fri, 12 Jan 2024 20:01:22 +0000 (22:01 +0200)]
sync : whisper.cpp
Georgi Gerganov [Fri, 12 Jan 2024 12:02:30 +0000 (14:02 +0200)]
ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)
* ggml : fix 32-bit ARM compat
* ggml : fix fix
* ggml : fix fix fix
Georgi Gerganov [Fri, 12 Jan 2024 11:37:38 +0000 (13:37 +0200)]
whisper : fix segment length with params.no_timestamps == true
George Hindle [Fri, 12 Jan 2024 11:24:38 +0000 (11:24 +0000)]
params : don't compute timestamps when not printing them (whisper/1755)
slaren [Fri, 12 Jan 2024 19:38:34 +0000 (20:38 +0100)]
backend_sched : fix assignments
ggml-ci
Georgi Gerganov [Fri, 12 Jan 2024 19:28:02 +0000 (21:28 +0200)]
examples : remove obsolete starcoder mmap example
ggml-ci
slaren [Fri, 12 Jan 2024 19:23:47 +0000 (20:23 +0100)]
update gpt-2 example
Georgi Gerganov [Fri, 12 Jan 2024 19:15:10 +0000 (21:15 +0200)]
sync : llama.cpp
ggml-ci
slaren [Fri, 12 Jan 2024 19:07:38 +0000 (20:07 +0100)]
llama : ggml-backend integration (llama/4766)
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (llama/4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (llama/4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <redacted>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu (see the sketch after this entry)
---------
Co-authored-by: Georgi Gerganov <redacted>
Co-authored-by: Johannes Gäßler <redacted>
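A minimal sketch of the copy/compute overlap named in the bullets above, using standard CUDA runtime calls rather than the llama.cpp code: page-locked host memory is what makes cudaMemcpyAsync truly asynchronous, and stream ordering lets the upload overlap with work the CPU (or another device) is doing.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// upload n floats without blocking the CPU; h_src must be pinned memory
// (cudaMallocHost), otherwise the runtime silently falls back to a
// synchronous copy and the overlap is lost
void upload_async(float * d_dst, const float * h_src, size_t n, cudaStream_t stream) {
    cudaMemcpyAsync(d_dst, h_src, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    // no sync here: compute queued on the same stream is ordered after the
    // copy, so the kernel is guaranteed to see the uploaded data
}

int main() {
    const size_t n = 1 << 20;
    float * h_buf = nullptr;
    float * d_buf = nullptr;
    cudaMallocHost((void **) &h_buf, n * sizeof(float)); // pinned host buffer
    cudaMalloc((void **) &d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    upload_async(d_buf, h_buf, n, stream);
    cudaStreamSynchronize(stream); // wait only when the result is needed

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```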
Johannes Gäßler [Fri, 12 Jan 2024 11:30:41 +0000 (12:30 +0100)]
CUDA: fix softmax compile for old CUDA versions (llama/4862)
Kawrakow [Fri, 12 Jan 2024 05:59:57 +0000 (06:59 +0100)]
Importance Matrix calculation (llama/4861)
* imatrix: 1st version
* imatrix: WIP
* Cleanup
* Update examples/imatrix/imatrix.cpp
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Iwan Kawrakow <redacted>
Co-authored-by: Georgi Gerganov <redacted>
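Conceptually, the tool accumulates per-column activation statistics for every matrix multiplication and uses them to weight quantization error. A rough sketch with hypothetical names (the real implementation hooks ggml's matmul evaluation inside examples/imatrix):

```cpp
#include <cstddef>
#include <vector>

// hypothetical accumulator: running sum of squared activations per column
struct ImatrixStats {
    std::vector<float> sums;
    int n_rows_seen = 0;
};

void accumulate(ImatrixStats & st, const float * act, int n_cols, int n_rows) {
    if (st.sums.empty()) st.sums.assign(n_cols, 0.0f);
    for (int r = 0; r < n_rows; ++r) {
        for (int c = 0; c < n_cols; ++c) {
            const float a = act[(size_t) r*n_cols + c];
            st.sums[c] += a*a; // columns driven harder matter more
        }
    }
    st.n_rows_seen += n_rows;
}
```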
Georgi Gerganov [Thu, 11 Jan 2024 19:56:20 +0000 (21:56 +0200)]
sync : whisper.cpp
Georgi Gerganov [Thu, 11 Jan 2024 19:54:17 +0000 (21:54 +0200)]
sync : ggml
Georgi Gerganov [Mon, 8 Jan 2024 14:41:28 +0000 (16:41 +0200)]
main : add cli option to disable system prints (whisper/1740)
Georgi Gerganov [Thu, 11 Jan 2024 19:49:13 +0000 (21:49 +0200)]
sync : llama.cpp
Kawrakow [Thu, 11 Jan 2024 19:39:39 +0000 (20:39 +0100)]
ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856)
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dot product (a generic AVX2 dot-product sketch follows this entry)
21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
---------
Co-authored-by: Iwan Kawrakow <redacted>
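For reference, a generic AVX2 int8 dot-product skeleton, not the actual IQ2_XS kernel (which first expands the 2-bit grid codes and signs to int8): sign-extend 8-bit lanes to 16 bits, multiply-accumulate adjacent pairs into 32-bit lanes, then reduce horizontally.

```cpp
#include <immintrin.h>

int dot_q8_avx2(const signed char * a, const signed char * b, int n /* multiple of 16 */) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 16) {
        // sign-extend 16 int8 values to int16
        const __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        const __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        // multiply int16 pairs and add adjacent products into int32 lanes
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // horizontal reduction of the 8 int32 lanes
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4e));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xb1));
    return _mm_cvtsi128_si32(s);
}
```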
Paul Tsochantaris [Thu, 11 Jan 2024 14:31:52 +0000 (14:31 +0000)]
metal : put encoder debug group behind a define (llama/4873)
Georgi Gerganov [Tue, 9 Jan 2024 17:37:08 +0000 (19:37 +0200)]
metal : improve dequantize precision to match CPU (llama/4836)
ggml-ci
Georgi Gerganov [Tue, 9 Jan 2024 08:42:06 +0000 (10:42 +0200)]
ggml : fix vld1q_s8_x4 32-bit compat (llama/4828)
* ggml : fix vld1q_s8_x4 32-bit compat
ggml-ci
* ggml : fix 32-bit ARM compat (cont)
ggml-ci
Johannes Gäßler [Tue, 9 Jan 2024 07:58:55 +0000 (08:58 +0100)]
CUDA: faster softmax via shared memory + fp16 math (llama/4742)
Georgi Gerganov [Thu, 11 Jan 2024 07:34:59 +0000 (09:34 +0200)]
metal : fix deprecation warning (#690)
Timothy Cronin [Thu, 11 Jan 2024 07:27:48 +0000 (02:27 -0500)]
ggml : remove ggml_cpy_inplace and ggml_cont_inplace (#693)
otaGran [Wed, 10 Jan 2024 16:12:18 +0000 (11:12 -0500)]
Update README.md (#692)
Jack Mousseau [Wed, 10 Jan 2024 14:19:19 +0000 (06:19 -0800)]
metal : wrap each operation in debug group (#690)
leejet [Wed, 10 Jan 2024 13:13:42 +0000 (21:13 +0800)]
ggml : change GGML_MAX_NAME at compile time (#682)
* change GGML_MAX_NAME to 128
* allow controlling the value of GGML_MAX_NAME through external macro definitions
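Presumably the mechanism is the usual guarded default, sketched here (see ggml.h for the actual definition): the header only defines the value when the build system has not already supplied one, e.g. via -DGGML_MAX_NAME=256.

```cpp
// only take the default when no -DGGML_MAX_NAME=... was passed to the compiler
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 128
#endif
```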
Halalaluyafail3 [Tue, 9 Jan 2024 16:16:37 +0000 (11:16 -0500)]
Fix execlp call (#689)
NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined, because an argument of the wrong type is passed through the variadic parameters.
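The classic shape of the fix, as a generic illustration: the list terminator passed to execlp must be a null pointer of type char *, since a bare NULL may expand to a plain integer 0 and variadic functions perform no argument conversion.

```cpp
#include <unistd.h>

int run_ls(void) {
    // the (char *) cast guarantees a pointer-sized, pointer-typed sentinel
    return execlp("ls", "ls", "-l", (char *) NULL);
}
```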
Georgi Gerganov [Mon, 8 Jan 2024 21:33:18 +0000 (23:33 +0200)]
sync : llama.cpp
Kawrakow [Mon, 8 Jan 2024 15:02:32 +0000 (16:02 +0100)]
SOTA 2-bit quants (llama/4773)
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps I'll change it
later; for now this is what we have (see the range sketch after this entry).
* iq2_xxs: ARM_NEON dot product
Somehow strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slightly faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slightly faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call in the wrong place.
* Fix bug in quantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <redacted>
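A minimal sketch of the Q8_K range restriction mentioned near the top of this entry (block scales omitted): clamping quants to -127..127 keeps the range symmetric, so kernels that flip the signs of the 8-bit values, as the IQ2_XXS AVX2 path does, can never overflow at -128.

```cpp
#include <cmath>

// symmetric 8-bit quantization: note the clamp at -127, not -128
signed char quantize_q8_sym(float x, float inv_scale) {
    int q = (int) std::nearbyint(x * inv_scale);
    if (q < -127) q = -127;
    if (q >  127) q =  127;
    return (signed char) q;
}
```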
Johannes Gäßler [Sun, 7 Jan 2024 16:24:08 +0000 (17:24 +0100)]
CUDA: fixed redundant value dequantization (llama/4809)
Konstantin Zhuravlyov [Sun, 7 Jan 2024 06:52:42 +0000 (01:52 -0500)]
ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787)
Georgi Gerganov [Fri, 5 Jan 2024 13:18:21 +0000 (15:18 +0200)]
ggml : do not sched_yield when calling BLAS (llama/4761)
* ggml : do not sched_yield when calling BLAS
ggml-ci
* ggml : fix do_yield logic
ggml-ci
* ggml : simplify do_yield logic
ggml-ci
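A minimal sketch of the do_yield idea with hypothetical names: worker threads spin between graph nodes, and yielding only pays off when another runnable thread needs the core; around BLAS calls, which already saturate the cores internally, the yield is skipped.

```cpp
#include <atomic>
#include <sched.h>

// spin until all workers report ready; yield the core only when it helps
void wait_for_workers(const std::atomic<int> & n_ready, int n_threads, bool do_yield) {
    while (n_ready.load(std::memory_order_acquire) < n_threads) {
        if (do_yield) {
            sched_yield(); // e.g. when the machine is oversubscribed
        }
        // otherwise busy-spin: the wait is expected to be very short
    }
}
```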
Johannes Gäßler [Thu, 4 Jan 2024 08:43:23 +0000 (09:43 +0100)]
Print backend name on test-backend-ops failure (llama/4751)
Georgi Gerganov [Thu, 4 Jan 2024 08:12:26 +0000 (10:12 +0200)]
ggml : include stdlib.h before intrin.h (llama/4736)
旺旺碎冰冰 [Sat, 6 Jan 2024 03:17:47 +0000 (11:17 +0800)]
fix ROCm on Windows (#683)
* fix ROCm on Windows
* keep Windows from linking m.lib
* add hip missed compile definitions
* never link m.lib
* use find_library for m
---------
Co-authored-by: slaren <redacted>
postmasters [Fri, 5 Jan 2024 15:25:38 +0000 (07:25 -0800)]
gguf : add keys for kv sizes to spec (#676)
* Add keys for kv sizes to GGUF spec
* Fix types of key_length and value_length
Erik Scholz [Fri, 5 Jan 2024 15:00:00 +0000 (16:00 +0100)]
fix : cuda order of synchronization when setting a buffer (#679)
* fix : cuda order of synchronization when setting a buffer
* also sync before memcpy
---------
Co-authored-by: slaren <redacted>
Georgi Gerganov [Fri, 5 Jan 2024 14:30:52 +0000 (16:30 +0200)]
metal : switch back to default.metallib (#681)
ggml-ci
Georgi Gerganov [Fri, 5 Jan 2024 13:36:04 +0000 (15:36 +0200)]
ggml : fix q2_k bpw in comments (#680)
Georgi Gerganov [Thu, 4 Jan 2024 11:37:25 +0000 (13:37 +0200)]
whisper : reset the "batched" timings (whisper/1721)
Finn Voorhees [Wed, 3 Jan 2024 13:39:43 +0000 (08:39 -0500)]
ggml : add error handling to graph_compute (whisper/1714)
Georgi Gerganov [Wed, 3 Jan 2024 12:50:34 +0000 (14:50 +0200)]
scripts : allow to skip commits during sync (#678)
Georgi Gerganov [Wed, 3 Jan 2024 12:18:46 +0000 (14:18 +0200)]
Update src/ggml-cuda.cu
Co-authored-by: slaren <redacted>
Georgi Gerganov [Wed, 3 Jan 2024 11:01:44 +0000 (13:01 +0200)]
cuda : mark I16 and I32 ops as unsupported
ggml-ci
Georgi Gerganov [Wed, 3 Jan 2024 09:35:46 +0000 (11:35 +0200)]
metal : add kernel_get_rows_i32
ggml-ci
Georgi Gerganov [Wed, 3 Jan 2024 09:25:18 +0000 (11:25 +0200)]
sync : llama.cpp
ggml-ci
Georgi Gerganov [Tue, 2 Jan 2024 19:07:47 +0000 (21:07 +0200)]
metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725)
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (llama/4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
* metal : optimizing ggml_mul_mat_id (wip)
* metal : minor fix
* metal : opt mul_mm_id
Georgi Gerganov [Tue, 2 Jan 2024 08:57:44 +0000 (10:57 +0200)]
metal : enable shader debugging (cmake option) (llama/4705)
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (llama/4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
ggml-ci
Georgi Gerganov [Sun, 31 Dec 2023 09:43:31 +0000 (11:43 +0200)]
ggml : add ggml_vdotq_s32 alias (llama/4715)
ggml-ci
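The alias pattern, sketched under a hypothetical name (the real ggml_vdotq_s32 may differ in detail): on CPUs with the dot-product extension use the native instruction, otherwise emulate it with widening multiplies. The emulated lane grouping differs from native vdotq_s32, but the horizontal sum, which is all the dot-product kernels reduce to, is identical.

```cpp
#include <arm_neon.h>

static inline int32x4_t my_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b) {
#if defined(__ARM_FEATURE_DOTPROD)
    return vdotq_s32(acc, a, b); // single native instruction
#else
    // widen to 16-bit products, pairwise-add into 32-bit lanes, accumulate
    const int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
    const int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
    return vaddq_s32(acc, vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi)));
#endif
}
```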
Johannes Gäßler [Sat, 30 Dec 2023 12:52:01 +0000 (13:52 +0100)]
CUDA: fixed tensor cores not being used on RDNA3 (llama/4697)
automaticcat [Sat, 30 Dec 2023 08:07:48 +0000 (15:07 +0700)]
ggml : add ggml_cpu_has_avx_vnni() (llama/4589)
* feat: add avx_vnni based on intel documents
* ggml: add avx vnni based on intel document
* llama: add avx vnni information display
* docs: add more details about using oneMKL and oneAPI for intel processors
* Update ggml.c
Fix indentation update
Co-authored-by: Georgi Gerganov <redacted>
---------
Co-authored-by: Georgi Gerganov <redacted>
Johannes Gäßler [Fri, 29 Dec 2023 22:12:53 +0000 (23:12 +0100)]
CUDA: fix tensor core logic for Pascal and HIP (llama/4682)
hydai [Fri, 29 Dec 2023 16:31:19 +0000 (00:31 +0800)]
cuda: fix vmm oom issue on NVIDIA AGX Orin (llama/4687)
Signed-off-by: hydai <redacted>
Georgi Gerganov [Wed, 3 Jan 2024 09:24:48 +0000 (11:24 +0200)]
scripts : fix sync order + metal sed
Ashraful Islam [Wed, 3 Jan 2024 09:07:30 +0000 (03:07 -0600)]
swift : add Swift Package declaration (#674)
* feat: adds swift package declaration
- allows importing ggml as a package dependency in llama.cpp and whisper.cpp
- resolves duplicate-symbol errors when importing llama.cpp and whisper.cpp as package dependencies
* fixes the src paths in package.swift
Guillaume Wenzek [Fri, 29 Dec 2023 17:07:03 +0000 (18:07 +0100)]
ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (#639)
* add more int ops
* ggml_compute_forward_dup_bytes
* add tests
* PR comments
* tests : minor indentations
---------
Co-authored-by: Georgi Gerganov <redacted>
Georgi Gerganov [Fri, 29 Dec 2023 13:17:18 +0000 (15:17 +0200)]
scripts : do not sync synced commits
Georgi Gerganov [Fri, 29 Dec 2023 13:08:38 +0000 (15:08 +0200)]
sync : whisper.cpp
Tamotsu Takahashi [Fri, 29 Dec 2023 10:23:27 +0000 (19:23 +0900)]
ci : build with CLBlast + ggml-opencl use GGML_API (whisper/1576)
* Build with CLBlast
* Declare GGML_API
After rebasing, examples/talk-llama failed:
"D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) ->
"D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) ->
(Link target) ->
llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context *,void (__cdecl*)(float,void *),void *,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]
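The missing piece was the export annotation on the ggml-opencl declarations. A generic sketch of such a macro (the real GGML_API definition in ggml.h may differ): without it, MSVC neither exports nor imports the ggml_cl_* symbols and the link fails as above.

```cpp
// generic export-macro pattern for a library built as a DLL on Windows
#ifdef GGML_SHARED
#    if defined(_WIN32) && !defined(__MINGW32__)
#        ifdef GGML_BUILD
#            define GGML_API __declspec(dllexport)
#        else
#            define GGML_API __declspec(dllimport)
#        endif
#    else
#        define GGML_API __attribute__ ((visibility ("default")))
#    endif
#else
#    define GGML_API
#endif

struct ggml_tensor;
GGML_API void ggml_cl_free_data(const struct ggml_tensor * tensor);
```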
Georgi Gerganov [Fri, 29 Dec 2023 13:04:01 +0000 (15:04 +0200)]
scripts : print list of sync commits
Georgi Gerganov [Fri, 29 Dec 2023 12:58:01 +0000 (14:58 +0200)]
scripts : fix format-patch range
Georgi Gerganov [Fri, 29 Dec 2023 12:32:00 +0000 (14:32 +0200)]
scripts : do not sync commits from this repo
Georgi Gerganov [Wed, 27 Dec 2023 10:07:02 +0000 (12:07 +0200)]
sync : whisper.cpp
ggml-ci
bobqianic [Sat, 23 Dec 2023 12:02:58 +0000 (12:02 +0000)]
whisper : Replace WHISPER_PRINT_DEBUG with WHISPER_LOG_DEBUG (whisper/1681)
Georgi Gerganov [Wed, 27 Dec 2023 09:56:42 +0000 (11:56 +0200)]
scripts : add sync-whisper-am.sh
bssrdf [Fri, 29 Dec 2023 08:32:31 +0000 (03:32 -0500)]
ggml : fix some mul mat cases + add tests for src1 F16 (#669)
* fixed mul-mat error for old GPUs
* style fixes
* add mul mat src1 f16 test cases, fix more cases
ggml-ci
---------
Co-authored-by: bssrdf <redacted>
Co-authored-by: slaren <redacted>
Astariul [Fri, 29 Dec 2023 07:43:41 +0000 (16:43 +0900)]
gpt-2 : update README.md (#671)
Mayank Kumar Pal [Fri, 29 Dec 2023 07:35:19 +0000 (13:05 +0530)]
readme : add steps for Android compilation and inference (#664)
* Update README.md for android instructions
* fix minor typo
Georgi Gerganov [Wed, 27 Dec 2023 09:42:45 +0000 (11:42 +0200)]
scripts : fix sed in sync-llama-am.sh
Georgi Gerganov [Wed, 27 Dec 2023 09:06:32 +0000 (11:06 +0200)]
sync : llama.cpp
ggml-ci
Georgi Gerganov [Wed, 27 Dec 2023 09:02:13 +0000 (11:02 +0200)]
ggml : fix dot product for ARM (llama/4630)
ggml-ci
slaren [Tue, 26 Dec 2023 20:23:59 +0000 (21:23 +0100)]
cuda : fix vmm pool with multi GPU (llama/4620)
* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
WillCorticesAI [Tue, 26 Dec 2023 10:42:08 +0000 (05:42 -0500)]
Update comment for AdamW implementation reference. (llama/4604)
Co-authored-by: Will Findley <redacted>
FantasyGmm [Tue, 26 Dec 2023 10:38:36 +0000 (18:38 +0800)]
Fix new CUDA10 compilation errors (llama/4635)
Georgi Gerganov [Mon, 25 Dec 2023 09:25:19 +0000 (11:25 +0200)]
sync : llama.cpp
ggml-ci
Georgi Gerganov [Mon, 25 Dec 2023 09:00:39 +0000 (11:00 +0200)]
cmake : update CUDA build to support VMM
slaren [Sun, 24 Dec 2023 13:34:22 +0000 (14:34 +0100)]
cuda : improve cuda pool efficiency using virtual memory (llama/4606)
* cuda : improve cuda pool efficiency using virtual memory
* fix mixtral
* fix cmake build
* check for vmm support, disable for hip
ggml-ci
* fix hip build
* clarify granularity
* move all caps to g_device_caps
* refactor error checking
* add cuda_pool_alloc, refactor most pool allocations
ggml-ci
* fix hip build
* CUBLAS_TF32_TENSOR_OP_MATH is not a macro
* more hip crap
* llama : fix msvc warnings
* ggml : fix msvc warnings
* minor
* minor
* cuda : fallback to CPU on host buffer alloc fail
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <redacted>
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <redacted>
* ensure allocations are always aligned
* act_size -> actual_size
---------
Co-authored-by: Johannes Gäßler <redacted>
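A minimal single-device sketch of the virtual memory technique, using the CUDA driver API (error handling omitted; the real pool adds alignment, lookahead, and per-device state): reserve a large virtual range once, then back it with physical chunks of the recommended granularity as the pool grows, so growth never has to reallocate and copy.

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev; cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = 0;

    size_t gran = 0; // physical chunks must be a multiple of this
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);

    const size_t max_size = 1ull << 32;            // 4 GiB of address space
    CUdeviceptr base = 0;
    cuMemAddressReserve(&base, max_size, 0, 0, 0); // no physical memory yet

    // back the first chunk with physical memory and map it into the range
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);
    cuMemMap(base, gran, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base, gran, &access, 1);

    printf("pool base %llx, first chunk %zu bytes\n", (unsigned long long) base, gran);
    return 0;
}
```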
slaren [Sat, 23 Dec 2023 15:10:51 +0000 (16:10 +0100)]
fallback to CPU buffer if host buffer alloc fails (llama/4610)
Johannes Gäßler [Sat, 23 Dec 2023 08:16:33 +0000 (09:16 +0100)]
CUDA: fixed row rounding for 0 tensor splits (llama/4594)
Georgi Gerganov [Mon, 25 Dec 2023 08:58:54 +0000 (10:58 +0200)]
scripts : fix PR number parsing during sync
Georgi Gerganov [Sun, 24 Dec 2023 13:49:12 +0000 (15:49 +0200)]
scripts : improve llama sync patch
Georgi Gerganov [Sat, 23 Dec 2023 16:05:29 +0000 (18:05 +0200)]
scripts : sync tests / headers
Georgi Gerganov [Sat, 23 Dec 2023 15:54:42 +0000 (17:54 +0200)]
scripts : remove exit
Georgi Gerganov [Sat, 23 Dec 2023 15:54:07 +0000 (17:54 +0200)]
scripts : fix PR number sed
Georgi Gerganov [Sat, 23 Dec 2023 15:49:08 +0000 (17:49 +0200)]
scripts : add sync-llama-am.sh
Georgi Gerganov [Fri, 22 Dec 2023 15:53:50 +0000 (17:53 +0200)]
sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec) (#662)
* sync : llama.cpp (ggml_scale, ggml_row_size, ggml_mul_mat_set_prec)
ggml-ci
* ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)
* llama : fix platforms without mmap (#4578)
* llama : fix platforms without mmap
* win32 : limit prefetch size to the file size
* fix win32 error clobber, unnecessary std::string in std::runtime_error
* ggml-alloc : fix ggml_tallocr_is_own
* whisper : minor
* ggml : cuda jetson + arm quants warnings
ggml-ci
---------
Co-authored-by: Herman Semenov <redacted>
Co-authored-by: slaren <redacted>
slaren [Mon, 18 Dec 2023 17:05:43 +0000 (18:05 +0100)]
cuda : fix synchronization with tensor get/set (#659)
leejet [Mon, 18 Dec 2023 16:46:10 +0000 (00:46 +0800)]
cuda : fix im2col_f32_f16 (#658)
chengchi [Thu, 14 Dec 2023 08:12:31 +0000 (16:12 +0800)]
cmake : change installation path of static libraries to standard directory (#645)
Co-authored-by: Cheng-Chi Wang <redacted>
Georgi Gerganov [Wed, 13 Dec 2023 19:53:20 +0000 (21:53 +0200)]
sync : llama (mul_mat_id + get_rows kernels, typos) (#649)
* sync : llama (mul_mat_id + get_rows kernels, typos)
ggml-ci
* cuda : restore im2col
ggml-ci
* metal : fix accuracy of dequantization kernels
* cuda : restore correct im2col kernel
ggml-ci
* metal : fix moe test by reducing the expert size
ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
---------
Co-authored-by: slaren <redacted>