ggml : full broadcast in mul, add, div + ggml_mul_mat_id, ggml_argsort, ggml_top_k (#625)
* ggml : support broadcasting in dim 0 in add and mul
* add cuda add/mul broadcast impl
add configurable eps to cuda norm
* add metal impl
ggml-ci
* deduplicate code in cuda impl
* try to optimize cuda impl
* ggml : support broadcasting in ggml_div
* test-backend-ops : allow filtering by op and backend
* ggml-cuda : add ggml_div impl
* ggml : add ggml_mul_mat_id, ggml_sort, ggml_top_k (CPU only)
* fix ggml_div threads
* fix ggml_div with accelerate
* ggml_sort -> ggml_argsort
* whatever
* actually fix accelerate div
* disable opencl ci
* ci : disable ctest error check temporarily until we fix backend ops test
* cmake : propagete GGML_USE_xxx compile flags with ggml target
* whisper : utlize new ggml_add broadcast for dim 0
* cmake : adendum to
ee666ae9
* ggml_backend_graph_copy : fix leak
* ggml_cuda : add ggml_sum_rows impl
* metal : add ggml_div
* metal : add ggml_sum_rows
* ggml_cuda : add ggml_argsort impl
* move kernel
* metal : add ggml_argsort
* mul_mat_id : fix missing init task
* cuda/metal: fix argsort synchronization
* metal : add ggml_mul_mat_id
* ggml-cuda : add mul_mat_id for f16 + tensor cores
* test-backend-ops : add tests for quants mat mul
* ggml : fix q5_0 and q5_1 hist stats
* test-backend-ops : use smaller matrices to avoid automatic offloading, add mat-vec tests
* metal : fix alibi to match the CPU behavior
* metal : check dimensions in supports_op
* test-backend-ops : reduce error threshold for mat muls
* ggml-cuda : simplify dequantize funs, add supports_op by type for mul_mat_id
* ggml-cuda : support quantized types in mul_mat_id with cublas
* ggml-cuda : add fallback over CPU for mul_mat_id
* test-backend-ops : increase mul mat error threshold
* cleanup
ggml-ci
* test-backend-ops : fix usage
* cleanup
* ci : re-enable tests
* metal : fix compile warnings
---------
Co-authored-by: Georgi Gerganov <redacted>