ggml : improve ADD_REL_POS perf in SAM by doing it inplace + broadcast BLAS mul_mat (#466)
* Improve ADD_REL_POS perf in SAM by doing it inplace
- Add unit tests for the ADD_REL_POS operation
- I am not sure this is a valid implementation, as we reuse the src0
memory in order to avoid copying it
- When running SAM with the "Example output" command, image, point and
16 threads, this reduces the cumulative time of the ADD_REL_POS operation
from 1000-1100 ms to 180-200 ms
- There is further room for optimization in the access patterns used in
the implementation of the operation
* Add non-inplace version for the GGML_OP_ADD_REL_POS
* Fix map_unary warnings and refactor LayerNorm2d + remove ggml_cont in it
* Fix Mac printf format warnings
* sam : add ggml_graph_print() comment
* ggml : add broadcast support for BLAS ggml_mul_mat() (#460)
* Remove unneeded build_forward_expand from add-rel-pos unit test
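
For illustration only, the trade-off between the inplace and non-inplace variants can be sketched in plain C. This is not the ggml implementation: `add_copy` and `add_inplace` are hypothetical helpers, and the real ADD_REL_POS indexing is not reproduced here. The sketch only shows that accumulating into the source buffer avoids one allocation and one full copy, at the cost of overwriting the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a ggml function): out-of-place add.
 * Allocates a result buffer and copies `a` into it before accumulating,
 * so `a` is left untouched. Returns NULL if allocation fails. */
static float *add_copy(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float *dst = malloc(n * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, a, n * sizeof *dst);   /* the copy the inplace path avoids */
    for (size_t i = 0; i < n; ++i) {
        dst[i] += b[i];
    }
    return dst;
}

/* Hypothetical helper (not a ggml function): in-place add.
 * Accumulates directly into `a`, reusing its memory as the result.
 * The original contents of `a` are lost. */
static void add_inplace(float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[i] += b[i];
    }
}
```

The inplace form is only safe when nothing else still needs the original contents of the source buffer, which is the concern raised above about reusing the src0 memory.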