* New Feature:
1. Sum_Rows:
fix cuda kernel overflow
fix block shape error when nrows too big
2. Im2Col:
Support Batch in cuda
Support f32 to f32 both in cpu && cuda
3. DepthWiseConv:
Support by Im2Col && MulMat
4. Pool_2d:
Supoort avg pooling in cuda
5. HardSigmoid:
Imp in cuda
6. HardSwish:
Imp in cuda
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* ADD POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel
nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad
nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d