* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch
- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
ggml_cuda_get_physical_warp_size() in each device function
- Use eff_nq = ne[1]*gqa_ratio (effective query count) as the threshold
for MMA vs tile dispatch. Benchmarked crossover on MI300X @ d32768 with
power-of-2 GQA models:
hsk=64 (Llama 1B, gqa=4): MMA wins at eff_nq >= 128 (+11%)
hsk=128 (Llama 3B, gqa=4): MMA wins at eff_nq >= 128 (+4%)
Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel
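
For reference, the dispatch rule above can be sketched roughly as follows. This is a minimal illustration, not the code in fattn.cu: the names fattn_kernel, FATTN_MMA_MIN_EFF_NQ and choose_fattn_kernel are hypothetical, and ne[1] is assumed to be the query tensor's token dimension.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; the real dispatch lives in fattn.cu.
enum class fattn_kernel { TILE, MMA };

// Crossover taken from the MI300X benchmarks quoted in this message.
constexpr int64_t FATTN_MMA_MIN_EFF_NQ = 128;

// n_q:       assumed to be ne[1] of the query tensor (number of query tokens)
// gqa_ratio: number of Q heads sharing one K/V head
// The effective query count is n_q * gqa_ratio. At or above the threshold the
// MMA kernel is chosen; anything smaller falls through to the tile kernel
// (there is no separate VEC path in this sketch).
constexpr fattn_kernel choose_fattn_kernel(int64_t n_q, int64_t gqa_ratio) {
    return n_q * gqa_ratio >= FATTN_MMA_MIN_EFF_NQ ? fattn_kernel::MMA
                                                   : fattn_kernel::TILE;
}
```

With gqa=4, as in the benchmarked models, this picks MMA from 32 query tokens upward and the tile kernel below that.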
* Update ggml/src/ggml-cuda/fattn.cu
* use ggml_cuda_info().devices warp_size instead of hardcoded check