ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!
* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only
* conversion now working, hf -> gguf
* working on support, now working on building graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on buildin graph
* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more
* added cls token per previous modern bert attempt, still working on checking out the rest
* fixed pre tokenizer and still working through previous pr
* working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer
* fixed pre tokenizer
* working on swa with local and global alternating attention
* some cleanup and now fails on build attn
* starting to work, and some cleanup, currently failing on last layer construction in graph build
* alternating rope implemented and modern bert graph build succeeds
* fixed asser for equal ubatch seq
* cleanup
* added mask check in vocab
* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values
* reuse variable
* removed repeat
* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL
* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...
* more modular hparam setting
* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion