Love those GPQA scores hovering around 5%, when chance alone (on 4-way multiple choice) would have got them 25%!
More recently, hybrid architectures that combine attention with other operators are gaining traction.
IBM is claiming at least a 2x inference speed-up with Bamba. Both groups say that future SSM optimizations in vLLM should bring further inference speed improvements.
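To make "attention plus other operators" concrete, here's a toy sketch of the interleaving idea in plain PyTorch. The ToySSMBlock is just a gated causal depthwise convolution standing in for a real state-space mixer, and the layer count and attention ratio are made up; none of this is Bamba's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToySSMBlock(nn.Module):
        # Stand-in for a state-space/Mamba-style mixer: a gated causal
        # depthwise convolution. Purely illustrative, not Bamba's kernels.
        def __init__(self, d_model):
            super().__init__()
            self.in_proj = nn.Linear(d_model, 2 * d_model)
            self.conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                                  padding=3, groups=d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x):                     # x: (batch, seq, d_model)
            z, gate = self.in_proj(x).chunk(2, dim=-1)
            # trim the right-hand padding so the conv stays causal
            z = self.conv(z.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
            return self.out_proj(F.silu(z) * torch.sigmoid(gate))

    class HybridBlock(nn.Module):
        # Pre-norm transformer-style block whose token mixer is either
        # self-attention or the SSM stand-in above.
        def __init__(self, d_model, n_heads, use_attention):
            super().__init__()
            self.use_attention = use_attention
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            if use_attention:
                self.mixer = nn.MultiheadAttention(d_model, n_heads,
                                                   batch_first=True)
            else:
                self.mixer = ToySSMBlock(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                     nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.norm1(x)
            if self.use_attention:                # causal mask omitted for brevity
                h, _ = self.mixer(h, h, h, need_weights=False)
            else:
                h = self.mixer(h)
            x = x + h
            return x + self.mlp(self.norm2(x))

    # Interleave: one attention layer for every three SSM-style layers
    # (an arbitrary ratio, chosen only to show the pattern).
    layers = nn.ModuleList([HybridBlock(512, 8, use_attention=(i % 4 == 3))
                            for i in range(12)])

    x = torch.randn(2, 16, 512)
    for layer in layers:
        x = layer(x)
    print(x.shape)                                # torch.Size([2, 16, 512])

The rough intuition for the speed-up claims: most layers avoid quadratic attention and a growing KV cache, so long-context inference gets cheaper, with only the few attention layers paying the usual cost.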
Btw, Bamba (the peanut snack), if given to kids at a young age, can drastically reduce the chance of peanut allergies.
This makes it sound like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.
I thought we generally named models "nB" by their parameter count and treated quantisation as a separate concern. Are there any other models that instead use the name to indicate a memory requirement?
Is this an attempt to hide that it fares poorly vs other ~18B parameter models?
EDIT: no, I just misunderstood
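For anyone else who read it the same way: the "9B" is the parameter count, and memory footprint is a separate multiplication by bytes per weight. A rough back-of-the-envelope (weights only, ignoring activations and KV/SSM state):

    params = 9e9  # ~9B parameters
    for name, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2),
                                  ("int8", 1), ("int4", 0.5)]:
        print(f"{name:>9}: ~{params * bytes_per_param / 1e9:.1f} GB")
    # fp32 ~36 GB, bf16/fp16 ~18 GB, int8 ~9 GB, int4 ~4.5 GB

Presumably the ~18 GB size of a bf16 checkpoint is what invites the "18B at 8 bits" misreading; the name still refers to parameters, not gigabytes.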