Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):
uv run --isolated --with mlx-lm python -m mlx_lm.chat
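If you'd rather script it than chat, here's a minimal sketch using mlx-lm's Python API. The model repo name is an assumption (one of the mlx-community 4-bit conversions on Hugging Face); swap in whichever you've pulled:

    from mlx_lm import load, generate

    # Downloads (on first run) and loads a 4-bit quantized model.
    # Repo name is an assumption -- any mlx-community conversion works here.
    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

    # verbose=True prints prompt and generation speeds in tok/sec.
    text = generate(
        model,
        tokenizer,
        prompt="Explain KV caching in two sentences.",
        max_tokens=256,
        verbose=True,
    )
    print(text)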
For 4-bit deepseek-r1-distill-llama-70b on a MacBook Pro M4 Max with the MLX version in LM Studio: 10.2 tok/sec on power and 4.2 tok/sec on battery / low power.
For 4-bit gemma-3-27b-it-qat I get 26.37 tok/sec on power and 9.7 tok/sec on battery / low power.
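Numbers like these are straightforward to reproduce. A rough timing harness, assuming mlx-lm is installed and the repo name below is swapped for whatever you're measuring (wall-clock over the whole call, so it slightly understates pure generation speed by including prompt processing):

    import time
    from mlx_lm import load, generate

    # Model repo is an assumption -- substitute the one you're benchmarking.
    model, tokenizer = load("mlx-community/gemma-3-27b-it-qat-4bit")

    prompt = "Write a paragraph about memory bandwidth."
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    elapsed = time.perf_counter() - start

    # Count generated tokens by re-encoding the output text.
    n_tokens = len(tokenizer.encode(text))
    print(f"{n_tokens / elapsed:.1f} tok/sec (includes prompt processing)")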
It'd be nice to know all the possible power tweaks to push those numbers higher, and to get additional insight into how LLMs work and interact with the CPU and memory.
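One practical step before comparing runs: record the machine's power state alongside each measurement so on-power and on-battery numbers don't get mixed up. A small sketch that shells out to macOS's pmset; the lowpowermode key is my assumption for how recent macOS reports Low Power Mode, so check pmset -g output on your own machine:

    import subprocess

    def power_state() -> dict:
        """Return a rough snapshot of macOS power state via pmset."""
        # 'pmset -g' dumps active power-management settings;
        # 'pmset -g batt' reports AC vs. battery power.
        settings = subprocess.run(
            ["pmset", "-g"], capture_output=True, text=True
        ).stdout
        batt = subprocess.run(
            ["pmset", "-g", "batt"], capture_output=True, text=True
        ).stdout
        # Assumption: recent macOS lists Low Power Mode as a
        # 'lowpowermode' line in the settings dump.
        low_power = any(
            line.strip().startswith("lowpowermode") and line.strip().endswith("1")
            for line in settings.splitlines()
        )
        return {"on_ac": "AC Power" in batt, "low_power_mode": low_power}

    print(power_state())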
But as a measure of what you can achieve with a course like this: does anyone know what the plot of max tok/sec vs. iPhone model looks like, and how does MLX change that plot?