Mac M Series Generative Models tech stacks

This is just a brain dump of what I know right now about doing inference and fine tuning on Apple Silicon Mac as of mid October, 2025 (lot more to cover, WIP):

mlx is Apple's open library optimized for Apple silicon
- mlx is analogous to pytorch but optimized directly for mac, so better than pytorch with mps backend/device on the same mac for most use cases.
- mlx-lm is the inference and fine tuning library for mlx language models.
- mlx-community on huggingface has a vibrant ecosystem to create models.
- The preferred way to fine tune an lm on mac (as of mid Oct 2025) is with LoRA or QLoRA fine tuning - here's the official mlx-lm doc linking to the memory section - gives some sense of the process and memory expectations (not too bad, imo) - https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/LORA.md#memory-issues
- mlx-lm also supports full fine tuning, but that will only support even smaller models and datasets for fine tuning.
- You can do inference with:
  - ollama server (only with gguf models, not mlx),
  - mlx-lm for mlx models only
  - lmstudio as server (gguf and mlx and iirc hf models directly too?)
  - pytorch with mps (directly, but usually via huggingface APIs),
  - use Apple's CoreML for some foundation models (mostly for other modalities, I think) on Mac.
  - open-webui as server (supports many backends).
  - litellm with ollama, lm studio, etc. as backend providers.
- For fine tuning, you can use
  - mlx-lm as described above
  - huggingface/pytorch/mps - but prefer to use mlx-lm if available for the model you want.
- Other random trivia:
  - mlx also supports linux (only on mac, I wd guess)
  - there is also an mlx-vlm for vision models with mlx - but not as mature as mlx-lm last I checked.
  - For other modalities - image gen, audio gen, video gen, etc. - Core ML supports some of these, and for some, huggingface/pytorch/mps is the only alternative (I have used it for image generation - and it is decent, but definitely slower than on CUDA devices).

Conclusion

For lm inference:
- for plain vanilla text to text chat - mlx-lm > ollama > others
- for more complex features (structured output, thinking vs non-thinking, vision, etc.) - ollama > mlx-lm/mlx-vlm > others
for other modalities:
- Apple's Core ML, if supported, else huggingface with mps
for LoRA fine tuning:
- mlx-lm > pytorch/mps
for full fine tuning: not sure, but I think pytorch/mps is more mature for that than mlx-lm.

All this is as of Oct 2025 - will change, over time.

Mac M Series Generative Models tech stacks

Conclusion

Comments

More from this blog

List of web Chat AI models ( as of Feb 17, 2025)

A penny for DeepSeek-R1's thoughts

Evaluating Local Models with Custom Datasets

DeepSeek-R1 thinks in Cursor

Command Palette

Conclusion

Comments

More from this blog