Skip to main content

Command Palette

Search for a command to run...

Mac M Series Generative Models tech stacks

Published
2 min read
A

I am currently working on generative AI, currently text+image experiences, educational AI generated visual guides in various mediums (comic, video, etc.). Before that, I was building llm apps with chat models, evaluating GPTs and Assistants API. Before that worked in conversational AI. Prior to that, have worked on many things product, software, AI, ML.

This is just a brain dump of what I know right now about doing inference and fine tuning on Apple Silicon Mac as of mid October, 2025 (lot more to cover, WIP):

  • mlx is Apple's open library optimized for Apple silicon
    • mlx is analogous to pytorch but optimized directly for mac, so better than pytorch with mps backend/device on the same mac for most use cases.
    • mlx-lm is the inference and fine tuning library for mlx language models.
    • mlx-community on huggingface has a vibrant ecosystem to create models.
    • The preferred way to fine tune an lm on mac (as of mid Oct 2025) is with LoRA or QLoRA fine tuning - here's the official mlx-lm doc linking to the memory section - gives some sense of the process and memory expectations (not too bad, imo) - https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/LORA.md#memory-issues
    • mlx-lm also supports full fine tuning, but that will only support even smaller models and datasets for fine tuning.
    • You can do inference with:
      • ollama server (only with gguf models, not mlx),
      • mlx-lm for mlx models only
      • lmstudio as server (gguf and mlx and iirc hf models directly too?)
      • pytorch with mps (directly, but usually via huggingface APIs),
      • use Apple's CoreML for some foundation models (mostly for other modalities, I think) on Mac.
      • open-webui as server (supports many backends).
      • litellm with ollama, lm studio, etc. as backend providers.
    • For fine tuning, you can use
      • mlx-lm as described above
      • huggingface/pytorch/mps - but prefer to use mlx-lm if available for the model you want.
    • Other random trivia:
      • mlx also supports linux (only on mac, I wd guess)
      • there is also an mlx-vlm for vision models with mlx - but not as mature as mlx-lm last I checked.
      • For other modalities - image gen, audio gen, video gen, etc. - Core ML supports some of these, and for some, huggingface/pytorch/mps is the only alternative (I have used it for image generation - and it is decent, but definitely slower than on CUDA devices).

Conclusion

  • For lm inference:
    • for plain vanilla text to text chat - mlx-lm > ollama > others
    • for more complex features (structured output, thinking vs non-thinking, vision, etc.) - ollama > mlx-lm/mlx-vlm > others
  • for other modalities:
    • Apple's Core ML, if supported, else huggingface with mps
  • for LoRA fine tuning:
    • mlx-lm > pytorch/mps
  • for full fine tuning: not sure, but I think pytorch/mps is more mature for that than mlx-lm.

All this is as of Oct 2025 - will change, over time.