Skip to main content
June 30, 20269:03

Finally, The CORRECT Way to Run Local AI on a Mac

By Samuel Gregory

About this video

SaaS is becoming an unnecessary tax on innovation. This video explores why OMLX is the definitive choice for founders looking to reclaim their data and run powerful LLMs locally on Mac hardware. Key Takeaways: - Why OMLX is superior to Ollama and LM Studio for professional Mac workflows. - The technical benefits of SSD-backed caching and LRU policies for persistent context. - How to set up agentic models like Qwen 3.6 MoE for real-world coding tasks. - A breakdown of why the M5 Max is the current sweet spot for personal AI infrastructure. - Practical steps to integrate local models into tools like Pie and Open Code.

{
  "schema": "https://opencode.ai/config.json",
  "provider": {
    "omlx": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "oMLX (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8000/v1",
        "apiKey": "YOUR-API-KEY"
      },
      "models": {
        "mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit": {
          "name": "Qwen 3.6 OptiQ 35B"
        }
      }
    }
  }
}

The Death of SaaS and the Rise of Personal Software

The era of the SaaS tax is coming to an abrupt end. For years, founders and CEOs have been told that the only way to access state-of-the-art intelligence is through a monthly subscription to a centralised provider. This is no longer the case. With the advent of M5 silicon and sophisticated local runners like OMLX, the 'Personal Software' movement is finally here.

Why Local Beats Cloud for Founders

Data is the most valuable asset any organisation possesses. Sending that data to a third-party server to be processed by a generic model is not only a security risk: it is an inefficiency. By running models locally, you ensure zero latency, total privacy, and a cost structure that scales with your hardware investment rather than your user count.

The OMLX Advantage

In my recent deep dive, I landed on OMLX as the premier solution for Mac users. It builds upon the MLX LLM library but adds a critical layer of professional features:

  1. SSD Caching: Unlike traditional setups, OMLX stores your cache on the SSD using an LRU (Least Recently Used) policy. This means cold starts are faster and your context persists across server restarts.
  2. Resource Efficiency: It avoids the bloat of electron-based alternatives. When you are running 128GB of RAM, you want that memory dedicated to the model, not the UI.
  3. Tiered Architecture: Hot blocks stay in RAM while cold blocks move to the SSD, allowing for massive context windows without crashing the system.

The Hardware Reality

To truly replace SaaS, you need the right kit. Running Mixture of Expert (MoE) models like Qwen or Gemma requires significant unified memory. My current setup on the M5 Max with 128GB of RAM allows for agentic workflows that rival GPT-4, all without a single packet leaving my local network.

Conclusion

The future of software is not a subscription: it is an asset. By investing in local infrastructure and open-source models, founders can build a proprietary intelligence stack that they actually own. It is time to stop renting your brain from Big Tech.

Transcript

We talk a lot about local LLMs on this channel. I have done so many tests running Turbo Quant, GGML versus MLX running on M5, running on M1, large RAM, small RAM. Where have I landed truly on this whole discovery piece? Where am I at June 2026 when it comes to running local LLMs? We are going to get it set up with your agent of choice, whether it is Claude, Code, Pi, Open Claude, or Hermes. And of course, I am just going to discuss my rationale behind everything and make it super clear for you to understand.

I have landed on OMLX here. This seems to be the cleanest way to run open-source local LLMs on your Mac. And there are a few reasons why. We have taken a look at MLX LLM, which is a Python library. I have taken a look at Osorus, which I do not think I have done a video on yet, but it might be of interest to you. MLX does something really special. It does not necessarily replace MLX LLM: it just adds all of the necessary things on top of it, such as running a server, plus a cheeky additional thing which effectively stores your cache on the SSD, even in between server resets.

Basically, what that means is that you are going to get faster results from cold starts of your LLM. Cache blocks persist to disk in safe tensor format. This is a two-tier architecture: hot blocks stay in RAM and cold blocks go to SSD with an LRU policy. Least recently used. Okay, so if it is older, it will push it to the back or delete it. Previously seen prefixes are restored across requests and server restarts, never recomputed.

It also gives us a really nice UI, which we will get into, especially if you are running multiple agents or different harnesses. Now, this is actually using M3 ultra 512 GB of RAM. I am using an M5 max with 128 GB of RAM, but certainly more than capable of running local LLMs. You can see here as they are building up the context, you are really not sacrificing a tremendous amount as you build up that context and that caching is kicking in, you are seeing an increase in speed there versus general native KV cache storage from MLX LM. This OMLX just adds a lot of features on top of that.

I have looked into Ollama versus LM Studio. Ollama is slowly rolling out support for MLX models. LM Studio gives you a nice UI for downloading different models and specifically targeting MLX ones. It is a great way to get started, but it is becoming very bloated and I wanted something a lot more refined. If I am running local LLMs on my machine, I want something that is less resource intensive. I do not want applications running: I want to preserve my RAM for my LLMs.

You can download it from GitHub. They have pre-releases available, but you can just download the DMG. From the settings, you can open the web dashboard. All of our models basically download from Hugging Face. You can search for the model you want, like Gemma 4 12B. I tend to search MLX and find the best quantized version. You can then just copy this URL back into settings.

For agentic work, really we are looking at mixture of expert models. We have the Qwen 3.6 mixture of expert model at 8-bit. Once downloaded, you can see it in your models. The server will give you a bunch of endpoints which tell you what you need to actually do. If we select Qwen, this will be the code that you run to get it to run with our model.

The biggest aspect to this is looking at the cache. We can see what sort of speed we are getting, how many tokens have been cached, and then the efficiency gains from that. We are using about 80 gig of RAM to do that as well. Even though the model is 36, the context has an impact on the amount of RAM that you need. Because of that SSD storing of the cache, you will not lose any of those performance gains that you have done with your work so far.

I tend to run my local models in open code because Claude code is known for context bloating. I want to preserve context because I have only limited hardware. In things like Pie where you can use multiple different providers, we have created a new one called OMLX linking to the base URL. This was helpful for me landing on OMLX. I am going to run my models this way in future. Stay tuned for more videos on training and building apps with these LLMs. Like, subscribe, and keep on vibing.