Oliver Ng. Ai experiments.

  • Gemma4 12b

    With Nvidia announcing the RTX Spark to challenge Apple’s MX reign on local LLMs, it is appropriate that Google dumped some toys on us the next day.

    Gemma4 12B is out today, in a sweet spot between their prior 8B and 26B I’m super interested to see how it stands up. 8B was a great chatbot, but a terrible tool bot. 26B seems much better but I could never run it on my M4 Mac mini 16GB. Does Google see 12B is “about right” for most spec’d out systems today – it runs on 16GB! Downloading now….

    Unexpectedly, Google added to LiteRT-LM a local LLM server / CLI!

    The LiteRT-LM CLI provides a lightweight, zero-code tool for running language models locally. We are now expanding the tool with the serve command, letting the CLI act as a drop-in local LLM server. Use this functionality with Gemma 4 12B to point any standard tool, SDK, or framework (such as OpenClaw, Hermes, OpenCode, Pi, or popular extensions like Continue and Aider) directly to your local endpoint.

    Aside from the fact that the internets are saying RTX spark is going to cost $5000+ it does not seem far away where local models, even on edge devices (looking at you iOS27),may start to be competitive with cloud for basic stuff. Not coding, not tool use, not yet. But everyday things, like voice typing, opening apps, regular analysis, asking offline questions, could be done in the very near future all on local LLM.

  • 4.8

    New Opus 4.8 came out. I saw the video announcement on YouTube but what wasn’t captured was this gem in the release notes.

    One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

    It’s nice to see a feature update focused on improving LLM alignments.

  • Choosing the right model

    Claude had a ton of new talks on youtube after their big conference. I thought the below talk had informative take-aways for model selection. If you’re on a Pro subscription you’re likely stuck mostly on Sonnet, but if you’re on Max, then you have a lot more flexibility on what model you get to run full-time.

    The video had some nice summary slides. Like the block below. Tell me that you knew all this already and you get a shiny reward.

    It was also interesting (and slightly frustrating) to know that sometimes using a newer model, was more efficient. Meaning, just because you run the Sonnet model, doesn’t mean it optimizes the tokens and you get the best result. You could run Opus 4.7, be faster, and save tokens. The explanation made sense. The newer models are smarter, so they reason better and can use tokens more efficiently because they may skip tasks, or death spiral less frequently. But… not always. So you have to evaluate.

    While this makes sense in the context of recurring tasks or prompts, what is still unclear is how might one know what model to use… prior to doing an evaluation. I find this part of the equation partly why it seems easier to just always use Opus 4.7 xhigh. How do I know what result I will get between high vs. xhigh and how my choice of this setting will impact my outcome? This part of model selection is user unfriendly. I’ve noticed that thinking is now “adaptive” rather than an option of yes/no in Opus 4.7. I imagine having the LLM decide based on prompt is likely the best path.

  • Antigravity 2.0

    Google just dropped Antigravity 2.0 and ripped out the VSCode fork. I had been using AG since its early preview because it had extremely generous free Gemini caps within the IDE, all with a VSCode interface. With 2.0, it seems, they just decided to be Claude Code with a Gemini wrapper.

    So 2.0 = Claude Code. It has scheduled tasks, connectors, MCP, projects and a primarily chat interface. Guess Google decided Anthropic won and with Codex doing the same, they couldn’t fall behind. They put a CLI behind it, and added projects, worktrees everything in Claude Desktop. And moved the IDE into another product.

    It was inevitable because I think Anthropic proved, this interface is the right interface for most people. VSCode was never going to be the UX that my wife uses. Claude Desktop while having its own confusing UX issues, is a understandable means to connect with, and interact with an LLM. So here we are. Microsoft is probably next with Copilot. It’s about time. Copilot is a hot mess. So much so that even Microsoft employees do not want to dogfood it.

  • Into the rabbit hole of a rabbit hole

    If you are a Claude Code regular you know that using it in terminal is the way to go. What seems intimidating becomes second nature over the desktop app. But unlike the desktop app which organizes your session views nicely, Code has always been at the mercy of your tmux client and terminal sessions. Super easy to start scrambling for screen real estate with all your terminal windows and tabs.

    This week, Claude released Agent View for Claude Code. By hitting the left arrow on your keyboard in a Code session, you can get into an overview screen of your Claude agents. From here, you can spawn agents to do simultaneous work. Your agents can start working on features, planning enhancements, building tests, or bug fixing – all at once. Without opening another Claude Code terminal window. It leverages this by spawning multiple git worktrees on the current working folder so that agents do not interfere with each other.

    If you’re in an existing Claude Code window, you can even background that that so that the session stays alive once you close terminal. Even crazier, is that you can have multiple terminal windows open and go from terminal (1), jump into agent view, to jump into the Claude session from terminal (2).

    What manages this, is what Anthropic is calling the Supervisor process. It’s like a motherboard for your agents that remembers state. The whole thing is very freeing as where you once needed to have multiple terminal windows open to work on the same tasks, you now can have just one terminal and agent view.

    But can I just say with a chuckle that, I feel terminal was not designed for being blown into a multi-windowed-multi-paned orchestration engine. It’s getting a bit out of hand. I often get lost in what exactly this window is showing me as I fall into a rabbit hole of terminals within terminals.

    I have by my count lost my way many times staring at a black terminal window trying to recollect where in the matrix I am. Because I have:

    • Claude Code running in every terminal window.
    • cmux managing my terminal workspaces where some workspaces have multiple 2-up or 3-up panes.
    • Claude agent view allowing me to spawn 3X more sessions in the background hidden from view.
    • Claude agent view allowing me to move within different terminal windows without leaving my terminal window
    • git branches where I have to remember what branch I had pulled and what working folder I’m in because every feature is usually on a new branch
    • git worktrees, which the agent view will use because agent view has to use worktrees so as not to conflict in the working folder
    • git worktree branches which… same as above

    Insane.

  • Non-determinism

    There is a weird shift in how you go from a world where coding is a very structured and deterministic thing, to vibe coding. As a coder you might write a script. Every time you run that script the same thing happens. every. time.

    When working with LLMs it is easy to forget that this thing is not what you think it is. It has structure like code, it has guardrails like code, it can even write structured code with guardrails! But run it enough times and you realize that it is not always doing the exact same thing.

    And I find as a coder, it can be hard to change that way of thinking. I want to think of an LLM like a new abstraction layer over code that works the exact same way.

    I have a family dashboard I created with Claude. It has three parts. Schedule, Homework and Meals. Every week on Sunday, it pulls each of our schedules, extracurriculars, extracts my kids calendar from their teachers handouts, overlays their important events for the week and generates a week view. I also work with my kids on their homework. I generate homework exercises as per the school curriculum and commit to memory progress. I ask Claude to write new homework into a set of daily one-pagers I can easily print out for the kids. Lastly, I get Claude to randomly select favourite recipes from my note repo, extract all the ingredients, put it back into my note repo so I have a grocery list every Monday. It works great. But it didn’t always.

    At first, after a bunch of prompt testing I had a prompt that would run and generate what I thought was perfect. And next week, something would look off. It could be the schedule was in rows, rather than columns. I could be that their homework would have 5 questions instead of 15. It could be that the recipe cards I ask it to create no longer fit on one line.

    My point to all of this is to say that as a coder, I was treating LLMs like code, but they’re kind of like people. Being non-deterministic, every time they do something they may do it slightly differently. That is unless you force some structure.

    And so I learned, if you need something the same every single time, and often for output, that is what we want… you have to use a combination of prompting and coding knowledge to force the LLM into a structure. This could be by forcing it to deploy a script (which is deterministic) or having a template with built-in sentinels. Frankly these quirks feel odd but thinking about how we are coding using just our natural language alone, the trade-off seems fair.

  • Claude Routines

    Since Claude released routines, I’ve had a blast finding ways to automate my code. I have a laundry list of personal projects in my GitHub repository that I work on with Claude, daily. I regularly have 5+ Claude code sessions coding away on random thoughts and prototypes all at once. So when routines released, I wondered how different it would be over scheduled tasks for cowork. The beauty for me is in the tool calls.

    I use Claude Cowork to run a daily financial market analysis and refine a daily thesis that’s ready in the morning. Scheduled runs prompting at a given time and day are a part of cowork that I love. Cronjob with more intelligence.

    Claude routines can use my GitHub integration to run everything in the cloud without my machine being on. It pulls my project code, uses a managed instance to run everything and pushes it back to Git after it’s done. It operates like a coding partner for me at night.

    Just like that, I now have a bunch of Claude Code routines that trigger nightly, review my project codebase for issues, propose enhancements. Every morning I end up with a list of bug fixes and proposed enhancements. I recently changed my routine to straight up pick an enhancement to work on so when I get up the feature is ready for PR. Claude documents the change, performs security and dependency reviews, and summarizes the change every night. My projects are slowly building themselves as I direct projects rather than code them.

  • thrift and tokens (printing press)

    Everyone knows token economy is 💰

    I came across this new abstraction layer for integrating external tooling into LLMs, called printingpress.dev.

    From an API spec, from a website with no public API, from a beloved community fan project – one command prints a token-efficient Go CLI, a Claude Code skill, an OpenClaw skill, and an MCP server. Peter Steinberger showed the way with discrawl and gogcli: a local SQLite mirror beats a remote API call, compound commands beat ten round trips, and an agent-native CLI beats raw HTTP. The press bakes that playbook into every binary it prints. Muscle memory for agents.

    It uses custom compiled CLI saving valuable token exchange commonly seen with MCP, connectors etc. MCP traffic is heavy. Part of why my exploration into local LLMs stopped was that I realized how much context an MCP exchange takes.

    Similar to how exa, tavily MCPs clean up garbage from the web to provide LLM clean search, printingpress goes a step further and forgoes the whole MCP exchange for a CLI interface that runs locally and does all the dirty work more efficiently, saving your tokens.

    The beauty of it is there is also a prompt kit that helps generate brand new CLIs from any service. So point it at a service and watch it go.

  • Faster! Faster!

    Having played with LLMs for a few years now I’ve had various stages of appreciation for its efficiency.

    1. Tell me some jokes..
    2. You coded a debugging nightmare.
    3. Hey this is kinda neat.
    4. Think for me. I’m too lazy to look it up.
    5. Spawn five of yourself and wire it up.

    It happened so fast. For me, tool use has been the most eye opening. To see Claude computer use, review functionality, that it just implemented, by itself, by literally clicking around the iOS app it just built, is astounding.

    I dove into an article on Sherwood News, Test time. It made me think about hiring the best people for tomorrow. Imagine you are looking at candidates. How can one justify hiring someone who has no experience with the potential of LLMs?

    Instead of simply talking through strategy, some CMOs, investors, and operators are now being asked to use AI tools live — or during a tight take-home window — to create something in front of interviewers. A number of other firms do the same, while Nicole DeTommaso, a principal at Harlem Capital, says that anecdotally, she’s seen practically every potential candidate looking to join a venture capital firm being asked to show their prowess with AI coding tools.

    DeTommaso wrote that one candidate was asked to build an AI agent that could produce automated research about industries within a working week that could reliably brief partners on a sector before they invested. Another needed to use the likes of Claude Code and Codex to vibe code a dashboard to show information about portfolio companies.

    “You are not told which tools to use or how to go about it. You are just expected to figure it out,” she wrote. “And increasingly, what you can actually show in an interview matters more than what’s on your resume.”

    At an individual contributor level it seems risky to hire someone who would be doing things “the old way”. It’s like signing a flat footed defence man in the world of Cale Makars. Speed is the game now. And at a leader level, Arguably it applies too where the best managers should excel at delegating to LLMs. It’s easier than ever to test and prototype. At a fraction of the cost before AI.

  • Gemma4 MTP

    Google released Gemma4 MTP which incorporates a new feature, speculative decoding. Another lightweight model does token prediction speeding up the work for the larger model making the token speed up to 2-3x.

    I saw an cute ELI5:

    Imagine two bears, a big slow bear and a little nimble bear looking for berries. The little bear runs off first and finds a bunch of berry trees and yells for the big bear. Big bear comes and decides which berry tree is most delicious and makes the final call to grab it.

    Unfortunately for me, my system still cant run it.