Oliver Ng. Ai experiments.

Cowork Mobile

July 17, 2026

Oliver
Cowork can now be steered and accessed on mobile. Previously, cowork ran on a local OS container, with no remote access. That said, Anthropic has done a terrible job explaining how this all works.

Here’s a top ten list of my experience.

Any projects that you were using for cowork will not migrate over to cloud if they already access local files on disk. There is no obvious migration tool to make it work via cloud.

I found the best way to migrate was to create a new project in cowork, and then add the same local folder.

You’ll see this in your cowork projects list. There is a drop down for configuring the project to run in the cloud and it is disabled for existing projects involving folder access.

Creating a new project gives Claude the ability to re-establish the permissions it needs to access a local folder + run in the new cloud format.

If you are working on a local cowork project, your conversations will not sync over to mobile. It is the most confusing part of this whole roll-out. Some conversations sync, others do not.

The easiest way to know if your cowork is working in the cloud is the little icon on the top left. Same concept as Claude Code. If you see a cloud, Claude is working from a cloud container. If you see a computer, Claude is local to your desktop.

The benefit of working in cowork mobile is that you can leverage your MCP connectors and local folders you’ve approved, and start a new task away from your computer. This is different from Claude Chat because while chat has some connectors, it is not MCP level and cannot work with files.

Claude cowork cloud is very slow. Like, sssssssllllllooooooowww. I had cowork local tasks running snappy and when I moved them cloud-native, it got rough. Part of this is that managing files across MCP connectors is brutal. If you are familiar with using local CLI based connectors like google workspace-CLI where Claude Cowork could directly call a CLI, you can’t do that in a cloud container. You are reliant on the cloud container having access to the MCP connectors you’ve pre-configured. And that means MCP inefficiencies.

Part of the benefit of running cowork was direct file access and manipulation. File access was easy because Cowork would just pull the file into the mount path of its local container. In this remote cloud paradigm, that is no longer as easy. I did a test to move my folder into Google Drive, thinking it could replicate my local folder structure on disk. I immediately ran into problems.

It was slow. I know I said this before, but man, it is so bad.

MCP connectors are limited when it comes to write. Google Drive, for example, lets you create files but not update them. You have to re-upload each time.

Lack of MCP connector exposure to easily copy a file. The cloud container had to direct stream a file byte-by-byte over MCP, with the model as the translator. 500K tokens for a 500K file. Brutal.

In the end, I reverted back to local only for these projects because running everything through a kneecapped MCP was a terrible experience.

I think MCPs are the weakest link here. I wonder if we could get faster tooling on the cloud container, like the ability to have the googleworkspace-CLI installed and built into the container to improve speeds and capabilities rather than forcing us to go over MCP.

The hybrid cloud/desktop approach seems to be the goal-post here. Steer from mobile but access data from local disk via Claude Desktop. Occasionally start tasks from mobile where the ask is not heavy or time-bound. This requires your desktop to be open but that was always the case with Cowork scheduled tasks.

It is incredibly unclear what I can and cannot do at any moment. It is not clear what folders I have access to from my session, how I start a session to give myself folder access, or how to access folder-enabled sessions from my mobile. All things that are being advertised.

On multiple occasions, the Cloud container just stopped responding. Like out of 30 tasks, around 5 just stopped where I was wondering where that conversation went, only to revisit it and notice it says “Starting .. 1s, 2s” all over again. This is after being prompted over 10m ago. Chalk it up to being in BETA.

It all feels like a worse Claude Code remote-control. I can kind of feel the friction the product owners were likely working through when they were building this out going, well what about this problem scenario, or this one, or that one.

For example, if I start a cowork session from the mobile client, even if I’m back on my desktop, I cannot then add folders or files to the discussion. I have to start a brand new chat from my desktop. Friction.

All said, it’s an improvement from where we were before. I can access Cowork from my phone, whereas yesterday, I had to go home and unlock my desktop. So that’s a win. And I expect they will iterate from here given that cowork is a very popular product. Unlike Dispatch? Can’t remember the last time that got some love since it was announced.

A day of Claude workflows

July 14, 2026

Oliver

I tried to come up with a scenario to try out Claude Dynamic Workflows, a deterministic way to manage processes for subagents. Claude recommends you use them for large scale activities that benefit from multiple agents. I wanted to understand how it is different from Claude Agent Teams, which, it is easy to forget, also exists.

One noticeable difference is that agent teams share context within the group. Workflow agents do not, and act independently on the prompt. Workflows are also programmatic. Their sequence of actions is managed through Python code + prompts. This allows you to deterministically define the workflow and its results.

I thought it would be fun to take advantage of this knowledge and use it as a way for AI to reason about security risk.

A common argument in Cybersecurity, is risk. You have a vulnerability, we have to quantify the risk. Not always trivial. Fights ensue. People say that it will never happen. They say the architecture does not allow it. They say controls exist. Everyone comes with their independent views. I took these horrible memories of my career, and built an agent workflow that could debate risk – in hopes I would never have to live it again.

Kidding aside, it was a fun experiment. LLMs are great at grounding themselves into a specific viewpoint, taking context, and providing a result. So I made a roundtable of tech personalities that can repeatedly play out my worst nightmares.

The Panel

The roles I chose were a Security Analyst, a Business Owner, and an Architecture Lead. These three signify typical friction I’ve seen where each debates from their own area of accountability.

Security Analyst: Argues for stronger controls and higher severity where evidence supports it. Accountable for what happens if the risk is realized. They are kept in check by a requirement to be defensible and grounded, rather than simply arguing that all risk is critical risk.

Business / Product Owner: Optimizes for time-to-market, user friction, and opportunity cost. Argues for acceptance, deferrals, or compensating controls. Accountable for the cost of not shipping, including security costs of users staying on worse legacy systems.

Architecture: What can be built and run, and what each control costs in complexity, blast radius, and operational burden. Often the tie-breaker — surfaces that a control Security wants may be infeasible as stated but achievable in another form.

I thought it would be interesting to have additional roles that clean up the debate and tie-break.

Chair: Summarize key debate items. Takes no side. Translates the risk debate items from technical language into common human language. This Chair will help summarize the “agent talk” so that I can understand what’s being said and debated.

Skeptic: Acts as “that guy” who only has negative things to say. Will debate the specific controls, proposals, and risks to find counter-points against the role proposing the item. It forces a sanity check and a re-think of ideas. Protects against LLMs “making stuff up and confirming their own behaviour”. Adds rigor to analysis.

The workflow

When running the workflow, the idea is to provide a risk statement to the panel. As with all LLMs, the more context, the better. I provide a scenario, describing the system, then provide a security finding and recommendation. It’s like an everyday scenario, where a security guy walks into a room with the product team and goes over findings. I gave it something like below (shortened).

Scenario: a cloud-native, multi-tenanted SaaS product generates customer reports with regulated PII and stores these reports in a cloud storage container.

Risk Question: A control requiring tenant-wide customer-managed keys (CMK) is not in place. Should we block the product team from going live with platform-managed keys, or risk-accept it?

I like this question because I find the use of CMK contentious. It is one of those items where so many people feel it is over-engineering security, given that we are encrypting encrypted data. But everyone typically agrees that regulatory bodies will not accept that a “platform” holds your encryption key and is a party to the regulated data.

The workflow ran and the outcome of the Agent debate was entertaining.


Security. Block.

Rationale: One key compromise = full tenant exposure in a regulated industry. At least block until the product team commits to an implementation date.

Trade-off: Understands CMK is ops work, but won't back down due to regulated data.

Business. Accept.

Rationale: Revenue is blocking on go-live. Feels this is a LOW risk as a platform-managed key still encrypts data at rest. Referenced compensating controls: the customer having a contract in place, IP allow-listing, per-customer tokens and 24h data purge.

Trade-off: Delayed revenue for a quarter, acknowledged case relies on allow-listing to be implemented per customer as contract states.

Architecture. Partial. Tie-break.

Rationale: Agrees with security on the risk, but also agrees with product that tenant-wide CMK is not feasible in the time period. An alternative of store-specific CMK is doable.

Trade-off: PMK acceptance today will be costly later down the line with full CMK.

I love this.

It is like playing The Sims, watching agents argue with each other about my typical security life. Except I have zero stakes in the discussion.

Seeing the arguments reflected in appropriate standpoints shows me how I can use LLMs to help iterate on decisions and be more open minded in my analysis. There’s more and this is where the role of the skeptic and chair come in.

Skeptic. Refute the lowering of risk.

Rationale: States that there is no proof that the allow-list is in place. The product agent's claim that the risk is lowered by a compensating control is unproven. The panel needs to demonstrate the allow-list is in place with configuration evidence.

Chair.

Surfaced: the allow-list is a contentious issue and will change the risk between a medium vs high.  It should be confirmed.

Proposal: Recommend CMK on the storage today, with full CMK later. Architecture's proposal is a good middle ground.

I like how the skeptic played its role in calling out a common scenario where teams will say “Well, we have this in place.” and then security says “Show me. I didn’t see that in my data.” And the chair also played its role in not taking a stance, but giving that “management view” of analyzing the whole debate and saying, well this makes sense, this is the pragmatic recommendation.

Below is the table the workflow generated as a summary. A nice touch was it generated a new security requirement too:

Summary	Do not accept PMK as-is. Take Architecture’s middle path: enable CMK on the export store now, and defer only per-tenant key isolation. The risk is High today because the likelihood-reducing controls Business proposed are unverified.
Rating	High – likelihood Medium x impact High. Would drop to Medium only if the IP allow-list enforcement is confirmed.
Recommended decision	Mitigate by compensating control (CMK-on-store + SAS + 24h purge) now; defer per-tenant key isolation to next quarter.
REQ-NEW	Export store must enforce IP allow-listing.

I should add that I added a risk rating rubric to ensure that risks were following a standard and deterministic method. The Agent Workflow is programmed to use this rubric when evaluating risk across all sub-agents.
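A rubric like this can be as small as a lookup table. The sketch below is not the rubric from my workflow; the matrix values are assumptions, picked only so that they agree with the two ratings in the summary table (Medium likelihood x High impact = High, and a drop to Medium if confirming the allow-list pushes likelihood down to Low, which is also an assumption on my part).

```python
# Hypothetical 3x3 risk rubric. Every cell is an illustrative assumption,
# except that Medium x High -> High matches the rating reported in the post.
LEVELS = ("Low", "Medium", "High")

# Outer key is likelihood, inner key is impact.
RISK_MATRIX = {
    "Low":    {"Low": "Low",    "Medium": "Low",    "High": "Medium"},
    "Medium": {"Low": "Low",    "Medium": "Medium", "High": "High"},
    "High":   {"Low": "Medium", "Medium": "High",   "High": "High"},
}


def rate_risk(likelihood: str, impact: str) -> str:
    """Return the rubric rating for a likelihood/impact pair."""
    if likelihood not in LEVELS or impact not in LEVELS:
        raise ValueError(f"levels must be one of {LEVELS}")
    return RISK_MATRIX[likelihood][impact]


# Allow-list unverified: likelihood stays Medium.
print(rate_risk("Medium", "High"))
# Allow-list confirmed (assumed to lower likelihood to Low).
print(rate_risk("Low", "High"))
```

Because every sub-agent looks up the same table, two agents that agree on likelihood and impact cannot disagree on the rating, which is the deterministic part.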

Closing Thoughts

Long post. I really enjoyed this experiment. There is a lot more that can be done here too. We could have additional agents that go out and search for real-world examples of security issues being debated, CVEs, more personas, etc.

Finally, I should add that while this was an experiment for Claude Workflows, Agent Teams can also execute the same task. In fact, my workflow allows for both a “Quick” mode using Agent Teams and a “Full” mode using Workflows.

Agent teams will use:

shared context
parallel agents
repeatable on a best-effort basis
recommended output formats
cheaper

Workflows will use:

separate agent contexts
structural independence
deterministic, auditable, repeatable
dedicated agent pass by the skeptic
outputs enforced by schema
expensive, as it will spawn multiple agents + overhead orchestration agents
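To make the Workflows list above concrete, here is a rough sketch of those properties in plain Python. It does not use the Claude Workflows API. `stub_agent`, `Position`, and `run_panel` are hypothetical names, and the canned answers are paraphrased from the debate earlier in this post. It only shows the shape: each agent sees the prompt alone, each output is validated against a schema, the skeptic gets its own pass, and every step lands in a journal.

```python
from dataclasses import dataclass, asdict

ALLOWED_STANCES = {"block", "accept", "partial"}


@dataclass(frozen=True)
class Position:
    """Schema every panel agent must satisfy (hypothetical)."""
    role: str
    stance: str
    rationale: str

    def __post_init__(self):
        if self.stance not in ALLOWED_STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")
        if not self.rationale.strip():
            raise ValueError("rationale must not be empty")


def stub_agent(role: str, prompt: str) -> dict:
    """Stand-in for a real sub-agent call. Returns canned answers."""
    canned = {
        "security": {"stance": "block", "rationale": "Regulated PII behind one key."},
        "business": {"stance": "accept", "rationale": "Revenue is blocked on go-live."},
        "architecture": {"stance": "partial", "rationale": "Store-specific CMK is doable now."},
    }
    return canned[role]


def run_panel(risk_question: str, roles: list, agent=stub_agent):
    journal = []    # ordered record of what ran and what it returned
    positions = []
    for role in roles:
        # Separate contexts: each agent receives only the risk question,
        # never the other agents' answers.
        raw = agent(role, risk_question)
        position = Position(role=role, **raw)  # schema check happens here
        positions.append(position)
        journal.append({"step": len(journal) + 1, "agent": role,
                        "output": asdict(position)})
    # Dedicated skeptic pass over every position after the panel has answered.
    challenges = [f"What evidence supports {p.role}'s stance '{p.stance}'?"
                  for p in positions]
    journal.append({"step": len(journal) + 1, "agent": "skeptic",
                    "output": challenges})
    return positions, challenges, journal


positions, challenges, journal = run_panel(
    "Block go-live without tenant-wide CMK?",
    ["security", "business", "architecture"],
)
for entry in journal:
    print(entry["step"], entry["agent"])
```

The order of steps lives in code rather than in a prompt, so a rerun follows the same path even if the agents word their answers differently.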

The auditable part of Claude Workflows is unique. Workflows give you a structured run journal that shows exactly what was done and when, whereas agent teams only give you a conversation.

Workflows are more costly. My experiments on Opus 4.8 have shown about 35% cost overhead, but depending on how important a deterministic, agent orchestrated run is to you, that may be acceptable.

On-device LLMs in iOS 27

June 26, 2026

Oliver
Apple posted a nice WWDC demo of how local LLMs will work on iOS 27.

I’ve been building a lot of iOS apps, replacing all the main apps with ones that are super personalized. Apps I have full control over. One of them is a news aggregator super-app. I was growing tired of opening each news site’s app, and while it’s nice to get AI generated morning summaries of news, it was inflexible compared to what a native app can do – think video, audio, article refinements.

I wanted my app to surface articles and topics that resonated with me and I knew that required a model to run calculations across topics of interest. While I landed on using NLEmbeddings to link up my interactions with my interests, embeddings are simple and vector based. I kept thinking how much more I could do with the benefits of a local LLM.
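For anyone who hasn't used embeddings, "simple and vector based" boils down to something like the sketch below. This is Python, not the NLEmbedding API my app uses, and the three-dimensional vectors are made-up toy values. Real embeddings have hundreds of dimensions, and `rank_articles` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def rank_articles(interest, articles):
    """Sort article titles by similarity to the interest vector, best first."""
    scored = [(title, cosine_similarity(interest, vec))
              for title, vec in articles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only.
interest = [0.9, 0.1, 0.0]
articles = {
    "Local LLM benchmarks": [0.8, 0.2, 0.1],
    "Celebrity gossip": [0.0, 0.1, 0.9],
    "New GPU launch": [0.6, 0.6, 0.1],
}

for title, score in rank_articles(interest, articles):
    print(f"{score:.2f}  {title}")
```

It ranks by closeness and nothing more. It cannot explain why an article matters to me or summarize it, which is the gap I keep wanting a local LLM to fill.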

As I was searching on new possibilities, this demo video from WWDC26 came up on YouTube. It is a language learning app. Point your camera at an object, get a Mandarin vocab card generated entirely on device. On device, open-source models are doing the LLM work. SAM 3 for image segmentation, Qwen 0.6B for language generation.

The video answered a lot of my curiosity for how local models would work in iOS27. I am pleasantly surprised:

No server, no API costs, no data leaving the device. Private by default, which is the straight up benefit of being local. But it’s cool that I can download an entire edge model on the device.

Ahead-of-time compilation. Pre-compile on your dev machine so that users are not waiting for model loads. If you’ve run a local model on Mac, you know the pain of waiting for a model to load into memory.

Background asset delivery. Models download only if the user opts into the feature. This is obvious given the space that models take up, but it is good knowing the SDK acknowledges it.

Same code runs on Mac with a larger model. Swap Qwen 0.6B for 8B and you get better reasoning, longer context, etc. Swapping based on system capability is another nice touch.

I’m excited to see the SDK for iOS 27 and building for it in the near term. And maybe it’s just me, but I’m blown away seeing Qwen mentioned in a live demo, on an iPhone, from Apple.

You token maxxin’ too?

June 19, 2026

Oliver
It is unsurprising to see organizations scrambling to control AI costs. I am spoiled by subsidized tokens. On Claude I select Opus 4.8, set thinking to MAX, and run my prompt … because I can. On Claude Max or GPT 5.5 Pro, the limits are so generous you feel like by not doing it, you are wasting quota. 10x FTW.

In the current token economy I do not see a rationale to downgrade my model for a prompt to save tokens. Default and go. Times are changing. AI companies are pushing organizations into usage based billing. GitHub Copilot changed its policy last month. And people were pissed. Makes sense. You started at $50-100 a month with subsidized usage. Subsidy is gone, and your bill is $2000-$5000.

Will AI companies push everyone to usage based billing? Unlikely, but it is possible that compute becomes less generous to the point that it’s noticeable. Similar to what we get with Claude Pro vs Max. To save on cost, will people want to invest in local LLMs? Microsoft is reportedly looking into DeepSeek to reduce token costs. I think what’s next is a few things.

We get better with token optimization. Select a GPT 4, when 5.5 isn’t required. Start pinching pennies on our prompts and use the dials when appropriate.

Local models for appropriate tasks. While local models that code need beefy systems, professional office work can be serviceable with models < 16B parameters. My experience is that context switching models is not intuitive.

Maybe this is solved by a system like OpenRouter where all models are accessible and can be programmatically chosen depending on the task. Because to expect me to hit a drop-down select every time I prompt or mid-context, is just not happening.

Frontier models keep getting better to the point that older models become more affordable and “fine” for everyday coding. The problem here is that compute is limited. If we look at token costs for Claude, over the past year, the cost per token has not decreased for older Sonnet or Opus versions – but between models there is a notable cost difference. But compute is compute, the cost premium is on the model and effort, so affordability is gained via product choice, not a compute matter.

Local models for everything. Unlikely with the current economics of memory. Increasing unified memory from 16 GB to 48 GB on a MBP puts a $2K laptop at $4K. This is the bare minimum needed to run a 32B parameter model. And you will likely still swap. 64 GB MBP forces you into Ultra territory at $5K. The trade-off is real for businesses, where a $5K laptop may save money under usage based billing scenarios.
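The memory math behind that last point is worth writing down. This is a back-of-envelope sketch that counts weights only: parameters times bits per weight. The 8 GB reserve for the OS, runtime and KV cache is my assumption, not a measured number, and `fits_in_ram` is a hypothetical helper.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_ram(params_billion: float, bits_per_weight: int,
                ram_gb: float, reserved_gb: float = 8.0) -> bool:
    """Rough check. reserved_gb (OS + runtime + KV cache) is an assumed figure."""
    return weight_memory_gb(params_billion, bits_per_weight) + reserved_gb <= ram_gb


for bits in (16, 8, 4):
    gb = weight_memory_gb(32, bits)
    print(f"32B at {bits}-bit: about {gb:.0f} GB of weights, "
          f"fits in 48 GB: {fits_in_ram(32, bits, 48)}")
```

How tight it gets depends heavily on quantization: the same 32B model ranges from roughly 16 GB to 64 GB of weights before anything else on the machine gets a byte.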

In the consumer world, frontier AI companies hungry for fresh data are fighting for users. The cost subsidies will continue. There is talk of a price war now that models are mostly commodities. Google kicked it off by making Gemini AI Pro half the cost while giving a lot of Google Premium Perks. It made me take a hard look at my setup. But the Claude harness is too good right now.

If I was forced into usage based billing, I would have to look at my setup and make hard decisions for how to optimize. I would likely start by switching Frontier companies. I often need to go over to GPT or Gemini when I hit my usage on Claude Max. And.. for most of what I do it is fairly interchangeable because they’ve all copied Claude’s harness which improves user choice.

Satya Nadella on Hard Fork

June 13, 2026

Oliver

Enterprise isn’t sexy. But enterprise is hard. Always felt Microsoft knows enterprises best. I think about how they destroyed Slack’s growth when creating Teams. I remember the day Slack got cocky, took out a full page NYT ad, and the Startup world laughed. 10 years later Slack, now owned by Salesforce, is an enterprise afterthought. Microsoft has made some big moves to keep relevant and I admire the choices. I remember the day they moved Office to the cloud, made everything multi-platform (not Windows!), and took a bet on Azure. These were Satya Nadella’s decisions.

When I hear Satya Nadella, I take out a pen and paper to write down how he says everything. The manner in which he speaks, builds an argument, ties it back into his role as Microsoft CEO, and makes a case for it leaves me in awe. He speaks very differently than a startup founder. I admire how he takes a hard question, goes far away with an anecdote, raises a relatable point, and draws that point back to the question.

Casey Newton: “But so much has been happening at Microsoft with AI since then. So just catch us up. For people who may not have tuned in a little while, what have you guys been up to in AI?”

Satya Nadella: “Yeah, I mean, look, the fundamental thing that I feel we’re about to move from is not talking about AI as one thing, to sort of having even a mental picture of what is an ecosystem that is sort of driven by AI, right? So today, if you think about it, even since when you first used Sydney to now, it has been about frontier model. You sort of talked about Fable, what have you.

But if we are ever going to transition to an economy that is driven by AI, it can’t be about one model, it can’t be about three firms, it has to be something that’s broadly felt, where the economy is at the frontier, not a firm or a model is at the frontier. So Microsoft is a platform company, to me that’s what we are up to. So to me, we had our developer conference last week.

It was all about, hey, can we build a platform and the tools where every enterprise in every country can operate at the frontier. To me, that’s the question. To be saying, hey, my model does this, but the economy is growing at 2 percent, means this is not going to end well, unless we really get to a place where the economy is inflecting in terms of its economic growth and its broad spread because the frontier benefits are.”

He was the latest guest on the Hard Fork podcast by Kevin Roose and Casey Newton. And this week’s topic was talking about Microsoft’s vision for AI.

Gemma4:12b experience

June 8, 2026

Oliver
I’m playing around with Local LLMs again! Using Gemma 4 12b on my underpowered Mac M4 and it is good. Much better than when I first played with Gemma 4 E4B. The tool calling is so much better and the general day to day feels crisp. I’m not doing any benchmarks, just running my life.

A lot of r/LocalLLM chatter is because it is the first model to run on laptop class 16GB RAM, the likes of MacBook Pros with unified memory. And it does! Barely. Running in oMLX cost me 11GB. Was like living in 640K memory in the DOS days. Closing everything in activity monitor. A more realistic scenario is using the GGUF 8GB which is survivable. Anyway, point is, if we were not in 2026, living memory poor, one can easily see a future where laptops shipped with local LLMs doing real work, real tool calls, all privately….with more RAM.

Exploring this further I wanted to see what was new in the 12B architecture and went down a small rabbit hole about why Google released 12B and its architecture in their developer post.

Gemma 4 12B introduces several milestones for local AI:

A multimodal encoder-free architecture: Bypassing heavy multi-stage vision and audio encoders entirely, multimodal data is fed straight into the LLM backbone, reducing multimodal latency.

Our first medium-sized model with audio input: In the Gemma family, audio inputs were restricted to small, lightweight edge architectures (e.g. E4B). Gemma 4 12B is the first medium-sized model capable of natively ingesting audio.

Gemma4 being encoder free was a new one to me. On their developer post they had this image which only confused me more.

Thankfully, they also included a link to Maarten Grootendorst’s post A Visual Guide to Gemma 4 12B which explains it eloquently. Every few months there are new optimizations and ideas on how to fit more into less.

Gemma4 12b

June 4, 2026

Oliver

With Nvidia announcing the RTX Spark to challenge Apple’s MX reign on local LLMs, it is appropriate that Google dumped some toys on us the next day.

Gemma4 12B is out today, in a sweet spot between their prior 8B and 26B. I’m super interested to see how it stands up. 8B was a great chatbot, but a terrible tool bot. 26B seems much better but I could never run it on my M4 Mac mini 16GB. Does Google see 12B as “about right” for most spec’d out systems today? It runs on 16GB! Downloading now….

Unexpectedly, Google added to LiteRT-LM a local LLM server / CLI!

The LiteRT-LM CLI provides a lightweight, zero-code tool for running language models locally. We are now expanding the tool with the serve command, letting the CLI act as a drop-in local LLM server. Use this functionality with Gemma 4 12B to point any standard tool, SDK, or framework (such as OpenClaw, Hermes, OpenCode, Pi, or popular extensions like Continue and Aider) directly to your local endpoint.

Aside from the fact that the internets are saying RTX Spark is going to cost $5000+, it does not seem far away where local models, even on edge devices (looking at you iOS27), may start to be competitive with cloud for basic stuff. Not coding, not tool use, not yet. But everyday things, like voice typing, opening apps, regular analysis, asking offline questions, could be done in the very near future all on local LLM.

4.8

May 30, 2026

Oliver

New Opus 4.8 came out. I saw the video announcement on YouTube but what wasn’t captured was this gem in the release notes.

One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations, which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

It’s nice to see a feature update focused on improving LLM alignment.

Choosing the right model

May 22, 2026

Oliver

Claude had a ton of new talks on YouTube after their big conference. I thought the below talk had informative take-aways for model selection. If you’re on a Pro subscription you’re likely stuck mostly on Sonnet, but if you’re on Max, then you have a lot more flexibility on what model you get to run full-time.

The video had some nice summary slides. Like the block below. Tell me that you knew all this already and you get a shiny reward.

It was also interesting (and slightly frustrating) to know that sometimes using a newer model, was more efficient. Meaning, just because you run the Sonnet model, doesn’t mean it optimizes the tokens and you get the best result. You could run Opus 4.7, be faster, and save tokens. The explanation made sense. The newer models are smarter, so they reason better and can use tokens more efficiently because they may skip tasks, or death spiral less frequently. But… not always. So you have to evaluate.

While this makes sense in the context of recurring tasks or prompts, what is still unclear is how one might know what model to use… prior to doing an evaluation. I find this part of the equation partly why it seems easier to just always use Opus 4.7 xhigh. How do I know what result I will get between high vs. xhigh and how my choice of this setting will impact my outcome? This part of model selection is user unfriendly. I’ve noticed that thinking is now “adaptive” rather than an option of yes/no in Opus 4.7. I imagine having the LLM decide based on prompt is likely the best path.

Antigravity 2.0

May 20, 2026

Oliver

Google just dropped Antigravity 2.0 and ripped out the VSCode fork. I had been using AG since its early preview because it had extremely generous free Gemini caps within the IDE, all with a VSCode interface. With 2.0, it seems, they just decided to be Claude Code with a Gemini wrapper.

So 2.0 = Claude Code. It has scheduled tasks, connectors, MCP, projects and a primarily chat interface. Guess Google decided Anthropic won and with Codex doing the same, they couldn’t fall behind. They put a CLI behind it, and added projects, worktrees, everything in Claude Desktop. And moved the IDE into another product.

It was inevitable because I think Anthropic proved this interface is the right interface for most people. VSCode was never going to be the UX that my wife uses. Claude Desktop, while having its own confusing UX issues, is an understandable means to connect with, and interact with, an LLM. So here we are. Microsoft is probably next with Copilot. It’s about time. Copilot is a hot mess. So much so that even Microsoft employees do not want to dogfood it.