What Mozilla's Claude Mythos work teaches about AI security reviews

Mozilla published a technical behind-the-scenes post this week that deserves the attention it is getting. The Firefox team used Claude Mythos Preview and other AI models to help find and fix an unusually large number of security bugs. On X, the headline quickly became: AI found hundreds of vulnerabilities in Firefox.

That headline is impressive. For businesses, it is not the most important part.

The more useful part is in Mozilla’s own description of the work: the result did not come from pasting a large codebase into a model and asking for vulnerabilities. Mozilla built a system around the model. A harness. Reproducible test cases. Integration with existing fuzzing infrastructure. Deduplication. Triage. Bug tracking. Release handling. Humans still had to assess findings, write patches, review fixes and make risk decisions.

That is the lesson for founders, CTOs and product owners.

AI-assisted security review is becoming very real. But the difference between a useful defensive capability and a noisy stream of plausible reports is the architecture around the model.

The model is not the production system

Many companies still treat AI as a standalone tool. A developer opens a chat window, pastes in code and asks, “Can you spot a problem here?” Sometimes that helps. It is not a dependable security process.

A security process needs different properties:

a clear scope for what is being inspected;
context about how the system actually behaves;
a way to execute hypotheses;
reproducible test cases;
prioritisation by impact;
traceability for later review;
integration with the engineering team’s normal workflow.

Mozilla’s post describes that shift. The model was an important primitive. It became useful through an agentic harness that could turn bug hypotheses into testable evidence. That is a very different level of maturity from “AI as a reviewer in a chat box.”

Most product teams do not need to copy Mozilla’s infrastructure. Firefox is an extremely complex, security-critical project. A typical SaaS product, internal platform or operational tool has a different risk profile.

But the principle transfers directly: if AI is going to do security work, it needs to be placed inside a verifiable loop.

Plausible bug reports can be expensive

The uncomfortable part of AI-assisted security review is not only that models can miss issues. It is also that they can produce very convincing false reports.

That is expensive for an engineering team. A bad bug report often costs more than no bug report. Someone has to read it, understand it, try to reproduce it, disprove it and document the decision. If too many false reports arrive, the team loses trust in the whole system.

This matters especially for small and mid-sized companies. They rarely have a large security team that can triage dozens of reports every week. If an AI tool suddenly produces a long list of alleged vulnerabilities, it can look like progress while quietly consuming the attention of the best engineers.

So the most important question is not:

“Which model finds the most bugs?”

The better question is:

“How do we ensure that a reported bug is reproducible, relevant and worth prioritising?”

That is an architecture question, not just a tooling question.

A good harness forces the AI to produce evidence

A harness is the controlled environment in which the model is allowed to work. It gives the model tools, limits and feedback.

For security review, that can mean:

access to specific parts of the repository rather than everything;
build and test commands that actually run;
sanitizers, fuzzers or static analysis tools;
templates for useful bug reports;
automated checks for reproducibility;
rules for when a finding should be discarded;
links to tickets, pull requests and release processes.

The central point is simple: the model should not merely claim that a problem exists. It should help produce evidence that the team can trust.
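
One way to make "evidence, not claims" concrete is to gate every finding on a reproducer that the harness runs itself. The following is a minimal sketch, not Mozilla's implementation. The Finding record, the is_reproducible function and the convention that a non-zero exit code means "bug demonstrated" are all assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Finding:
    """A model-reported issue plus the command that should demonstrate it."""
    title: str
    # Hypothetical convention: the command exits non-zero while the bug exists.
    repro_command: list[str]


def is_reproducible(finding: Finding, runs: int = 3, timeout_s: int = 60) -> bool:
    """Accept a finding only if its reproducer fails on every one of `runs` runs.

    A reproducer that passes, times out or cannot be started counts as
    "no evidence", so the finding is rejected.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(runs):
        try:
            result = subprocess.run(
                finding.repro_command,
                capture_output=True,
                timeout=timeout_s,
            )
        except (subprocess.TimeoutExpired, OSError):
            return False
        if result.returncode == 0:
            return False
    return True


# Illustrative reproducers: one always fails, one always passes.
demonstrated = Finding("Always-failing reproducer", [sys.executable, "-c", "raise SystemExit(1)"])
unproven = Finding("Always-passing reproducer", [sys.executable, "-c", "raise SystemExit(0)"])
```

Here is_reproducible(demonstrated) is true and is_reproducible(unproven) is false. Requiring several consecutive failures is a deliberate choice: a flaky reproducer is treated as a finding that still needs work, not as evidence.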

For a web product, a first harness can be much smaller than Mozilla’s. For example:

an agent reviews only authentication, roles and data access;
it may add tests, but cannot directly change production code;
every accepted finding needs a reproducible test or a clear scenario;
a human decides whether it becomes a ticket;
accepted findings go into a prioritised security backlog.
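
The first two rules above can be enforced in code rather than in a prompt. The sketch below assumes a hypothetical repository layout (src/auth, src/permissions, src/data_access, tests/security); the directory names and the may_read and may_write helpers are placeholders to adapt, not part of any existing tool.

```python
from pathlib import PurePosixPath

# Hypothetical layout: replace with the real directories of the product.
REVIEW_SCOPE = ("src/auth", "src/permissions", "src/data_access")
WRITABLE = ("tests/security",)


def _within(path: str, roots: tuple[str, ...]) -> bool:
    """Return True if the relative path lies inside one of the root directories."""
    parts = PurePosixPath(path).parts
    # Reject absolute paths and any attempt to climb out with "..".
    if not parts or parts[0] == "/" or ".." in parts:
        return False
    for root in roots:
        root_parts = PurePosixPath(root).parts
        if parts[: len(root_parts)] == root_parts:
            return True
    return False


def may_read(path: str) -> bool:
    """The agent may read the reviewed areas and its own test directory."""
    return _within(path, REVIEW_SCOPE + WRITABLE)


def may_write(path: str) -> bool:
    """The agent may add or change tests, never production code."""
    return _within(path, WRITABLE)
```

With this policy, may_write("tests/security/test_roles.py") is allowed while may_write("src/auth/session.py") is refused. Comparing whole path components rather than string prefixes means a directory such as src/authx does not slip into scope by accident.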

This is not science fiction. It is a realistic extension of existing engineering practice.

Where companies should start

The worst starting point is: “Let’s have AI audit the whole system.”

That sounds efficient, but it usually creates noise. A better starting point is narrow and commercially relevant. Security work is most valuable when it targets the places where real damage could happen.

Good starting areas include:

authentication and session handling;
roles and permissions;
tenant isolation in B2B SaaS products;
payment, invoicing and contract logic;
admin functionality;
file uploads and document processing;
integrations with CRM, ERP, accounting or support systems;
data exports and reporting;
webhooks and API access.

For many German and European companies, tenant isolation is particularly important. A bug that exposes one customer’s data to another customer is not just a technical issue. It is a trust problem, a compliance problem and often a business risk.

AI can help here, but only if it understands the domain. A generic scan for “security issues” is rarely enough. The model needs to know which boundaries matter in the product: Customer A must never see Customer B. A support user may read but not bill. An external partner may import data but must not manage users.

Those are product and architecture rules. Without them, an AI system can only search generically.
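
Rules like these are easier for both humans and an AI agent to check once they exist as data and not only as tribal knowledge. A minimal sketch, assuming hypothetical role names and actions taken from the examples above, and assuming every actor is bound to exactly one tenant:

```python
from dataclasses import dataclass

# Hypothetical roles and actions, mirroring the examples in the text.
ALLOWED_ACTIONS = {
    "customer_admin": {"read", "bill", "import_data", "manage_users"},
    "support": {"read"},
    "external_partner": {"import_data"},
}


@dataclass(frozen=True)
class Actor:
    role: str
    tenant_id: str


def is_allowed(actor: Actor, action: str, resource_tenant_id: str) -> bool:
    """Check the tenant boundary first, then the role's allowed actions."""
    if actor.tenant_id != resource_tenant_id:
        return False  # Customer A must never see Customer B.
    # Unknown roles get an empty set, so they are denied by default.
    return action in ALLOWED_ACTIONS.get(actor.role, set())
```

For example, is_allowed(Actor("support", "tenant-a"), "bill", "tenant-a") is false, and any request against "tenant-b" from a "tenant-a" actor is false regardless of role. An agent that is given this table can generate tests for each boundary and flag code paths that bypass the check.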

Security is not only code

Another important point from the Mozilla example: security work does not end when a bug is found.

A finding has to enter a process:

Is it real?
How can it be reproduced?
Which users or data could be affected?
Are there similar known issues?
Does it require an immediate patch?
Which tests prevent regression?
How will the fix be reviewed?
When will it be released?
Does anyone need to be notified?

Many AI demos ignore this because it is less spectacular. For a real business, it is the core of the work.

An AI agent that finds ten plausible security issues is only valuable if the team can turn those findings into better decisions. Otherwise it creates another backlog that nobody owns.

That is why any AI security workflow should connect to operational reality. Who owns the finding? What SLA applies to critical issues? What role does product management play? What role do legal or data protection concerns play? What happens if the finding affects a customer promise?

At McDougall Digital, we do not treat these questions as bureaucracy. They are the bridge between technical capability and business reliability.

AI makes architecture more important

It is tempting to think that if AI gets better at finding bugs, teams need less architectural discipline. The opposite is more likely.

The stronger AI systems become, the more they benefit from clear boundaries, good tests and understandable systems. A well-structured product is easier to inspect. A repository with clear modules, explicit permissions and meaningful tests gives an AI agent better ground for useful work.

A chaotic system produces chaotic output. The model may find real issues, but the team struggles to classify them. Is this a bug or intended behaviour? Is this dependency critical? Is this role allowed to perform the action? Is this code still used?

AI does not make architecture problems disappear. It shines a brighter light on them.

That is good news for serious companies. Many security investments pay twice:

better module boundaries help humans and AI;
better tests speed up delivery and review;
better role and permission models reduce risk and support load;
better logging and audit structures help debugging, compliance and incident response;
better release processes make changes faster and safer.

AI security should not be a separate experiment beside the product. It should become part of normal product and architecture work.

What not to learn from Mozilla

The wrong conclusion is: “We need Claude Mythos and we need to scan everything.”

Most companies need something simpler first:

an overview of their highest-risk areas;
clear system boundaries;
a reliable test setup;
a prioritised security roadmap;
reproducible review processes;
clean technical ownership.

Only then does automation become genuinely useful.

Another mistake is treating AI findings as automatically correct. Even very strong models produce convincing false reports, and their output is only as useful as the system that checks it. Mozilla could use the results because it had an organisation capable of evaluating them. Humans remained accountable for fixes, reviews and releases.

That is the right standard for smaller teams as well. AI can search faster. It can suggest. It can write tests. But security decisions need ownership.

A pragmatic roadmap

For a SaaS product or internal platform, the first step does not need to be large. A sensible starting workflow can look like this:

Choose critical flows: login, roles, tenant isolation, payment logic, admin actions.
Map existing tests and obvious gaps.
Document rules and non-goals.
Build a small AI-assisted review loop.
Accept a finding only when it is reproducible.
Convert findings into normal tickets.
Protect fixes with tests.
Review the process after a few weeks.

The important thing is not to automate everything immediately. The important thing is to close the loop: hypothesis, test, triage, fix, review, release.
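
The loop can be written down as a small state machine so that no finding skips a stage. This is a sketch under assumptions: the stage names come from the sentence above, while the extra DISCARDED stage, the allowed transitions and the advance function are illustrative choices, not a prescribed process.

```python
from enum import Enum


class Stage(Enum):
    HYPOTHESIS = "hypothesis"
    TEST = "test"
    TRIAGE = "triage"
    FIX = "fix"
    REVIEW = "review"
    RELEASE = "release"
    DISCARDED = "discarded"  # assumption: findings can be dropped before a fix starts


# Allowed transitions. A review may send a fix back for rework.
NEXT_STAGES = {
    Stage.HYPOTHESIS: {Stage.TEST, Stage.DISCARDED},
    Stage.TEST: {Stage.TRIAGE, Stage.DISCARDED},
    Stage.TRIAGE: {Stage.FIX, Stage.DISCARDED},
    Stage.FIX: {Stage.REVIEW},
    Stage.REVIEW: {Stage.RELEASE, Stage.FIX},
    Stage.RELEASE: set(),
    Stage.DISCARDED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move a finding to the target stage, or raise if the step is not allowed."""
    if target not in NEXT_STAGES[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

A call such as advance(Stage.HYPOTHESIS, Stage.FIX) raises a ValueError, because a hypothesis has to pass through test and triage before anyone spends time on a patch.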

Once that loop works, it can be expanded. More areas of the repository. More types of checks. More CI integration. Better reports. More automation.

Without that loop, AI security is just another stream of suggestions.

What this means for serious products

The Mozilla story is a strong signal: AI is going to change security work. Not someday. Now.

But the advantage will not come from trying the newest model. It will come from organising the software so that AI can be used in a way that is useful, verifiable and accountable.

For founders and product owners, this is a strategic question. If a product handles customer data, models contracts, triggers payments or runs internal business operations, security is not a technical nice-to-have. It is part of product quality.

For CTOs, it is an architecture question. Which parts of the system are critical? Which boundaries must never be crossed? Which tests are missing? Which processes break if the number of incoming security findings suddenly increases fivefold?

For teams in the German market, there is another dimension: clients expect reliability. In the Mittelstand, trust often matters more than a shiny demo. An AI-assisted security process can strengthen that trust if it is traceable. If it looks like tool hype, it weakens it.

How McDougall Digital can help

McDougall Digital helps teams treat AI as a serious part of software delivery rather than a novelty bolted onto existing processes.

That can start with an architecture and security review: Where are the highest product and data risks? Which tests and controls are missing? Which areas are suitable for AI-assisted analysis? Which decisions should deliberately remain human-owned?

From there, the work can become practical: a small harness for concrete risk areas, better tests, clearer role and permission models, CI checks, a security backlog, release rules and documentation that clients and stakeholders can understand.

The lesson from Mozilla is not that every company needs its own research lab. The lesson is that AI becomes valuable when it is embedded in a good system.

That is where serious products should start now.