AI Feature Diagnostic Audit

Your AI product shipped. The pipeline works, but in production it behaves in ways no one designed for: it confuses users, fails silently, or breaks trust. The problem is rarely the model instructions.

A common response is to iterate on the prompts or switch models. But the real blockers are usually elsewhere: the AI shows up at the wrong moment, users can't tell when to trust the output, fallback paths are missing, or there's no measurement of whether it's actually working. Meanwhile, users form the habit of working around it.

Book Free Intro Call Email me directly

Who this is for

Product teams who have shipped an AI product and are watching it behave in production in ways no one designed for. The team has either ML/backend or product/UX, but no one owns AI behaviour end-to-end. The AI is user-facing in some way: external customers, internal users, or both. Common fits include B2B SaaS adding AI to existing products, AI-native startups, ops tooling, and internal knowledge platforms.

Less of a fit if you haven't shipped yet, have less than a few weeks of production usage to look at, or your AI is purely backend automation with no user-facing surface.

What gets assessed

A two-week diagnostic across five dimensions using the AI Feature Maturity Ladder.

Product-Journey Fit

Does the feature show up at the right moment, for the right user, solving a clear need? Or does it require users to go looking for it?
UX & Trust

Can users understand, trust, and act on the output? Do they know when to rely on it, and what to do when it's wrong?
Output Quality

Are outputs accurate and consistent, even when inputs are messy or unexpected? How does it handle edge cases?
Measurement & Feedback

Is quality tracked with real data, or based on assumptions that haven't been tested? Does the feature actually improve over time?
Ops & Ownership

Is there a clear owner and improvement process? Does the feature get proactive attention, or only reactive fixes?

A feature's true maturity is constrained by its weakest dimension. The audit finds that dimension and tells you what to do about it. For context on the kind of features I've built and measured against this framework, see the RAG Support Assistant build.

Assessment is grounded in expert heuristic review, hands-on use of the feature, and any user-side data you share (analytics, support patterns, complaints). Primary user research is out of scope.

How it works

Week 1: Evaluation

Kickoff call to align on scope
Product walkthroughs as your users experience them
Heuristic evaluation against Nielsen's usability heuristics and Microsoft's Human-AI Interaction Guidelines
Prompt and pipeline architecture review
Output quality assessment against Husain and Shankar's eval-driven development principles
Mid-week async check-in with preliminary themes

Week 2: Synthesis & Report

Maturity scoring across all five dimensions
Detailed findings grouped by theme
Prioritized advancement roadmap
Report written so a VP of Product or CTO can read the executive summary on its own

Delivery: Walkthrough

One-hour session covering key findings, the roadmap, and your questions

What you get

A maturity level assessment across all five dimensions, with evidence for each score
Identification of the weakest dimension: where the feature will break first, even when other dimensions look fine
A prioritized advancement roadmap: what to fix first, in what order, with effort estimates
A written findings report you can share with your team, investors, or board
A one-hour walkthrough session to discuss findings and answer questions

What you'll need to provide

Required:

A product login for the user role this feature is built for
A guided walkthrough of how the feature is intended to work
Async access during the audit to someone familiar with the AI feature (for follow-up questions)
Signed mutual NDA and Data Processing Agreement before any systems access (templates provided)

Helpful, not required:

Known failures or edge cases you've flagged (otherwise I work from what I encounter)
Read access to feature analytics (otherwise findings on Measurement & Feedback rely on team Q&A and walkthroughs)
Read access to the prompt repository (otherwise reviewed through team Q&A)
Architecture documentation (otherwise the surrounding pipeline is inferred from direct testing and team Q&A)
Existing evaluation data or production log samples (otherwise findings rely on direct testing and team Q&A)

Default scoping is non-PII. If your feature touches user data, we agree on what's in scope and what's redacted before kickoff.

Pricing

€3,000-5,000 for the two-week diagnostic, scoped at the intro call based on product complexity and access depth.

I run one audit at a time. Typical kickoff is one to two weeks after the intro call.

What happens after

The audit gives you a prioritized roadmap. Three ways to take it forward:

Your team takes it from here. The report is written so your engineers can act on it directly. Findings are specific enough to turn into tickets without translation.
I advise while your team builds. Part-time support to review implementations, answer questions, and keep the roadmap on track. Scoped and priced separately.
I implement it. I embed with your team and execute the roadmap directly. Scoped and priced separately based on findings.

Who runs the audit

Portrait of Alfred Persson

I'm Alfred Persson, a freelance AI engineer. I studied interaction technology and design at Umeå University (software engineering combined with UX and usability), then spent five years building production distributed systems where reliability under unpredictable conditions wasn't optional. The AI focus came in the last stretch: building demos and RAG pipelines at ChromaWay on the platform's built-in LLM inference and vector database support, then Datalumina's six-week production AI program.

AI engineering hits the same core problem in a new domain: building systems that fail gracefully when you can't predict the inputs. I apply that combination of production engineering and user-side thinking about trust, uncertainty, and workflow fit to AI products end-to-end. The maturity ladder this audit uses is my synthesis of CMMI, Nielsen's heuristics, Microsoft's HAX guidelines, Google's PAIR work, and Husain/Shankar on eval-driven development. More about me.

How we work together

Fully remote from Mauritius, on EU working hours. GDPR data handling is covered by a standard Data Processing Agreement and EU Standard Contractual Clauses, with a Transfer Impact Assessment available on request. Findings stay confidential by default, and case-study publication requires your written approval.

Let's talk about your AI product

Book a free 30-minute intro call to discuss your AI product, where behaviour is breaking down, and whether the audit fits. Helpful to bring behavioural signals if you have them: usage metrics, user complaints, support patterns, or known failure modes. Not required.

Book Free Intro Call Email me directly