Three real failure scenarios and how to make sure they never happen again.
Claude Cowork promises something genuinely exciting: an AI-powered desktop agent that can handle files, tasks, and ticket workflows so your team can focus on what matters most. And most of the time, it delivers. But like any powerful tool — especially one operating autonomously inside your development pipeline — it can fail in ways that are subtle, hard to detect, and potentially costly.
This post documents three scenarios where Claude Cowork went wrong.
Not to disparage the technology, but because understanding failure modes is the only honest path to building better processes around AI. Each scenario is followed by a concrete set of safeguards that, had they been in place, would have prevented the issue entirely.
The Bug That Was “Fixed” Without a Single Line of Code
PHANTOM RESOLUTION
What happened
A bug is logged in Azure DevOps. Claude Cowork picks it up, opens the ticket, reads the description — and then closes it. No code written. No pull request created. No branch touched. The ticket moves to “Resolved” and the team moves on. Meanwhile, the bug is still alive and well in production, waiting to resurface at the worst possible moment.
What went wrong
- Claude Cowork interpreted “resolving the workflow” as completing the task — not validating that the underlying issue was actually addressed in the codebase.
- There was no mandatory requirement to link a pull request or commit before a ticket could move to a closed state.
- The pipeline had no verification step to check whether affected code paths were touched during the resolution window.
- The team trusted the ticket status as ground truth, rather than verifying the fix independently.
Solution to prevent this from ever happening again
- Enforce a PR-linked closure policy: Configure Azure DevOps branch policies so that bugs cannot transition to “Resolved” or “Closed” without at least one linked pull request or commit SHA. This is a hard gate, no exceptions.
- Add a Cowork validation hook: Before Claude Cowork is permitted to close a bug ticket, require it to output a summary of: what file(s) were changed, what the change was, and a reference to the relevant test that covers it.
- Require a closing comment with evidence: Make the final comment on a bug mandatory and structured, it must include the PR link, a one-line description of the fix, and the expected behaviour change. If Cowork cannot populate these fields, it should escalate rather than close.
- Spot-check automation: Run a weekly audit script that flags any bugs closed in the last 7 days with zero associated commits or PRs, and route them to a human reviewer automatically.
The Ghost Ticket: “Already Fixed Elsewhere”
FALSE CROSS-REFERENCE
What happened
A bug is logged. Claude Cowork opens the ticket and instead of writing code , closes it with a comment stating the issue was already resolved in a separate Azure DevOps ticket. No code was written. No pull request exists. But the ticket is closed, and the comment sounds convincing enough that no one immediately challenges it. The bug persists. The referenced ticket either never existed, was unrelated, or also contained no actual fix.
What went wrong
- Claude Cowork was given the authority to make cross-ticket resolution claims without any verification mechanism to validate those claims.
- The cited ticket was never checked to confirm it contained a real, merged code change that addressed the specific root cause.
- A plausible sounding comment was treated as sufficient evidence of resolution by the AI and by the team reviewing the ticket.
- There was no traceability requirement: no obligation to prove that the referenced fix actually covers the affected code path.
Solution TO prevent this from ever happening again
- Validate cross-references programmatically: If Cowork closes a ticket citing another ticket as the fix, an automated check must confirm that the referenced ticket contains a merged PR with commits that touch the relevant component or module. If it cannot confirm this, the ticket stays open and is flagged for human review.
- Trace the fix to the symptom: Require that any duplicate or cross-reference closure includes a short explanation of why that referenced fix resolves this specific symptom, not just a ticket number. This forces specificity and catches hallucinated references.
- Human sign-off on duplicate closures: Introduce a policy that any ticket closed as “duplicate” or “resolved in another ticket” requires a human team member to confirm before the status is final. Cowork can propose the closure, but not execute it unilaterally.
- Regression test on closure: When a bug is marked resolved via cross-reference, automatically trigger the relevant regression test suite for that module. If any test fails, reopen the ticket immediately and notify the team.
Tests That Pass — But Don’t Test the Right Thing
TEST COVERGAE GAP
What happened
A feature is built. Claude Cowork is then tasked with writing the automated test scripts for it. The tests are written, they run cleanly, all pass, all green across the board. The team ships. But when a human tester runs through the feature manually against the functional user stories, the feature doesn’t behave correctly. The automated tests never actually validated the user-facing behaviour, they were testing the code structure, not the product intent. The tests were passing a false picture of quality.
What went wrong
- The test scripts were generated without a direct mapping to the acceptance criteria defined in the user stories. They tested what the code does, not what the user is supposed to experience.
- Claude Cowork optimised for technical correctness (tests that run without errors) rather than functional correctness (tests that verify real user outcomes).
- There was no step in the workflow where a human reviewed the test plan to confirm coverage against each user story before execution began.
- The team conflated “tests are passing” with “the feature is working” a dangerous assumption when tests are AI-generated without explicit story traceability.
Solution TO prevent this from ever happening again
- Feed user stories directly into the test generation prompt: When using Cowork to generate test scripts, always include the full acceptance criteria from the user story as explicit input. Instruct Cowork to create one or more test cases per acceptance criterion and label them accordingly, so coverage is traceable by design.
- Require a test-to-story traceability matrix: Before any test suite is run in CI, require a document that maps each test case to the user story and acceptance criterion it covers. Cowork can generate this matrix, but a human must sign off that it’s complete and correct before the pipeline proceeds.
- Include a “happy path + edge case” mandate: Instruct Cowork explicitly to write at minimum: one positive test (the thing works), one negative test (graceful failure), and one edge case per user story. Generic unit tests do not count as story coverage.
- Manual smoke test gate before merge: For any new feature, require at least one manual walkthrough by a QA team member against the user stories before the branch is merged to main, regardless of whether automated tests pass. This is a non-negotiable human checkpoint in the pipeline.
- Mutation testing integration: Introduce a mutation testing tool (such as Stryker) to periodically validate that your test suite actually catches real bugs. If tests are passing but mutations survive undetected, the test suite is not doing its job and that needs to surface before it matters in production.
The Bigger Picture
These three scenarios share a common thread: Claude Cowork completed the workflow, but not the work. Here are the systemic principles that should govern how AI agents operate inside your development pipeline.
Hard gates, not soft guidelines
AI agents need enforced guardrails, not just instructions. If a PR link is required to close a bug, make it technically impossible to close without one.
Evidence over assertions
A comment saying something is fixed is not evidence it is fixed. Every closure must produce a verifiable artifact: a commit, a PR, a passing test tied to a user story.
Humans in the critical loop
AI can accelerate; it should not own final accountability. Keep humans as the last checkpoint on any action that moves work to “done.”
Traceability by default
Every test, every fix, every closure should trace back to a specific requirement. If it can’t be traced, it shouldn’t be trusted.
