The Architecture Musings | An Examination of Capability, Context, and the Testable Edge of AI

There is a moment in nearly every enterprise architect's week that looks deceptively administrative. They are documenting business capabilities, mapping applications onto them, and lately marking which of those capabilities might be improved with AI. It has the texture of cataloguing. It is not. The decisions quietly made in that exercise determine where AI will help, where it will mislead while sounding confident, and where it cannot honestly be measured at all.

This note is about three ideas that, taken together, make that exercise far less treacherous: the business capability, the bounded context, and the testability of whatever we are about to automate. I have no intention of making it exhaustive. There are whole books behind each of these terms, and I will name them rather than reproduce them. My aim is narrower. I want to show that these are not three separate conversations but one, and that domain-driven design is the connective tissue an architect can use both to simplify capability work and to decide, with some rigour, where an AI agent belongs and where it does not.

A word on altitude before we start, because most of the confusion in this area comes from arguing at the wrong one. The viewpoint throughout is the enterprise's, not a single application's, and not the code's. I will keep returning to that altitude on purpose.

Setting the Terms

As is my habit, let me borrow some definitions before building anything on them.

Business Capability

What a business is able to do, expressed independently of how it does it, who does it, or where. The Business Architecture Guild, in its BIZBOK Guide, treats capabilities as the stable building blocks of the enterprise, and practitioners often describe the capability map as the business viewed "at rest". The point worth holding is durability: a capability such as Acquire Customer can persist for decades while the methods beneath it churn.

Business Process and Value Stream

A process is how a capability is carried out, an ordered set of activities. A value stream, in the BIZBOK sense, is the end-to-end flow that delivers a result to a stakeholder, and it decomposes into processes; together they are the business viewed "in motion". Where the capability map sits still, processes move constantly. The same Acquire Customer capability may be realised by a paper form this year and an AI-assisted flow the next, without the capability itself changing at all.

Bounded Context

From Eric Evans: a boundary within which a single domain model holds and the ubiquitous language carries one consistent meaning. Step across the border and order, or customer, may mean something else entirely. The bounded context is, before anything else, the place where ambiguity is contained.

Business Service

A capability made consumable: the contracted, externally visible offering, with defined inputs, outputs, and an expected behaviour. In the ArchiMate vocabulary a business service is what a capability and its processes expose to a consumer. In domain-driven terms it is the published interface a context offers across its boundary to those who do not live inside it.

These five terms are routinely muddled in practice, and the muddle is expensive. It is how an enterprise ends up automating the wrong thing, or measuring nothing. Hold a single distinction and the rest settle into place. Capability is what. Process is how. The business service is the contract. And the bounded context is the linguistic border that, if we have done our work honestly, should line up with the capability itself. The business architect and the domain modeller are very often describing the same boundary in two different dialects.

That coincidence is not a curiosity. It is the lever for everything that follows, and it is where the next section begins.

The Bridge: Capability as Bounded Context

I claimed at the end of the last section that the business architect and the domain modeller are often describing the same border. Let me make the claim properly.

When Fowler and Lewis set down the characteristics of microservices, the very first one was that services should be organized around business capability rather than technical layer, and they tied it straight to Conway's law: an organization that splits its teams by technology will produce systems split the same way, to its cost. The domain-driven community arrives at the same place from the modelling side, where the standing advice is to align each bounded context with a business capability. Two communities, two vocabularies, one boundary. The architect who treats them as separate problems is doing the work twice.

Why does the alignment hold so reliably? Because both are answers to the same question: where does one meaning end and the next begin? A capability is coherent when it does one thing the business recognises as one thing. A bounded context is coherent when one ubiquitous language holds inside it without contradiction. Those two conditions tend to be met by the same border, because a capability that needs two languages to describe it is not really one capability, and a context stretched across two capabilities will not hold a single language for long.

This is also why DDD simplifies the capability work rather than competing with it. The genuinely hard part of a capability map is never drawing the boxes; it is knowing where one box honestly ends and the next begins. That is exactly the problem domain-driven design has spent more than twenty years equipping us for. Ubiquitous language is the scalpel. When the same word carries two meanings depending on who is speaking, you are standing on a capability boundary, whether or not the map admits it yet.

In an earlier note I offered a set of heuristics for discovering bounded contexts, and at the time I dressed it up as software-design advice. I will now confess what it really was: a method for finding capability boundaries. Consider the signals. Linguistic boundaries. The shift in meaning of a shared term. Data that is owned in one place and merely referenced in another. Clusters of function and data that travel together. The quiet test of the "-tion" and the "-ing". None of these are coding concerns. They are how an architect tells where one capability stops, and they are why a workshop technique such as Event Storming or domain storytelling earns its place: it is the room in which a group discovers those edges together, in the business's own words.
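To make the first two signals concrete, here is a minimal sketch of how the shift in meaning of a shared term might be surfaced from workshop notes. Everything in it is an assumption for illustration: the stakeholder groups, the glossary entries, and the `contested_terms` helper are hypothetical, and comparing definitions as exact strings is far cruder than what a room of people does in conversation.

```python
from collections import defaultdict

# Hypothetical glossaries gathered in a workshop: group -> {term: meaning}.
glossaries = {
    "sales": {"customer": "a prospect with a signed quote", "order": "a signed quote"},
    "fulfilment": {"customer": "a delivery recipient", "order": "a pick-and-ship request"},
    "billing": {"customer": "a delivery recipient", "invoice": "a request for payment"},
}

def contested_terms(glossaries):
    """Return the terms that carry more than one meaning across groups."""
    meanings = defaultdict(lambda: defaultdict(set))
    for group, glossary in glossaries.items():
        for term, meaning in glossary.items():
            meanings[term][meaning.strip().lower()].add(group)
    return {
        term: {meaning: sorted(groups) for meaning, groups in by_meaning.items()}
        for term, by_meaning in meanings.items()
        if len(by_meaning) > 1  # one word, several meanings: a candidate boundary
    }

conflicts = contested_terms(glossaries)
for term, by_meaning in sorted(conflicts.items()):
    print(term, by_meaning)
```

Each contested term is a prompt for a conversation, not a verdict. The groups that share a meaning are the ones likely to sit inside the same boundary.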

There is a third rendering of the same boundary, and it has become hard to ignore. When Zhamak Dehghani set out data mesh she built it directly on Evans, arguing that the ownership of analytical data should follow the seams of the business domains, which makes a domain's bounded context the natural unit for distributing data ownership. A data product, in her definition, is the whole of code, data, and infrastructure bundled at the granularity of a bounded context. So the same border that names a capability and bounds a context also scopes a data product. From an identification standpoint this is a gift, because you do not run three separate discovery exercises. You find the boundary once, in the business's language, and it hands you the capability the business owns, the context the model must respect, and the data product the domain should serve. From a design standpoint the data product is built as a contract rather than a shared table, served through defined output ports with stated guarantees, which is the same instinct we will reach again when we get to the business service. Capability, context, data product: three views of one seam.

This is also the moment to correct a fashionable understatement. I have heard the business capability described as "the window the architect owns to converse with the business, and only that". The communication value is real. A capability map gives the enterprise a shared, stable vocabulary, and that alone earns its keep. But "only" sells it short. As this section has shown, the capability is not merely how we talk to the business; it is the seam along which the model, the data, and eventually the teams are cut. A window is a thing you look through. A capability is a thing you build on. Treating it as the former and not the latter is how organizations end up with a tidy map that governs nothing.

So, the bridge is not a metaphor. The capability map, the context map, and the data product are three renderings of one underlying structure, and the ubiquitous language is the instrument that keeps all three honest. Hold that, and the two questions waiting for us, which applications support which capability and where AI can safely be introduced, become far easier to answer.

From the Capability Map to the AI Opportunity

Now the practical exercise, the one that started this note. An architect sits down to document the capabilities, map the applications onto them, and mark the AI opportunities. With the bridge from the last section in hand, each of those three steps gains something it did not have before.

Documenting the capabilities comes first. A capability map is a single rationalized business view, one box per capability, decomposed to perhaps three levels, and the Open Group treats it as one of the four core elements of business architecture alongside the value-stream, information, and organization maps. The discipline that matters here is the one from the last section: name each capability in the business's own language, and resist the urge to name it after whatever system happens to deliver it today. A capability called "the SAP thing" is a capability you have already lost.

Then the applications go on top. This is where the map starts to pay rent. When several applications support one capability, you have found a rationalization candidate. When a critical capability rests on aging or fragile systems, you have found a risk. When a capability is supported by nothing, you have found a gap. The technique is mechanical to describe and surprisingly powerful in practice, and it is exactly how large estates discover that one important capability is quietly propped up by dozens of overlapping legacy applications. Tooling from the enterprise architecture vendors, LeanIX, Ardoq, Orbus, Bizzdesign and others, exists largely to make this linkage governable at scale, but the insight does not depend on the tool. Linking an application to the capability it serves is what lets an organization see whether a system supports something the business actually values or merely persists.
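The three findings above are simple enough to write down as rules. The sketch below is illustrative only: the capability names, the applications, the `fragile` flag, and the threshold of two applications for "several" are all assumptions of mine, not something any of the tools mentioned prescribes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Application:
    name: str
    fragile: bool = False  # aging or fragile, as judged by the architect

# Hypothetical mapping: capability -> applications that support it.
support = {
    "Acquire Customer": [Application("crm-a"), Application("crm-b"), Application("leads-db")],
    "Settle Claim": [Application("claims-legacy", fragile=True)],
    "Manage Partner": [],
}
critical = {"Settle Claim"}  # capabilities the business has marked critical

def findings(support, critical, overlap_threshold=2):
    """Label each capability as a gap, a risk, a rationalization candidate, or none."""
    result = {}
    for capability, apps in support.items():
        labels = []
        if not apps:
            labels.append("gap")
        if capability in critical and any(app.fragile for app in apps):
            labels.append("risk")
        if len(apps) >= overlap_threshold:
            labels.append("rationalization candidate")
        result[capability] = labels or ["no finding"]
    return result

report = findings(support, critical)
for capability, labels in report.items():
    print(f"{capability}: {', '.join(labels)}")
```

A capability can earn more than one label, which is why the function returns a list rather than a single verdict.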

Finding the AI opportunity is the third step, and here the established method only takes you halfway. The Open Group's guidance is to heat-map the capabilities along dimensions such as maturity, effectiveness, performance, and the value or cost of each to the business. The bright candidates for investment are the capabilities high in value and low in maturity or performance, and that is, lately, exactly where people draw the AI arrow. Useful, but incomplete, because the heat map answers only one of the two questions that matter. It tells you where AI would be valuable. It says nothing about where AI would be safe.

For safety, look at the boundary, and this is where the bridge earns its keep. A capability that resolves cleanly to a single bounded context, one ubiquitous language, one model, clear ownership of its data, is a capability an AI agent can be introduced into with some confidence, because it has a bounded language to speak and a bounded world to reason about. A capability whose boundary is muddy, where the same terms mean different things to different stakeholders and the data is owned everywhere and nowhere, is precisely where a language model will do what language models do: collapse the distinctions, average the meanings, and hand back something fluent and wrong.

The architect now holds two filters rather than one. The heat map supplies value. The cleanliness of the bounded context supplies safety. The place to begin is the intersection of the two: a high-value, underserved capability that also happens to have a crisp boundary. The places to leave alone, or to repair before automating, are the high-value capabilities with fuzzy borders, because there the AI will not fail loudly. It will pass the demo and mislead in production. And a capability you cannot draw a clean boundary around at all is telling you something the heat map never could, which is that you do not yet understand it well enough to hand it to a machine.
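The two filters can be sketched as two small predicates and a triage over them. The scales, the thresholds, and the idea of counting contested terms as a proxy for a muddy boundary are assumptions made for illustration; a real assessment would rest on judgement that no score captures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    name: str
    value: int            # 1 (low) to 5 (high), from the heat map
    maturity: int         # 1 (low) to 5 (high), from the heat map
    contexts: int         # number of bounded contexts the capability spans
    contested_terms: int  # shared terms that carry more than one meaning

def is_valuable(c):
    # Heat-map filter: high value, low maturity. Thresholds are assumptions.
    return c.value >= 4 and c.maturity <= 2

def is_bounded(c):
    # Boundary filter: one context, one language.
    return c.contexts == 1 and c.contested_terms == 0

def triage(c):
    if is_valuable(c) and is_bounded(c):
        return "start here"
    if is_valuable(c):
        return "repair the boundary first"
    return "not a priority"

# Hypothetical capabilities and scores.
portfolio = [
    Capability("Assess Claim", value=5, maturity=2, contexts=1, contested_terms=0),
    Capability("Acquire Customer", value=5, maturity=1, contexts=3, contested_terms=4),
    Capability("Archive Records", value=2, maturity=4, contexts=1, contested_terms=0),
]
decisions = {c.name: triage(c) for c in portfolio}
print(decisions)
```

The ordering inside `triage` carries the argument: value alone never yields "start here".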

That last point, about not understanding a thing well enough to automate it, is really a point about measurement. It brings us to the awkward question of where the business process sits in all this.

Is the Process the Architect's Concern?

A fair objection arrives here. If the architect's unit of account is the capability, and the capability is deliberately silent about how anything is done, then is the business process any concern of the architect's at all? Process is the territory of business process management and of the people who run the work day to day. It changes constantly. On the face of it, it sits below the altitude this note has been keeping.

My belief is that the process is not the architect's modelling unit, but it is the architect's measurement surface, and that distinction is the whole point.

Recall the framing from the first section. The capability map is the business at rest; the value stream and the processes beneath it are the business in motion. The Business Architecture Guild treats the value stream as one of its core domains precisely because it is where value is actually created, stage by stage, and a value stream decomposes into the processes that realize it. ArchiMate makes the same relationship concrete when it has a business process realize a stage of a value stream. So, process is not absent from the architect's world. It enters at the altitude of the value stream, where the architect cares that value is delivered and can say by how much, and it leaves the architect's hands at the altitude of the swim lane, where the detailed sequence is properly the concern of the process owner.

Now hold that against the AI question, because the two halves fit together exactly. You place an AI opportunity against a capability, because the capability is where the boundary and the language live. But you cannot measure an AI opportunity against a capability, because a capability is an abstraction at rest and has no number attached to it. You measure it against the process or the value stream it is meant to change, because that is where the numbers live: cycle time, throughput, error rate, cost to serve, time to a decision. The capability tells you where the AI goes. The value stream tells you whether it did any good.
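As a minimal sketch of what naming the metric looks like in practice, the snippet below compares a process metric before and during a pilot. The figures, the 20 percent target, and the choice of the median as the summary are all hypothetical; a real comparison would also want enough samples and a like-for-like case mix.

```python
from statistics import median

def relative_change(baseline, pilot):
    """Relative change in the median of a process metric (negative means it fell)."""
    before, after = median(baseline), median(pilot)
    return (after - before) / before

# Hypothetical cycle times, in hours, for one stage of a value stream.
baseline_hours = [40, 44, 38, 52, 46]
pilot_hours = [30, 33, 29, 41, 35]

target = -0.20  # stated in advance: cycle time should fall by at least 20%
change = relative_change(baseline_hours, pilot_hours)
met = change <= target

print(f"median cycle time changed by {change:+.0%}; target met: {met}")
```

The point of the sketch is the order of events: the target is written down before the pilot numbers exist.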

This resolves the tension in the original question. Process matters to the architect not as something to model in detail, but as the surface on which the value of an automation is proven or disproven. An AI initiative that cannot name the process metric it intends to move has chosen a place to stand and no way to know whether standing there helped. In my experience that is the more common failure, and it is the quieter one, because everyone can see the agent working and nobody can say whether the work is worth anything.

There is a stronger version of the same question worth meeting head on, because it is increasingly in fashion: the claim that the architect should measure business performance against the capability itself. I think this is partly right and mostly a stretch, and the difference matters. It is right that a capability can and should be measured along the dimensions of capability-based planning, what tooling vendors such as Bizzdesign frame as strategic importance, maturity, and adaptability. Those are assessments of the capability's health and worth, and they are exactly what feeds the heat map. Some go further: Ronald Ross and the business rules school argue that a capability can carry goal-based and policy-based metrics derived from the strategy it serves, and there is even patented machinery for rolling the performance of underlying entities up into a capability-level service-level expectation. So the idea is not fringe.

Where it stretches is in collapsing the two states we have been careful to keep apart. A capability is the business at rest; performance is generated in motion. The literature on metrics is consistent that operational performance, the cycle time and the conversion and the cost to serve, is the performance of a process, not of an abstraction. To hang a live operational number directly on a capability box is to ask the at-rest map to do the in-motion map's job. The honest model is layered: the value stream and its processes carry the live performance; that performance rolls up to color the capability's heat map as maturity and effectiveness; and the capability itself carries assessment and strategic worth, not the operational dial. Measure the capability's health, by all means. Measuring the business's live performance against it, as though the capability were where the work happens, is the stretch your instinct is right to resist.

Which raises the sharper question, the one this whole note has been walking towards. It is one thing to name the metric. It is another to be able to test it, repeatedly and honestly, on a system that does not behave the same way twice. Before we get there, one term is still owed its definition in motion: the business service.

The Business Service as Contract

The business service is the term most often left vague, and it is the one that does the quiet work of making AI testable, so it earns a short section of its own.

A business service is a capability made consumable. Where the capability says what the business can do, the service is that capability offered to a consumer under a contract: these are the inputs, this is the output, this is the behavior you may rely on. In ArchiMate it is what a capability and its processes expose outward. In the domain-driven vocabulary, it is the published interface a bounded context presents across its boundary, what the strategic design patterns call an Open Host Service speaking a Published Language. I wrote in an earlier note that an agent improvising your domain's grammar is a bug with good manners; the business service is the cure, because it hands the agent a grammar to speak rather than leaving it to invent one.

This gives the architect a second test surface, and a different kind of one. The process, from the last section, is where you measure whether the AI created value: did the cycle time fall, did the error rate drop. The business service is where you assert whether the AI behaved: given these inputs, did it return an output that honors the contract, within the stated tolerance. One is a question of outcome, the other a question of conformance, and a serious automation has to answer both. An agent can move the process metric while quietly breaching the service contract on the cases nobody sampled, and an agent can honor the contract perfectly while moving no metric worth the cost.
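The conformance half of that pair can be sketched as a check of one output against a written contract. The contract shape, the field names, the decision vocabulary, and the 5 percent tolerance below are hypothetical stand-ins for whatever a real business service would publish.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    required_fields: frozenset
    allowed_decisions: frozenset  # the published language for the decision field
    tolerance: float              # permitted relative deviation on the amount

def conforms(contract, output, reference_amount):
    """Return the list of contract breaches; an empty list means the output conforms."""
    breaches = []
    missing = contract.required_fields - set(output)
    if missing:
        breaches.append(f"missing fields: {sorted(missing)}")
    if output.get("decision") not in contract.allowed_decisions:
        breaches.append(f"decision outside published language: {output.get('decision')!r}")
    amount = output.get("amount")
    if isinstance(amount, (int, float)):
        if abs(amount - reference_amount) > contract.tolerance * abs(reference_amount):
            breaches.append("amount outside tolerance")
    return breaches

# Hypothetical contract for a quoting service.
contract = Contract(
    required_fields=frozenset({"decision", "amount"}),
    allowed_decisions=frozenset({"approve", "refer", "decline"}),
    tolerance=0.05,
)

good = conforms(contract, {"decision": "approve", "amount": 102.0}, reference_amount=100.0)
bad = conforms(contract, {"decision": "maybe"}, reference_amount=100.0)
drift = conforms(contract, {"decision": "approve", "amount": 120.0}, reference_amount=100.0)
print(good, bad, drift, sep="\n")
```

Nothing here measures value. An agent could pass every one of these checks and still leave the process metric exactly where it was, which is why the two surfaces are kept apart.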

The deeper value of the service is that it forces the boundary to be written down. A capability with no published service is a capability whose expectations live in people's heads, and you cannot hold a machine to expectations that were never stated. The act of defining the service is the act of making the capability testable. Which is, at last, the subject we have been circling.

The Testable Edge

We can now state the rule the whole note has been moving towards, and I owe it partly to a question a colleague put to me, which ran roughly thus: if I cannot run a reproducible test on a process, I should not enable it with AI, because I will not be able to measure its consistency and accuracy. The instinct is right, and it is already the discipline behind any serious experiment. If you cannot say in advance what success looks like and measure it afterwards, you have not run a test, you have run a demonstration. But the rule needs one correction before it is safe to lean on, and the correction is where the interesting part lives.

The correction is this. You will never get a reproducible test out of the agent's behavior, because agentic systems are not deterministic. The same input yields different reasoning paths, different tool calls, and different outputs from one run to the next, and the agent evaluation literature is blunt that traditional tests, which assume one input gives one output, simply stop working when pointed at them. So, if "reproducible test on the process" is read as the process behaving identically each time, the rule would forbid almost everything useful, because almost nothing built on a language model behaves identically twice.

What one makes reproducible is not the behavior but the yardstick. You fix an evaluation set, you version it, and you score against it with methods stable enough to trust, deterministic checks where the answer is structural and calibrated judgement where it is not, reading the result as a distribution rather than a single pass. This is also where two words, consistency and accuracy, turn out to be two measurements rather than one. A benchmark study of tool-using agents, τ-bench, found that even strong function-calling agents succeeded on fewer than half the tasks, and that in its retail domain fewer than a quarter of the tasks were solved on all eight of eight repeated trials, a reliability gap its authors captured in a metric they named pass^k (every one of k trials succeeds), as distinct from the more familiar pass@k (at least one of k succeeds). Accurate some of the time, in other words, and not consistent. You only see that gap if the test runs the same task many times against the same fixed set, which is precisely the reproducible evaluation your rule should be demanding, in place of the reproducible behavior it can never have.
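The difference between the two measurements is easiest to see in numbers. The sketch below computes the all-of-k reliability measure, written pass^k in the τ-bench paper, next to the any-of-k pass@k, using the standard combinatorial estimators over n recorded trials with c successes. The per-task success counts are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated chance that at least one of k trials succeeds (c successes in n trials)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimated chance that all k trials succeed (c successes in n trials)."""
    return comb(c, k) / comb(n, k)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical results: successes out of 8 trials for each task in a fixed set.
trials = 8
successes = [8, 6, 5, 8, 2, 0, 7, 4]

accuracy = mean(pass_hat_k(trials, c, 1) for c in successes)     # single-attempt success
consistency = mean(pass_hat_k(trials, c, 8) for c in successes)  # succeeds every time
any_of_8 = mean(pass_at_k(trials, c, 8) for c in successes)      # succeeds at least once

print(f"pass^1 = {accuracy:.3f}, pass^8 = {consistency:.3f}, pass@8 = {any_of_8:.3f}")
```

On these invented counts the agent succeeds on 62.5 percent of single attempts, solves only 25 percent of tasks every time, and solves 87.5 percent at least once. The same runs tell three different stories, and only the middle one speaks to consistency.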

For an agent specifically, even a fixed yardstick is not the whole story, because an agent can arrive at a right answer through reasoning that will not survive a slightly different case, and the evaluation literature treats a correct outcome reached by flawed reasoning as a reliability risk rather than a success. The serious practice looks at the trajectory and the outcome both, while resisting the opposite error of grading a rigid, prescripted path. Anthropic's recent engineering guidance on agent evaluation puts it plainly: grade what the agent produced, not the exact route it took, because a capable agent will find a route you did not foresee. A right answer reached by luck is still a liability wearing the costume of a success.

Two refinements follow, and both are familiar.

First, the bar should scale with how much rope you give the agent, which is only risk tiering by another name. An advisory agent with a human reading every output, the Scribe and the Critic from an earlier note, can live with a lighter evaluation, because the human is the acceptance gate. An agent that acts without that gate needs the full apparatus, trajectory and outcome, consistency and accuracy, run at volume. The frameworks the field is converging on, from the NIST AI Risk Management Framework and its insistence that a system be valid and reliable within its context of use, to the tooling that versions evaluation datasets for repeatability, all say the same thing in different dialects: reliability is a property you measure, not a property you hope for.

Second, and this is the line that ties the whole note together, an untestable process is almost always an unbounded one. If you cannot write a reproducible evaluation for a piece of work, look closely and you will usually find that its success criteria were never agreed, which means its ubiquitous language was never settled, which means its bounded context, and therefore the capability beneath it, is still a blur. Your rule and the rule from the third section are the same rule seen from two ends. There I said a clean boundary is where AI is safe and a muddy one is where it returns confident nonsense. Here we can say why: the clean boundary is the one you can write a test for. Untestable, unbounded, and unsafe are three words for one condition.

So the rule, corrected, holds and is worth keeping. If you cannot construct a reproducible evaluation, a fixed set of cases and a stable way to score consistency and accuracy against them, then you do not yet understand the work well enough to hand it to an agent. The right move at that point is not to deploy and hope. It is to go back and find the boundary you have not yet drawn.
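One small, practical piece of "a fixed set of cases" is being able to prove that two runs were scored against the same set. A minimal sketch, assuming the cases are plain JSON-serializable records with an `id` field; the case contents, the helper names, and the twelve-character fingerprint are illustrative choices, not a standard.

```python
import hashlib
import json

def fingerprint(cases):
    """Stable content hash of an evaluation set, independent of case order."""
    ordered = sorted(cases, key=lambda case: case["id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def comparable(run_a, run_b):
    """Two scores may be compared only if they were measured on the same set."""
    return run_a["eval_set"] == run_b["eval_set"]

# Hypothetical evaluation cases for a claims-triage agent.
cases = [
    {"id": "c1", "input": "water damage, policy active", "expected": "approve"},
    {"id": "c2", "input": "claim filed after lapse", "expected": "decline"},
]

run_march = {"eval_set": fingerprint(cases), "pass_rate": 0.71}
run_april = {"eval_set": fingerprint(list(reversed(cases))), "pass_rate": 0.78}

# Quietly editing a case yields a different set, and a score that is not comparable.
edited = [dict(cases[0], expected="refer"), cases[1]]
run_edited = {"eval_set": fingerprint(edited), "pass_rate": 0.90}

print(comparable(run_march, run_april), comparable(run_march, run_edited))
```

Recording the fingerprint beside every score is cheap, and it removes the most common way a reproducible evaluation stops being one: the set drifting while the dashboard keeps the same name.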

Let's Summarize

We covered a fair distance, so let me draw the threads together. The business capability, the bounded context, and the data product are three renderings of one boundary, and domain-driven design, with the ubiquitous language at its centre, is the instrument that finds that boundary and keeps all three honest. The capability is more than a window for talking to the business; it is the seam along which the model, the data, and eventually the teams are cut, and that is why getting it right simplifies everything downstream, including the search for where AI belongs.

That search needs two filters, not one. The heat map tells you where AI would be valuable; the cleanliness of the bounded context tells you where it would be safe; and the place to begin is the intersection. Process is not the architect's modelling unit but the architect's measurement surface, which is why you place AI against a capability and measure it against the value stream, and why measuring the business's live performance against the capability itself is a stretch worth resisting. The business service is the contract you hold the result to, the conformance test that sits beside the value test.

And the edge of all of it is testability. Reproducibility belongs to the evaluation, never to the agent's behavior; consistency and accuracy are two measurements, not one; the bar rises with autonomy. The single sentence to carry away is the one the last section earned: an untestable process is an unbounded one, and a thing you cannot draw a clean boundary around, or write a reproducible test for, is a thing you do not yet understand well enough to hand to a machine.

...and some references

As ever, this was written across several sittings, and some links may have aged; write to me if I have misattributed something or missed a debt.

Domain-driven design and microservices

Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software (Addison-Wesley, 2003)
Vaughn Vernon, Implementing Domain-Driven Design (Addison-Wesley, 2013)
Martin Fowler and James Lewis, "Microservices" (2014), https://martinfowler.com/articles/microservices.html
James Lewis, "Microservices and the inverse Conway manoeuvre" (GOTO Copenhagen, 2015)
Melvin Conway, "How Do Committees Invent?" (1968), http://www.melconway.com/Home/Committees_Paper.html
Business Architecture Guild, A Guide to the Business Architecture Body of Knowledge (BIZBOK Guide), https://www.businessarchitectureguild.org
William Ulrich and Jim Rhyne, "Business Architecture: The Real Tie that Binds" (Business Architecture Guild)
The Open Group, TOGAF Standard, Business Architecture: Business Capabilities, https://pubs.opengroup.org/togaf-standard/business-architecture/business-capabilities.html
The Open Group, ArchiMate Specification (business service, business function, and value-stream relationships)
Bizzdesign, "How to measure business capability aspects," https://bizzdesign.com/blog/how-to-measure-business-capability-aspects
Ronald G. Ross, "Strategy-Based Metrics for Measuring Business Performance," Business Rules Journal, https://www.brcommunity.com/articles.php?id=b659
Zhamak Dehghani, "Data Mesh Principles and Logical Architecture" (2020), https://martinfowler.com/articles/data-mesh-principles.html
Zhamak Dehghani, Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly, 2022)
Anthropic, "Demystifying evals for AI agents" (Anthropic Engineering, 2026), https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Shunyu Yao, Noah Shinn, Pedram Razavi and Karthik Narasimhan, "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains," arXiv:2406.12045 (2024), https://arxiv.org/abs/2406.12045
NIST, AI Risk Management Framework (AI RMF 1.0), the Measure function, https://www.nist.gov/itl/ai-risk-management-framework
Databricks, "What is AI Agent Evaluation?" (on versioned datasets and reproducible evaluation), https://www.databricks.com/blog/what-is-agent-evaluation

Companion notes

The Stranger Who Speaks Every Language, on what language models do to bounded contexts
The Machine That Auditioned for Three Parts, on where AI belongs in solution architecture
Plowing Architectural Notions

Cheers,
Mohammad Malekmakan

Disclaimer:

All opinions and content published on my blog and my social networks are solely my own, not those of my employer(s) or the communities I contribute to.