Most conversations about adopting AI in business focus on capability. Can it do the task? How fast? How accurately? What does the demo look like?
These are fine questions. They're just not the most important ones.
The questions that matter come after the demo. What happens when the output is wrong? Who owns that? And crucially: how would you even know?
There are three separate problems here, and they tend to get tangled together. Unpicking them is useful.
The first problem is accountability
When a third-party AI system produces an incorrect output (a miscalculation, a bad recommendation, a report built on flawed data), the question of who is responsible is rarely as clear as it should be.
Most AI vendor contracts are written to limit the vendor's liability for output accuracy. The software is a tool. What you do with the output is your decision. This is not buried in the small print; it's usually the central position. You are the professional. You are accountable for the work product.
This means if you adopt an AI tool for any serious function and something goes wrong, the liability typically stays with you or your organisation. The vendor supplied a calculator. You chose to trust the answer.
This is worth confirming explicitly with whoever advises you on compliance or professional liability before you deploy anything, not after. The question is not "is the AI reliable?" The question is "who carries the risk if it isn't?"
The second problem is verification
Once you understand that liability stays with you, the natural follow-up is: how much checking are you expected to do?
This is a practical resource question as much as a governance one. Checking every AI output defeats the efficiency argument for adopting AI in the first place. Checking none of them is not a defensible position if something goes wrong. So what's the standard?
There is no universal answer to this, which is part of the problem. In regulated environments, the standard tends to be "enough to demonstrate that a reasonable professional exercised appropriate oversight." What that means in practice depends on the stakes of the specific task, the track record of the specific tool, and the judgment of the specific professional signing off.
For lower-stakes procedural tasks, statistical sampling is usually defensible. For high-stakes outputs, anything that gets used without human review is a governance question worth having explicitly documented. Not because auditors are coming, but because the discipline of writing it down forces you to think clearly about where the risk actually sits.
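One way to make that documented discipline concrete is to route outputs to human review at a rate that depends on the stakes. The sketch below is illustrative only: the tier names, the sampling rates, and the `needs_human_review` helper are invented for this example, not a recommended standard.

```python
import random

# Hypothetical review rates per stakes tier. The tiers and numbers are
# illustrative; the point is that the policy is explicit and auditable.
SAMPLE_RATES = {"low": 0.05, "medium": 0.25, "high": 1.0}

def needs_human_review(stakes: str, rng: random.Random) -> bool:
    """Decide whether a given AI output is routed to a human reviewer.

    High-stakes outputs are always reviewed; lower tiers are sampled
    at a documented rate, which is what makes the policy defensible.
    """
    return rng.random() < SAMPLE_RATES[stakes]

rng = random.Random(42)  # seeded so the routing decisions are reproducible
outputs = [("invoice total", "low"), ("tax filing", "high"), ("client memo", "medium")]
for name, stakes in outputs:
    routed = needs_human_review(stakes, rng)
    print(name, "->", "human review" if routed else "sampled out")
```

Writing the rates into a table like this, rather than leaving them to individual judgment per output, is the "writing it down" step the paragraph above describes: the risk allocation is visible before anything goes wrong.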
The third problem is the hardest one, and it's the one most vendors can't answer cleanly
For procedural AI tasks, the ones doing calculations or sorting data according to defined rules, you can usually trace the output back to the inputs and verify the logic. It's complicated, but it's auditable in principle.
For reasoning AI tasks (analysis, synthesis, advice, recommendations), the question of how the system reached its conclusion is much harder to answer. Not harder in the sense that vendors are hiding something. Harder in a more fundamental sense.
Large language models do not produce outputs the way a spreadsheet formula produces outputs. They generate responses based on patterns learned across enormous volumes of training data, shaped by the prompt and any documents you've provided. The process is probabilistic, not deterministic. The same prompt can produce different outputs on different runs. There is no single chain of reasoning you can step through and verify.
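The distinction can be shown with a toy example. A deterministic rule, like a spreadsheet formula, maps the same input to the same output every time; a language model samples from a probability distribution over candidate next tokens, so repeated runs can diverge. The candidate words and scores below are invented purely to illustrate the mechanism.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores a model might assign to candidate next words.
candidates = ["approve", "review", "reject"]
logits = [2.0, 1.5, 0.5]

def deterministic_pick(logits):
    # Spreadsheet-style: always take the highest score. Same input,
    # same output, every run.
    return max(range(len(logits)), key=lambda i: logits[i])

def sampled_pick(logits, rng):
    # LLM-style: draw a candidate in proportion to its probability.
    # Same input, potentially different output on each run.
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random()
print("deterministic:", {candidates[deterministic_pick(logits)] for _ in range(5)})
print("sampled:", {candidates[sampled_pick(logits, rng)] for _ in range(20)})
```

The deterministic picker produces a set with one element; the sampler almost always produces several. That gap, scaled up to billions of parameters, is why there is no single reasoning chain to step through.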
This creates a specific problem for professional contexts: if you cannot see what sources the AI used, what it considered irrelevant, or what data it failed to collect, you cannot fully audit whether the output is complete and correct. You can check whether the answer looks right. You cannot always check whether something important was quietly left out.
That is a different kind of risk from a calculation error, because it's invisible by default.
Some systems are better than others at this. Tools built on retrieval-augmented generation, where the AI explicitly pulls from a defined document set and can cite sources, give you at least partial auditability. You can see what documents it referenced. You can check whether the right ones are in the set. You cannot always tell why it weighted one passage over another, but it's a start.
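The retrieval half of that idea can be sketched in a few lines. This toy retriever uses simple word overlap in place of a real embedding model, and the document names and contents are invented; what it demonstrates is the auditability property, namely that every answer comes back with the identifiers of the documents it drew on.

```python
# A toy retriever: score documents by word overlap with the query and
# return the top matches with their ids, so the "citations" are visible.
# Real RAG systems use embedding similarity; overlap stands in for it here.

DOCUMENTS = {
    "policy-2024.txt": "expense claims over 500 require director approval",
    "handbook.txt": "annual leave requests go through the hr portal",
    "memo-17.txt": "director approval is needed for all travel expense claims",
}

def retrieve(query: str, k: int = 2):
    q_words = set(query.lower().split())
    scored = []
    for doc_id, text in DOCUMENTS.items():
        overlap = len(q_words & set(text.lower().split()))
        scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    # Return ids alongside text: the caller can audit exactly what was used,
    # and can also check whether the right documents were in the set at all.
    return [(doc_id, DOCUMENTS[doc_id]) for score, doc_id in scored[:k] if score > 0]

for doc_id, text in retrieve("who must approve expense claims"):
    print(doc_id, "->", text)
```

Note what this does and does not give you, mirroring the paragraph above: you can see which documents were referenced and which were excluded, but nothing here explains why the generation step would weight one retrieved passage over another.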
Fully closed systems, where the AI draws on its training data with no visibility into the sources, offer the least auditability for professional use. The output might be excellent. You just have limited ability to verify the reasoning.
When evaluating any AI tool for a serious professional function, the questions worth asking the vendor are:
- Can I see which sources or documents were used to produce this output?
- If the system decided something was not relevant to the query, is there any record of that?
- If the output draws on data that could be time-sensitive or client-specific, what controls exist to ensure it's using the correct data set?
- What does the system do when it is uncertain? Does it flag uncertainty, or does it produce a confident-sounding answer regardless?
If the vendor cannot answer these questions in concrete terms, that is information. It doesn't mean the tool is bad. It means you're taking on more of the verification burden yourself.
None of this is an argument against using AI in professional work. The tools are genuinely useful and the efficiency gains are real.
It is an argument for being clear-eyed about where the risk sits, rather than discovering it in a situation where something has gone wrong and everyone is looking for whose problem it is.
The organisations that will handle AI adoption well are the ones that ask these questions before they sign anything, build verification into their processes in proportion to the stakes involved, and treat AI outputs the way they would treat work from a capable but junior colleague: useful, often good, worth checking before it goes anywhere that matters.
That mindset is not scepticism about AI. It's just professional discipline applied consistently.
