The Bank of England stress-tested an LLM — and the failures were quiet

Here's a small exercise. Think about the last time someone on your team suggested using a large language model to generate something that would feed into a decision — synthetic borrower profiles for a stress test, maybe, or a first-pass segmentation of your corporate book. Now ask yourself: how did you validate the output?

If the answer is "we eyeballed it" or "we checked a few examples and they looked reasonable," you're in good company. That's what most teams do. And last week, the Bank of England published a paper that shows exactly why eyeballing isn't enough.

What the Bank actually did

Staff Working Paper No. 1,190, published on 19 June, describes an experiment where the authors used GPT-3.5 Turbo to simulate the Bank's own Inflation Attitudes Survey — the quarterly poll of thousands of UK households that feeds into the MPC's understanding of how people perceive and expect inflation.

The setup is straightforward. They gave the model demographic personas — varying income, housing tenure, social class, age — and asked it the same questions the real survey asks real people. Then they compared the LLM's answers to the actual survey data.

Some of the results were impressive. The model reproduced known demographic gradients: lower-income personas reported higher perceived inflation, housing tenure shaped inflation perceptions in ways that matched the real data, and so on. If you squinted at the top-line charts, you'd think the LLM had learned something real about how different types of people experience price changes.

But the paper didn't stop at the top line. The authors applied a Shapley value decomposition to the model's responses to see how much each persona characteristic was contributing to each answer, and that's where it got uncomfortable.

The kinks that matter

The LLM exhibited what the authors call "unexplained kinks." In certain demographic combinations, the model's inflation expectations jumped or dropped in ways that had no counterpart in the real survey data. It wasn't random noise. It was systematic — the model had learned patterns from its training data that looked plausible on the surface but didn't correspond to how actual UK households behave.

This is the part that matters for anyone in banking. The kinks weren't obvious. They didn't show up as absurd numbers or broken outputs. They showed up as subtle biases in specific subgroups — exactly the kind of error you'd miss if you validated by checking a handful of examples and moving on.

The Bank's researchers caught them because they had a ground truth to compare against: decades of real survey data, with known demographic patterns, that they could use as a benchmark. They found that the model's internal weighting of input features didn't match reality.
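For readers who want to see what a decomposition of this kind looks like mechanically, here is a minimal sketch of an exact Shapley calculation on a toy example. It is not the paper's implementation. The response function, the persona attributes and every number in it are hypothetical stand-ins chosen only to make the arithmetic visible.

```python
from itertools import combinations
from math import factorial

# Hypothetical reference persona and the persona we want to explain.
BASELINE = {"income": "mid", "tenure": "owner", "age": "35-54"}
PERSONA = {"income": "low", "tenure": "renter", "age": "18-34"}


def toy_response(p):
    """Made-up stand-in for a model's answer (perceived inflation, %)."""
    x = 3.0
    if p["income"] == "low":
        x += 1.0
    if p["tenure"] == "renter":
        x += 0.5
    if p["income"] == "low" and p["tenure"] == "renter":
        x += 0.4  # interaction effect, shared between the two features
    return x


def value_fn(coalition):
    """Response when only the features in `coalition` take the persona's values."""
    p = {k: (PERSONA[k] if k in coalition else BASELINE[k]) for k in BASELINE}
    return toy_response(p)


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average marginal contribution per feature."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


phi = shapley_values(value_fn, list(BASELINE))
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions add up to the gap between the persona's answer and the baseline's answer, which is what makes the method useful for validation: you can put the model's per-feature contributions next to the ones implied by your own data and see where they part ways. Exact enumeration like this only works for a handful of features; with more, you would need an approximation.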

Why this isn't just an academic problem

The Bank of England had a dedicated research team, a known ground truth, and a well-understood survey instrument. They still found systematic errors. What does that mean for a credit team with less infrastructure?

I'll be honest: I've signed off on LLM-generated outputs with exactly the kind of validation this paper exposes as insufficient. Checked a few examples, confirmed they looked sensible, moved on. At the time it felt pragmatic. Reading this paper, it feels lazy. The difference between "looks reasonable" and "is actually correct" is precisely where these quiet failures live.

I've seen teams use LLMs to generate synthetic borrower profiles for portfolio stress testing. I've seen them used to augment thin data in sector-specific credit models. In every case, the validation was some version of "the outputs look reasonable to a human reviewer."

The BoE paper suggests that's not enough. The failures aren't in the outputs that look wrong. They're in the outputs that look right but carry hidden biases in specific subgroups. In a credit context, that could mean your synthetic stress scenarios systematically underweight losses in a particular borrower segment, or your segmentation model treats two genuinely different risk profiles as equivalent because the LLM's training data conflated them.

If we stripped the LLM's output back to the input features and decomposed what's actually driving each prediction, would the weightings match what we know from our own historical data?

If nobody on your team can answer that question, you don't have a validation process. You have a hope.

What the paper gives you — for free

The useful thing about this working paper isn't the headline finding. It's the methodology. The authors effectively published a template for testing whether an LLM's outputs are reliable for a specific task: define your ground truth, generate outputs across a structured set of input variations, decompose the drivers, and compare.

That template is directly transferable. If you're using an LLM to generate anything that feeds into a risk or credit decision — even indirectly — you can apply the same discipline. Define what "correct" looks like from your own historical data. Vary the inputs systematically. Check whether the model's sensitivities match yours.
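As a rough illustration of the "vary the inputs and compare" step, the sketch below checks model output against a benchmark across a grid of segments and flags any cell that deviates by more than a tolerance once the overall level shift is removed. The segment names, the tolerance and all the numbers are invented for the example; in practice the benchmark would come from your own historical data and the model column from your LLM runs.

```python
from itertools import product
from statistics import median

INCOMES = ["low", "mid", "high"]
TENURES = ["renter", "mortgagor", "owner"]
CELLS = list(product(INCOMES, TENURES))  # structured grid of input variations

# Made-up benchmark values standing in for historical segment averages.
benchmark = dict(zip(CELLS, [4.5, 4.2, 4.0, 4.0, 3.7, 3.5, 3.6, 3.3, 3.0]))

# Made-up model output: a uniform +0.2 offset, plus one planted kink.
model = {cell: value + 0.2 for cell, value in benchmark.items()}
model[("mid", "mortgagor")] += 1.1


def flag_kinks(benchmark, model, tolerance):
    """Return (cell, deviation) pairs where the model departs from the
    benchmark by more than `tolerance`, after removing the median gap."""
    gaps = {cell: model[cell] - benchmark[cell] for cell in benchmark}
    typical_gap = median(gaps.values())  # robust to a few kinked cells
    return [
        (cell, gap - typical_gap)
        for cell, gap in gaps.items()
        if abs(gap - typical_gap) > tolerance
    ]


flagged = flag_kinks(benchmark, model, tolerance=0.5)
for cell, deviation in flagged:
    print(cell, f"{deviation:+.2f}")
```

Removing the median gap first is a design choice, not a requirement: it separates "the model runs a bit high everywhere" from "the model is off in one segment," and the second is the kind of quiet failure a spot check tends to miss.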

It's not glamorous work. It's the kind of thing that takes a competent analyst a week, not a weekend. But the alternative is trusting outputs you haven't actually verified, from a model whose internal logic you don't control, in a domain where the errors are quiet and the consequences aren't.

The takeaway

This week, find the one place in your analytics pipeline where an LLM's output feeds — directly or indirectly — into a credit or risk decision. Ask the person who built it to show you the validation. Not "we checked some examples." The actual decomposition: which input features drive the output, and do those weightings match what your own portfolio data says they should be? If that decomposition doesn't exist, the BoE paper gives you a free template for building one. It's worth reading the methodology section even if you skip the inflation economics.

— Aksel

The Analytical Banker is a weekly note on data, analytics, and AI inside corporate banking — written for finance leaders who actually have to make this stuff work. Reply to this email if something here resonates, or forward it to a colleague who'd benefit.