Your AI Training Is Flying on Vibes
You measure satisfaction. You measure completion. You have no idea if anything changed.
We are flying aircraft with no altimeter.
Every Fortune 500 company has an AI training program now. Billions of dollars are flowing into enablement, upskilling, "AI literacy." And almost universally, these programs measure their effectiveness the same way: satisfaction surveys and completion rates.
Did participants enjoy it? Did they finish it? Great. What we don't know: Did anything actually change?
I've been running AI enablement programs since 2024. Six-session cohorts, intensive workshops, one-on-one coaching. Every program generates artifacts: prompts, workflows, configured tools. What we don't have is a systematic way to measure growth across those artifacts or predict who needs what next. We rely on confidence surveys and NPS scores -- the same instruments the research literature tells us are nearly useless for predicting actual performance.
This is my attempt to fix that. It's called RANGE.
The Measurement Gap
Here's the uncomfortable truth about your AI training metrics: the personnel selection literature has known for decades that self-report measures barely predict real-world performance.
Sackett et al.'s 2022 meta-analysis -- correcting systematic errors in earlier work -- found that structured interviews predict job performance with a validity of .42 (a correlation coefficient, where 1.0 would be perfect prediction). Work samples come in at .33. Self-report measures? .07 to .15.[^1]
Read that again. The gap between behavioral assessment and self-report is not incremental. It's a different category of measurement. Yet the AI training industry overwhelmingly relies on the weakest type of measurement available. We ask people if they feel more confident. We ask if they'd recommend the training. We put those numbers in a slide deck and call it evidence.
It's not evidence. It's vibes.
The problems are acute for AI specifically. Most people are novices. Novices systematically overestimate their own competence -- the Dunning-Kruger effect is maximally distorting in exactly this kind of domain. Someone who prompts ChatGPT with "write me a marketing email" and gets a mediocre result may rate themselves highly if they've never seen expert prompting in action. They don't know what they don't know. And our surveys can't tell the difference.
What RANGE Is
RANGE stands for Reach, Autonomy, Navigation, Generalization, Execution Fidelity. It's a framework for decomposing AI fluency into five dimensions that practitioners can see, name, and act on.
The easiest way to understand it: think about a fitness assessment. A good coach doesn't just tell you "you're in shape." They measure endurance, strength, flexibility, and balance separately -- because someone strong but inflexible needs a completely different program than someone flexible but weak. The Dreyfus model tells you someone is "competent" with AI. RANGE tells you what kind of competent -- and what to prescribe next. Someone high on Reach but low on Execution Fidelity takes on ambitious projects and ships sloppy work. Someone high on Execution Fidelity but low on Navigation delivers quality on familiar tasks but freezes when the problem is ambiguous.
Here's what each dimension captures:
| Dimension | What it measures |
|-----------|------------------|
| Reach | How far into unfamiliar territory can you extend AI-augmented work? |
| Autonomy | How independently can you close AI-augmented loops? |
| Navigation | How well do you orient in ambiguity and choose the right path? |
| Generalization | How well do you transfer AI skills across domains and contexts? |
| Execution Fidelity | How reliably do you ship quality outcomes with AI? |
Each dimension is rooted in established psychology. Reach maps to self-efficacy -- your domain-specific belief that your actions can produce desired effects.[^2] Autonomy maps to locus of control -- whether you attribute outcomes to your own actions or to external forces.[^3] Navigation draws on self-regulated learning. Generalization draws on transfer theory. Execution Fidelity maps to psychological ownership -- the feeling that something is "mine" to ship.
These aren't abstract categories. They're things a facilitator can observe in a two-hour workshop.
The person who only picks the safe exercise? Low Reach. The person who raises their hand every three minutes? Low Autonomy. The person who stares at an open-ended task and says "what am I supposed to do"? Low Navigation. The person who nails the writing prompt but can't adapt it for data analysis? Low Generalization. The person who builds something impressive but never checks if it actually works? Low Execution Fidelity.
Every one of these is a real person I've worked with. And every one of them needs a different intervention.
Gap Signatures: Why Profiles Matter More Than Scores
This is where RANGE earns its keep. Two learners at the same overall score with different gap signatures need completely different training.
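To make that concrete, here's a minimal sketch in Python. The scores are hypothetical (1-to-5 scale), and `gap_signature` is an illustrative helper, not part of any validated instrument. The two profiles mirror the ones described next:

```python
from statistics import mean

# Two hypothetical learners on a 1-5 scale. Both average 3.0 overall,
# but their gap signatures call for completely different interventions.
profile_a = {"Reach": 5, "Autonomy": 1, "Navigation": 3,
             "Generalization": 3, "Execution Fidelity": 3}
profile_b = {"Reach": 3, "Autonomy": 5, "Navigation": 3,
             "Generalization": 1, "Execution Fidelity": 3}

def gap_signature(profile: dict[str, int], threshold: float = 1.0) -> list[str]:
    """Dimensions lagging the learner's own mean by at least `threshold`."""
    avg = mean(profile.values())
    return [dim for dim, score in profile.items() if avg - score >= threshold]

print(mean(profile_a.values()), gap_signature(profile_a))  # -> 3, ['Autonomy']
print(mean(profile_b.values()), gap_signature(profile_b))  # -> 3, ['Generalization']
```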
Profile A: High Reach, Low Autonomy. This person attempts ambitious tasks but needs constant hand-holding. They'll volunteer for the stretch exercise, then immediately ask "is this right?" every two minutes. They believe AI can do big things. They just don't believe they can direct it without supervision.
The intervention: scaffolded solo practice with progressive fading. Start with full support, then partial support, then minimal guidance, then independent execution. At each step, explicit debriefs that connect outcomes to the learner's choices: "The AI produced this because you structured the prompt this way." Attribution retraining shifts from "the AI is smart" to "I directed the AI effectively."
Profile B: High Autonomy, Low Generalization. This person works independently and confidently -- in their one domain. They've got a killer marketing prompt workflow. Ask them to use similar techniques for project planning and they're lost. They've learned procedures, not principles.
The intervention: explicit bridging instruction. And here the evidence delivers an uncomfortable finding. Sala and Gobet's second-order meta-analysis shows the unbiased far-transfer effect size is essentially zero.[^4] Training someone to use AI for writing will not automatically transfer to using AI for data analysis, no matter how similar the underlying skills feel to the trainer. Transfer requires deliberate intervention: making learning contexts similar to target contexts, and explicitly abstracting principles for conscious application across domains. "What principle from your writing prompts applies to analysis prompts?" That question doesn't happen by accident. You have to design for it.
This is the fundamental advantage of a multi-dimensional framework over a single score. "Improve your AI skills" is useless advice. "Your Reach is ahead of your Navigation -- try tackling an ambiguous problem where you have to define the approach before you start prompting" is actionable.
What We Found So Far
I'll be direct: RANGE is a hypothesis, not a validated instrument. But the programs it's shaping are producing real results.
Our first intensive cohort -- 14 participants across two sessions -- generated these numbers:
- NPS: 100. Six survey respondents, scoring 10, 9, 10, 10, 9, 10 -- average 9.67, all promoters.
- Confidence shift: +150%. From 1.67 to 4.17 out of 5.
- Completion: 100%. Every respondent finished all four builds.
- Instruction clarity: 5 out of 5 across the board.
Yes, I know what I just said about confidence surveys and NPS being weak measures. I'm citing them because they're what we have right now, and because the extreme values are worth noting even with a weak instrument. A confidence shift from 1.67 to 4.17 on a 5-point scale, with 100% task completion, is a signal -- not proof, but a signal -- that something real happened in the room.
The qualitative data reinforces it. One participant wrote: "Lee makes it very approachable. I am determined even more to master Claude now." Another said the best part was "leaving with tangible tools that I can use in my work." These aren't people rating a webinar. They built things, and they left with those things.
What We're Testing Next
Phase 1 is a measurement overlay on top of the existing curriculum. The plan:
1. One behavioral assessment per dimension. A Prompt Quality Evaluator for Execution Fidelity. A Problem Selection Task for Reach -- present five to eight problems of varying scope, see which ones participants choose to tackle, score for ambition and self-awareness. A Transfer Challenge for Generalization -- take something you built in one domain and apply it to a different one. Navigation and Autonomy start as structured observation notes.
2. A 10-item self-assessment. Two items per dimension, administered at the first and last sessions alongside the existing diagnostic. Takes two minutes.
3. A spreadsheet. One row per participant per session. No database. No dashboards. Just data.
4. One cohort. Run the measurement through one complete program. Look at the trajectories. Do they differentiate? Do gap signatures make sense? Does the self-report correlate with the behavioral measures?
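For step 4, here's a minimal analysis sketch, assuming the spreadsheet is exported as a CSV with hypothetical column names (`participant`, `session`, and paired `self_*` / `behavioral_*` scores per dimension):

```python
import pandas as pd

# Hypothetical CSV export of the Phase 1 spreadsheet: one row per
# participant per session, scores on whatever scale each instrument uses.
df = pd.read_csv("phase1_cohort.csv")

dimensions = ["reach", "autonomy", "navigation",
              "generalization", "execution_fidelity"]

# Per-participant trajectory: last-session score minus first-session score.
first = df.sort_values("session").groupby("participant").first()
last = df.sort_values("session").groupby("participant").last()
for dim in dimensions:
    delta = last[f"behavioral_{dim}"] - first[f"behavioral_{dim}"]
    print(f"{dim}: mean gain {delta.mean():+.2f}, spread {delta.std():.2f}")

# Does self-report track the behavioral measure? Spearman is a
# reasonable default for small-N ordinal data like this.
for dim in dimensions:
    rho = df[f"self_{dim}"].corr(df[f"behavioral_{dim}"], method="spearman")
    print(f"{dim}: self vs. behavioral rho = {rho:.2f}")
```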
Phase 1 is a signal test, not a validation study. It tells us whether the instrument produces differentiated, interpretable data worth investing in -- not whether the framework is psychometrically sound. We're honest about that distinction.
The decay question matters too. Arthur et al.'s meta-analysis shows severe skill decay after a year or more of nonuse -- roughly a standard deviation and a half of loss.[^5] Tatel and Ackerman's 2025 meta-analysis -- 1,344 effect sizes from 457 reports -- estimates the half-life for accuracy-based cognitive skills at approximately 6.5 months.[^6] AI prompting and workflow design are cognitive, accuracy-based tasks. The fastest-decaying category. A program that ends without reinforcement will see learners lose roughly half their gains by month seven -- and well over half by month nine.
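To see what that half-life implies, here's the arithmetic under a simple exponential-decay model (the functional form is my assumption; the meta-analysis reports a half-life, not a curve):

```python
# Retention under simple exponential decay, assuming the ~6.5-month
# half-life for accuracy-based cognitive skills applies to AI skills.
HALF_LIFE_MONTHS = 6.5

def retention(months: float) -> float:
    """Fraction of the original skill gain remaining after `months` of nonuse."""
    return 0.5 ** (months / HALF_LIFE_MONTHS)

for m in (3, 6.5, 9, 12):
    print(f"month {m:>4}: {retention(m):.0%} of gains remain")
# month    3: 73% of gains remain
# month  6.5: 50% of gains remain
# month    9: 38% of gains remain
# month   12: 28% of gains remain
```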
That's why the homework loop -- participants using AI on real work between sessions -- isn't a nice-to-have. It's the primary decay prevention mechanism. Without on-the-job application and spaced retrieval practice, any training program is building on sand.
We're not alone in this direction. Anthropic just published their AI Fluency Index (February 2026), analyzing 9,830 real conversations to measure observable behavioral fluency.[^7] Their key finding reinforces our thesis: iteration is the primary correlate of fluency -- users who iterated showed 2.67x more fluency behaviors than those who didn't. They also found that critical evaluation behaviors declined when users were creating polished artifacts -- the better the output looks, the less people question it. That's a red flag for any training program that celebrates shiny deliverables without teaching skepticism. Anthropic's framework measures 24 behaviors from conversation logs; RANGE measures 5 dimensions from training artifacts. Different lenses, same conviction: behavioral measurement is the path forward.
Three Things to Steal
Even if you never use RANGE, here are three things you can take from this work today:
1. Replace at least one self-report measure with a behavioral one. Pick your highest-priority dimension. Instead of asking "how confident are you with AI prompting?" give people a task, score the output, and track the score over time. One behavioral measure is worth ten confidence surveys. If you want an easy starting point: have participants select from a set of problems at different difficulty levels. What they choose to attempt tells you more about their growth than what they say about their growth (a minimal scoring sketch follows this list).
2. Design for transfer or accept that it won't happen. The far-transfer effect size is essentially zero without deliberate intervention. If your training teaches AI for writing and you expect people to generalize to AI for data analysis on their own, you're hoping for something the evidence says doesn't happen. Add explicit bridging: "What principle from this exercise applies to your other work?" Make people practice the transfer, not just the skill.
3. Plan for decay from day one. The half-life of the skills you're teaching is about 6.5 months. If your program ends with a certificate and a survey, you've built a sandcastle. Build the reinforcement into the design: homework that requires real-world application, buddy pairs for accountability, a booster session at the 90-day mark, periodic prompt challenges. The training isn't the product. The sustained behavior change is.
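Here's the scoring sketch promised in point 1. The difficulty tiers and the two sub-scores are hypothetical -- calibrate them against your own rubric:

```python
# Minimal sketch of scoring a problem-selection exercise. Each record is
# one problem a participant chose to attempt during the session.
chosen = [
    {"difficulty": 3, "completed": True},   # stretch problem, shipped
    {"difficulty": 1, "completed": True},   # warm-up, shipped
    {"difficulty": 4, "completed": False},  # overreach, abandoned
]

def ambition(choices: list[dict]) -> float:
    """Average difficulty attempted (1 = safe, 5 = stretch)."""
    return sum(c["difficulty"] for c in choices) / len(choices)

def self_awareness(choices: list[dict]) -> float:
    """Fraction of attempted problems actually completed."""
    return sum(c["completed"] for c in choices) / len(choices)

print(f"ambition: {ambition(chosen):.2f}, "
      f"self-awareness: {self_awareness(chosen):.2f}")
# ambition: 2.67, self-awareness: 0.67
```

Tracked across sessions, the pair tells you whether someone is stretching further while staying honest about what they can finish -- exactly the Reach signal the survey can't give you.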
RANGE is a bet. It's a bet that AI fluency can be decomposed into dimensions that are independently measurable, that different profiles respond to different interventions, and that behavioral assessment outperforms self-report in a domain where self-report is maximally unreliable. Some of those bets will pay off. Some won't.
But the alternative -- continuing to fly on vibes, measuring satisfaction instead of capability, designing one-size-fits-all training for a multi-dimensional problem -- is a bet too. It's just a bet we've already been losing.
Build the instrument. Run the test. Let the data tell you what to build next.