LEARNING TIME: A TAXONOMY
- Feb 4

Why Some Things Take 10,000 Hours and Others Take 10 Seconds
A Framework for Understanding Task Complexity Across Biological and Artificial Intelligence
This paper is a companion to Critical Depth: When More Layers Unlock New Capabilities — which explores how neural network capabilities emerge suddenly at specific architectural depths, paralleling biological development. Here we ask: if certain tasks require minimum processing depth, what determines how long it takes to learn those depths?
The Observation
Not all learning is equal. Consider how long it takes humans to acquire different capabilities:

These timescales span six orders of magnitude — from seconds to decades. This isn't random. Something about the structure of these tasks determines how long they take to learn.
If we could understand why certain tasks require certain learning times, we might:
Predict how much data/compute AI systems need for different capabilities
Design more efficient training curricula
Understand which AI capabilities are "easy" vs. "fundamentally hard"
Proposed Dimensions of Task Complexity
What makes a task require more learning time? We propose several contributing factors:
1. Input Space Dimensionality
How many variables must the learner track?
Low dimensionality:
Thermostat: 1 variable (temperature)
Learning time: trivial
Medium dimensionality:
Chess board: 64 squares × ~6 piece types
Learning time: years to master
High dimensionality:
Visual scene understanding: millions of pixels
Social dynamics: dozens of people × histories × relationships
Learning time: years to decades
2. Exception Density
How many edge cases exist relative to the rule?
Low exception density:
Physics: F = ma works universally
Learning time: hours to grasp concept
High exception density:
English spelling: "i before e except after c" — but weird, seize, height...
Learning time: years of exposure
Extreme exception density:
Social norms: vary by culture, context, relationship, mood...
Learning time: lifetime, never fully complete
3. Temporal Credit Assignment
How far back in time must you look to understand what caused an outcome?
Immediate feedback:
Touch hot stove → pain (milliseconds)
Learning time: one trial
Short-term:
Chess move → capture piece (seconds to minutes)
Learning time: months
Medium-term:
Study habits → exam results (weeks)
Learning time: years to develop intuition
Long-term:
Career decisions → life outcomes (years to decades)
Learning time: often never fully learned
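In reinforcement learning terms, this dimension is the discounting problem: the further an outcome lies from the action that caused it, the weaker the learning signal attributing one to the other. A minimal sketch (the gamma value and the reward sequences are illustrative, not drawn from any particular system):

```python
def discounted_return(rewards, gamma=0.9):
    # Exponential discounting: the further a reward sits from the action
    # that caused it, the weaker the credit-assignment signal it carries.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Touch hot stove: the (painful) outcome arrives immediately, full strength.
immediate = discounted_return([-1.0])

# Long-horizon decision: the same-sized outcome arrives 20 steps later.
delayed = discounted_return([0.0] * 20 + [-1.0])
```

With gamma = 0.9, a consequence 20 steps away carries only about 12% of the weight of an immediate one — one intuition for why long-horizon lessons take so much longer to learn.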
4. Abstraction Depth
How many layers of abstraction must be built before the task makes sense?
This dimension connects directly to the "critical depth" phenomenon explored in our companion paper — where neural networks show sudden capability jumps at specific layer counts (4 layers: flailing, 16 layers: walking, 256 layers: vaulting).
Shallow abstraction:
Pattern matching: "red means stop"
Learning time: minutes
Medium abstraction:
Arithmetic: numbers → operations → relationships
Learning time: years
Deep abstraction:
Scientific reasoning: observations → hypotheses → theories → paradigms
Legal reasoning: facts → precedents → principles → justice
Learning time: decades of training
5. Implicit Knowledge Requirements
How much unstated background knowledge is assumed?
Low implicit knowledge:
Tic-tac-toe: rules are the complete specification
Learning time: minutes
Medium implicit knowledge:
Cooking: requires understanding of heat, texture, flavour chemistry
Learning time: months to years
High implicit knowledge:
Understanding humour: requires cultural context, social norms, linguistic nuance, shared history, emotional intelligence
Learning time: accumulates over lifetime
6. Feedback Sparsity
How often does the learner receive useful learning signals?
Dense feedback:
Video games: constant score updates
Learning time: hours to proficiency
Sparse feedback:
Parenting: outcomes visible 20+ years later
Learning time: never definitive
Adversarial feedback:
Poker: opponents actively disguise signals
Learning time: years to expert level
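To make the six dimensions concrete before combining them, here is an illustrative sketch — the class, the 0–3 scoring scale, and the example scores are all assumptions of ours, not measurements:

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Scores each proposed dimension from 0 (low) to 3 (extreme)."""
    dimensionality: int
    exception_density: int
    credit_span: int
    abstraction_depth: int
    implicit_knowledge: int
    feedback_sparsity: int

    def total(self) -> int:
        # A crude summed score; real interactions between the dimensions
        # are surely non-linear, but a sum is enough for rough comparison.
        return (self.dimensionality + self.exception_density
                + self.credit_span + self.abstraction_depth
                + self.implicit_knowledge + self.feedback_sparsity)

# Example scores, following the discussion above.
tic_tac_toe = TaskProfile(0, 0, 0, 0, 0, 0)     # rules are the full spec
theory_of_mind = TaskProfile(3, 3, 2, 3, 3, 3)  # near-maximal on most axes
```

The point of the sketch is only that tasks can be placed in a common space, not that these particular numbers are right.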
A Proposed Classification System
Combining these dimensions, we might classify tasks into rough categories:
These levels map suggestively onto the critical depths observed in the Wang et al. NeurIPS 2025 paper — where 4-layer networks could only flail, 16-layer networks could walk, and 256-layer networks could vault over obstacles.
Level 0: Reactive Tasks
Reflexes, simple pattern matching
Low dimensionality, immediate feedback, minimal abstraction
Learning time: seconds to hours
AI equivalent: simple classifiers, trained in minutes
Level 1: Skill Tasks
Motor skills, procedural knowledge, routine expertise
Medium dimensionality, short-term credit assignment
Learning time: tens to hundreds of hours
AI equivalent: game-playing agents, image classifiers
Level 2: Fluency Tasks
Language, reading, social basics
High dimensionality, many exceptions, implicit knowledge
Learning time: thousands of hours (2-5 years)
AI equivalent: language models, requiring billions of tokens
Level 3: Expertise Tasks
Professional domains, strategic thinking, complex judgment
Deep abstraction, sparse feedback, long-term consequences
Learning time: ~10,000 hours
AI equivalent: ???? (current frontier)
Level 4: Wisdom Tasks
Life navigation, ethical judgment, understanding one's place in the world
Maximum dimensionality, lifetime feedback horizons, all implicit
Learning time: decades, perhaps never complete
AI equivalent: not yet achieved
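A hypothetical banding function shows how summed dimension scores might map onto these levels (assuming six dimensions each scored 0–3, for a total of 0–18; the band boundaries are invented for illustration):

```python
def task_level(total_score: int) -> int:
    # Hypothetical banding of a summed dimension score (six dimensions,
    # each scored 0-3, so 0-18 overall) onto the five levels above.
    bands = [(2, 0),   # Level 0: reactive
             (6, 1),   # Level 1: skill
             (10, 2),  # Level 2: fluency
             (14, 3)]  # Level 3: expertise
    for upper, level in bands:
        if total_score <= upper:
            return level
    return 4           # Level 4: wisdom
```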
A Note on Right-Sizing
This taxonomy isn't just descriptive — it's prescriptive. If you've correctly identified a task as Level 1, you don't need a Level 3 system to solve it.
Most real-world AI applications are bounded, well-specified problems: classify this image, route this request, flag this transaction. These are Level 0-1 tasks. They don't require 100-layer networks, embodied feedback loops, or meta-learning. A simple classifier trained on labelled examples will do.
The deep architectures, multimodal integration, and sophisticated learning regimes we discuss in this paper are requirements for general intelligence and open-world tasks — not for all AI deployment. Matching system complexity to task complexity is itself a skill. Over-engineering a Level 1 problem with a Level 3 system wastes resources; under-engineering a Level 3 problem with a Level 1 system guarantees failure.
Moravec's Paradox: Why "Easy" is Hard
In the 1980s, roboticist Hans Moravec observed something counterintuitive:
"It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility."
What humans find "hard" (chess, mathematics, logic) is easy for AI. What humans find "easy" (walking, catching a ball, common sense) is hard. This seems paradoxical — until you consider why things feel easy or hard to us.
The resolution: Tasks feel "easy" precisely because we've mastered them so completely the effort is invisible. Walking draws on 500 million years of evolutionary refinement. Common sense draws on decades of embodied experience. The ease is an illusion created by mastery so deep it's unconscious.
But there's another dimension that Moravec's Paradox reveals:
Closed-World vs Open-World Problems

Chess and mathematics are closed-world problems: the complete state is given, the rules are explicit, and success is clearly defined. AI excels here because the problem is fully specified.
Walking and common sense are open-world problems: you operate with incomplete information, must predict what's hidden, and integrate noisy signals from multiple sources. Success requires filling gaps through imagination, simulation, and intuition built from experience.
This adds a crucial dimension to our taxonomy:

Human perception is multimodal by default. We can't disaggregate the data streams flowing into us — vision, sound, touch, proprioception, memory, imagination all blend continuously. This makes us naturally suited to open-world problems but makes those problems look easy.
AI systems, by contrast, typically receive curated, unimodal, complete inputs. They excel at closed-world problems because that's what they're given. The challenge isn't that AI is bad at "easy" tasks — it's that "easy" tasks are actually open-world problems requiring prediction, imagination, and integration of incomplete information across time.
The Theory of Mind Example
Theory of mind — the ability to attribute mental states to others — takes children 3-5 years to develop. Why so long?
What theory of mind requires:
Input dimensionality: Tracking multiple agents' beliefs, desires, knowledge states
Exceptions: People lie, deceive, have hidden motives, change their minds
Temporal span: Understanding requires remembering past interactions
Abstraction: Beliefs about beliefs (I think that she thinks that he knows...)
Implicit knowledge: Cultural norms, emotional understanding, context
Feedback: Sparse — you rarely learn definitively if you modelled someone correctly
This places theory of mind solidly at Level 2-3 complexity. A child needs thousands of social interactions before the patterns stabilise enough to generalise.
Rough calculation:
Waking hours age 0-5: ~5 years × 365 days × 10 hours = 18,250 hours
Social interaction time: perhaps 30% = ~5,500 hours
Relevant learning experiences: perhaps 20% = ~1,100 hours
This aligns with Level 2-3 learning requirements.
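The rough calculation above, expressed directly in code (the 10 hours/day, 30%, and 20% figures are the same back-of-envelope assumptions as in the text):

```python
waking_hours = 5 * 365 * 10            # ages 0-5, ~10 waking hours per day
social_hours = waking_hours * 0.30     # ~30% spent in social interaction
relevant_hours = social_hours * 0.20   # ~20% carrying a useful learning signal
```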
Implications for AI Training Efficiency
This framework suggests several insights:
1. Current AI is inefficient for some task types
Humans develop theory of mind from ~1,100 hours of relevant experience. Current AI systems require vastly more data for comparable social reasoning. This suggests we're missing architectural or algorithmic efficiencies that biology has discovered.
2. Some capabilities may require irreducible amounts of experience
If a task has high exception density and sparse feedback, there may be no shortcut. You simply need to encounter enough cases. This might set a floor on training data requirements regardless of architectural improvements.
3. Curriculum matters
Humans don't learn randomly — we progress through developmental stages. A child learns object permanence before theory of mind, concrete operations before abstract reasoning. The 10,000 hour rule assumes deliberate practice with appropriate challenges.
AI training could benefit from similar curricula: start with simpler versions of a task and progressively increase complexity.
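A generic curriculum loop might look like the following sketch — `train_step`, `evaluate`, and the mastery threshold are hypothetical placeholders for whatever the real training system provides:

```python
def run_curriculum(train_step, evaluate, stages, mastery=0.9, max_epochs=100):
    # Stay on each stage until evaluation passes a mastery threshold,
    # then advance. Returns (stage, final score) pairs.
    history = []
    for stage in stages:
        for _ in range(max_epochs):
            train_step(stage)
            score = evaluate(stage)
            if score >= mastery:
                break
        history.append((stage, score))
    return history

# Toy demonstration: each training step adds a fixed amount of proficiency.
skills = {}
def train_step(stage):
    skills[stage] = skills.get(stage, 0.0) + 0.25
def evaluate(stage):
    return skills[stage]

history = run_curriculum(train_step, evaluate, ["easy", "medium", "hard"])
```

The contrast with random exposure is that no time is spent on stages already mastered or far beyond current ability.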
4. Transfer learning is the key efficiency
Humans don't learn each domain from scratch. A doctor learning surgery benefits from years of prior learning: biology, anatomy, motor skills, judgment under pressure. The "10,000 hours" builds on a foundation of 100,000+ hours of prior general learning.
AI systems that can transfer effectively between domains should show dramatic efficiency gains.
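As a toy model only (the 70% overlap figure is invented for illustration), transfer can be thought of as prior learning covering some fraction of a new domain, leaving only the remainder to be learned from scratch:

```python
def hours_to_mastery(task_hours, prior_overlap=0.0):
    # Toy model: transferable prior learning covers `prior_overlap` of
    # the task; only the uncovered remainder must be learned fresh.
    return task_hours * (1.0 - prior_overlap)

from_scratch = hours_to_mastery(10_000)                     # no foundation
with_foundation = hours_to_mastery(10_000, prior_overlap=0.7)
```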
The Depth Connection
This connects back to the "critical depth" paper. Perhaps:
Level 0 tasks require shallow networks (few processing steps)
Level 1 tasks require moderate depth
Level 2-3 tasks require deep networks (1000+ layers?)
Level 4 tasks require depth we haven't yet achieved — or fundamentally different architectures
The sudden capability jumps at "critical depths" might correspond to these task levels. A 4-layer network can do Level 0 tasks. You need 16 layers for Level 1. You need 64+ layers for Level 2. And so on.
Part II: What The Simple Taxonomy Misses
The framework above treats learning as accumulating hours of exposure. But recent developments in AI — particularly large language models — have revealed surprises that suggest learning is more complex than "data in, capability out."
These aren't lessons about LLMs specifically. They're lessons about learning itself, revealed through the lens of systems that learn at scale.
Lesson 1: Capabilities Emerge Unexpectedly
When researchers built transformer architectures for language modelling, they chose a simple objective: predict the next token. But as these models scaled, capabilities emerged that nobody explicitly trained for:
In-context learning (learning from examples in the prompt)
Chain-of-thought reasoning
Code generation
Cross-lingual transfer
This connects directly to the critical depth paper. The Princeton researchers found that scaling depth in reinforcement learning produced sudden capability jumps — behaviours that didn't exist at all in shallow networks suddenly appeared at critical thresholds.
The general principle: Learning systems can develop capabilities that weren't explicitly targeted. Architecture + scale + objective interact in ways we don't fully understand. This applies to biological learning too — a child learning language somehow also learns logic, categorisation, and social reasoning along the way.

Current language models cluster around 60-100 layers. The Princeton paper achieved 1024 layers for robotics. Does language need less depth, or have we not yet explored what deeper language models could do?
Two interpretations:
Language may need less depth — Text is already compressed; someone else did the hard work of converting messy reality into clean symbolic sequences. Physics is unforgiving in ways language isn't.
Or we're at a threshold — GPT-3's emergent capabilities at 96 layers roughly correspond to the 64-256 layer critical range in the RL paper. Scaling to 500+ layers might unlock capabilities we can't anticipate.
The deeper question: Both robotics and language face the same fundamental challenge — building a model of the world where actions are sensible and "correct." For robots, physics provides ground truth. For language, what's "correct"? The next token objective doesn't distinguish true from false. Perhaps the lesson from the Princeton paper isn't about depth but about objectives that connect predictions to reality.
Lesson 2: Different Objectives Build Different World Models
The taxonomy above assumes a single type of "learning." But different systems are designed to predict fundamentally different things:

Each captures something different. A language model trained on text builds a world model — but it's a world model of what humans write, not of physics or causation directly. It learns that "the ball fell" often follows "she dropped the ball" — but not why, not through experiencing gravity.
The general principle: The objective function determines what kind of world model emerges. This applies beyond AI — a child learning through play builds different intuitions than one learning through instruction. An apprentice watching a master develops different knowledge than one reading a textbook.
Key insight: The Princeton paper succeeded because contrastive RL (classification: "which future state matches?") scaled better than traditional RL (regression: "what value?"). The objective and architecture must match. This suggests our taxonomy needs another dimension: what kind of prediction is the learning system optimising for?
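The classification framing can be sketched in a few lines — this is a generic softmax-over-candidates illustration of "which future state matches?", not the Princeton implementation, and the toy embeddings are invented:

```python
import math

def match_probabilities(state, candidates):
    # Contrastive framing as classification: score each candidate future
    # by dot product with the state, then softmax over the candidates.
    logits = [sum(s * c for s, c in zip(state, cand)) for cand in candidates]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

state = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # the true future: aligned with the state
    [-0.8, 0.6],   # distractor
    [0.0, -1.0],   # distractor
]
probs = match_probabilities(state, candidates)
```

The learning signal is "pick the right candidate" rather than "regress to the right value" — a classification objective rather than a regression one.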
Lesson 3: "Basic" Tasks May Be Level 3-4 Mastery Problems
This lesson extends Moravec's Paradox from Part I. A persistent criticism of current AI is that it struggles with tasks humans consider "basic" — common sense reasoning, critical thinking, detecting obvious falsehoods. But as we saw, this framing is backwards.
Common sense is neither common nor easy. In our taxonomy, it's a Level 3-4 mastery task — an open-world problem requiring decades of multimodal experience with long-term feedback loops:
Physical intuition: Understanding that a heavy object will break a glass table comes from years of interacting with objects, experiencing weight, seeing things break, getting hurt.
Social reasoning: Knowing that a job interviewer's "we'll be in touch" probably means rejection comes from dozens of social interactions with delayed, often ambiguous feedback.
Causal reasoning: Understanding that wet roads cause accidents (not the reverse) comes from experiencing weather, driving, near-misses — integrated over years.
Critical thinking: Recognising a scam requires having been deceived, having trusted wrongly, having seen others burned — feedback that arrives months or years after the initial decision.
The general principle: What looks "basic" is often the residue of thousands of hours of embodied experience with real consequences. A system trained on text sees the outputs of human reasoning but not the inputs — the years of feedback loops that shaped it.
Consider the training signals:

This reframes AI's "failures": These aren't failures at basic tasks. They're the predictable result of missing the feedback loops that make information matter. Adding more text won't solve this — you need:
Embodied consequences (physical intuition requires a body that can be hurt)
Long-term feedback (critical thinking requires seeing beliefs proven wrong over time)
Real stakes (common sense requires caring about outcomes)
This suggests that "common sense" might emerge with sufficient depth + the right objectives + appropriate curriculum — but probably can't be learned from text alone, regardless of scale.
Lesson 4: Learning is Multimodal and Integrated
A human child learning about the world doesn't train on separate modalities sequentially. Everything happens simultaneously, with rich cross-modal connections:

Watching someone ride a bike while feeling your own balance while hearing the wheels while imagining yourself doing it — all happening simultaneously, all reinforcing each other.
The general principle: Deep integration — where seeing a lemon makes you taste sourness, where hearing a word conjures visual imagery — creates richer representations than processing modalities separately. A child learning "ball" sees balls, touches balls, throws balls, hears the word, says the word, watches balls bounce, feels disappointed when one rolls away. Thousands of interlocking experiences.
This might explain efficiency gaps between biological and artificial learning. The same "hours of exposure" produce different outcomes depending on how tightly integrated the learning signals are.
Lesson 5: Learning Changes How You Learn
Perhaps the most profound observation: learning is not static accumulation — it's iterative refinement of the learning process itself.
Stage 1: Unconscious Incompetence
You don't know what you don't know. You can't even formulate good questions.
Novice chess player: "How do I win?"
- Objective: vague
- Feedback: "I lost" (sparse, uninformative)
- Questions: wrong level of abstraction
Stage 2: Conscious Incompetence
You start to see what you're missing. Questions become possible.
Improving player: "Why did that trade hurt me?"
- Objective: more specific
- Feedback: can identify key moments
- Questions: targeted at specific weaknesses
Stage 3: Conscious Competence
You can do things correctly with effort. Objectives are clear.
Intermediate player: "I need to control the centre before attacking"
- Objective: clear sub-goals
- Feedback: nuanced evaluation
- Questions: strategic and deep
Stage 4: Unconscious Competence
Skills become automatic. Attention freed for higher-level concerns.
Master player: "The position 'feels' wrong"
- Objective: intuitive pattern recognition
- Feedback: internal sense of quality
- Questions: about opponent's psychology, tournament strategy
The general principle: At each stage, the learner develops better objectives, better reward signals, finer control, and more sophisticated questions. This is meta-learning — learning how to learn. And it compounds over time.
This is Ericsson's "deliberate practice" research: 10,000 hours only produces expertise if those hours involve increasingly sophisticated, targeted practice at the edge of current ability. Same hours, vastly different outcomes depending on whether meta-learning is active.
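A toy simulation captures the point — the gain values and the "edge of ability" window are invented for illustration, but the qualitative gap is the one Ericsson describes: the same hours produce very different outcomes depending on whether practice tracks current ability:

```python
import random

def session_gain(skill, difficulty):
    # Gains are largest when difficulty sits just above current skill
    # (the "edge of ability"); too easy or too hard teaches little.
    gap = difficulty - skill
    return 0.002 if 0.0 <= gap <= 0.1 else 0.0002

def train(hours, deliberate, seed=0):
    rng = random.Random(seed)
    skill = 0.0
    for _ in range(hours):
        # Deliberate practice targets the edge; passive exposure is random.
        difficulty = skill + 0.05 if deliberate else rng.random()
        skill = min(1.0, skill + session_gain(skill, difficulty))
    return skill

deliberate_skill = train(1_000, deliberate=True)
passive_skill = train(1_000, deliberate=False)
```

Under these (invented) numbers, deliberate practice reaches the ceiling well within the budget while random exposure lags far behind.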
Bringing It Together: The Ingredients for Learning
The five lessons above — combined with Moravec's Paradox — suggest that effective learning requires several ingredients working together:
1. Sufficient Depth
The critical depth paper shows that some capabilities only emerge at certain processing depths. You can't shortcut this — a 4-layer network cannot do what a 256-layer network does, regardless of training time.
2. Appropriate Objectives
The objective function determines what world model emerges. Next-token prediction builds a model of text. Future-state prediction builds a model of dynamics. Contrastive learning builds a model of similarity. The objective must match what you actually want the system to learn.
3. Meaningful Feedback Loops
Learning requires feedback that connects predictions to consequences. Text-only training lacks the "costly mistake" signal that shapes common sense. Embodied learning, long-term consequences, and real stakes create feedback that makes information matter.
4. Multimodal Integration
Richer representations emerge from integrated cross-modal learning. Separate processing of vision, language, and action misses the deep connections that make human learning efficient. This is also what enables handling open-world problems — filling gaps requires drawing on multiple sources simultaneously.
5. Open-World Capability
Real tasks involve incomplete information, hidden state, and the need to predict what's missing. Systems must learn to operate under uncertainty, simulate possibilities, and integrate partial signals — not just solve fully-specified problems.
6. Curriculum Progression
Learning should progress from simple to complex, with the curriculum adapting based on what has been mastered. Random exposure is less efficient than structured progression.
7. Meta-Learning
The system should get better at learning itself — refining objectives, improving feedback interpretation, asking better questions. Static training (fixed objective, random data) misses this compounding effect.
The synthesis: Current AI systems typically have some of these ingredients but not all:

This suggests why language models are impressive but incomplete. They have depth and scale, but objectives that don't connect to reality, weak feedback loops, limited multimodal integration, inputs that are too complete and clean, no curriculum, and no meta-learning.
The path forward might not be "more of the same" but rather assembling these ingredients properly: deeper architectures with objectives grounded in reality, trained on integrated multimodal data with meaningful feedback, exposed to open-world uncertainty, progressing through curricula, with systems that learn how to learn.
Open Questions
Can systems refine their own objectives? What would it mean for a loss function to become "more nuanced" during training?
What is the minimal multimodal integration needed? Do you need full embodiment, or can strategic cross-modal training provide most of the benefit?
Is meta-learning the key missing ingredient? If systems could learn how to learn, would efficiency gaps close dramatically?
Can emergent capabilities be predicted? We don't fully understand what architectures + objectives + scale will produce. Is this inherently unpredictable, or are we missing a theory?
How do we ground objectives in reality? The Princeton paper succeeded with contrastive learning because it connected predictions to verifiable outcomes. What's the equivalent for language and reasoning?
How do we train on open-world problems? Current AI receives curated, complete inputs. How do we expose systems to the partial information, hidden state, and need for imagination that characterises real-world tasks?
Practical Implications: Right Tool for the Right Job
This paper has focused on what's needed for general intelligence and Level 3-4 tasks. But it's worth stepping back: most AI deployment doesn't need this.
Not every problem requires deep intelligence.
The taxonomy suggests a matching principle:

Deploying a 100-billion parameter model for spam detection is like hiring a PhD physicist to read a thermometer. It works, but it's wasteful — and the added complexity may introduce failure modes that a simpler system wouldn't have.
Composition vs. Monolithic Systems
There's an alternative to building one massive general system: compose multiple fit-for-purpose components. A workflow might use:
A simple classifier to route requests (Level 0)
A specialised model for domain-specific analysis (Level 1-2)
A general model only for genuinely open-ended reasoning (Level 3+)
This mirrors how organisations work. Not every employee needs to be a generalist genius. Most work is accomplished by people with specific, bounded competencies, coordinated effectively. The CEO doesn't personally handle every task — they orchestrate specialists.
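Such a composed workflow might be dispatched by something as simple as the following sketch (all component names and request kinds here are hypothetical):

```python
def route(request):
    # Hypothetical dispatcher: send each request to the cheapest
    # component able to handle it, per the matching principle above.
    if request["kind"] == "spam_check":
        return "simple_classifier"   # Level 0: bounded, well-specified
    if request["kind"] in ("invoice_parse", "sentiment"):
        return "domain_model"        # Level 1-2: specialised competence
    return "general_model"           # Level 3+: genuinely open-ended
```

The expensive general model is reserved for the small fraction of requests that actually need it.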
When Do You Need General Intelligence?
The full ingredient list — depth, grounded objectives, meaningful feedback, multimodal integration, open-world handling, curriculum, meta-learning — is required when:
The task is genuinely open-world (incomplete information, need for prediction)
The domain has high exception density and sparse feedback
Transfer across contexts is required
The system must handle situations it wasn't explicitly trained for
If your application doesn't have these characteristics, simpler approaches will likely outperform complex ones — they're cheaper, faster, more interpretable, and have fewer failure modes.
The Taxonomy as a Design Tool
Before building or deploying AI, ask:
What level is this task, honestly assessed?
Is it closed-world or open-world?
What feedback loops exist?
Does it require transfer and generalisation, or is it bounded?
Match your system to these answers. The goal isn't maximum intelligence — it's appropriate intelligence for the job to be done.
Conclusion
The simple taxonomy — task complexity determines learning time — captures something real. Level 0 tasks take seconds; Level 4 tasks take decades. Depth matters; you can't skip stages.
But learning time alone doesn't explain efficiency gaps or capability emergence. The lessons from scaling AI systems reveal that learning requires multiple ingredients working together:
Depth to enable sufficient processing
Objectives that connect predictions to reality
Feedback loops that make information matter
Multimodal integration that builds rich representations
Open-world capability that handles incomplete information
Curriculum that structures progression
Meta-learning that compounds improvement
Current AI systems have achieved remarkable results with only some of these ingredients — primarily depth and scale. The next advances may come not from more of the same, but from assembling the missing pieces.
The 10,000 hour rule isn't about 10,000 hours of passive exposure. It's about 10,000 hours of increasingly sophisticated, embodied, multimodal, meta-learning-enhanced deliberate practice with meaningful feedback.
Building systems that learn like that — rather than like a static function being fitted to data — is the research programme suggested by taking both critical depth and learning time seriously.
This is a companion paper to Critical Depth: When More Layers Unlock New Capabilities. Together, they suggest that intelligence — biological or artificial — requires both sufficient processing depth AND the right learning conditions to build that depth. The critical depth paper asks "how deep?" This paper asks "how should we learn?"
This is a speculative framework for discussion. The specific mechanisms proposed are hypotheses to be tested. The goal is to move beyond "more data, more compute" toward understanding the structure of learning itself.