An Empirically-Grounded Formalization of Traditional vs AI-Augmented Research Labs
We model a research lab over time $T$ (in years) consisting of $n_{PhD}$ PhD researchers and $n_{PI}=1$ principal investigator.
| Symbol | Description | Notes |
|---|---|---|
| T | Simulation time (years) | Typically T = 2 years |
| n_PhD | Number of PhD researchers | 3 in our simulation |
| λ₀ | Base task completion rate | Tasks per unit time per researcher |
| η | Learning rate | Efficiency gain per year |
Each researcher allocates their time across discrete tasks. Let $f_k$ be the fraction of time spent on task type $k$:
| Task k | f_k | Source |
|---|---|---|
| Boilerplate code | 0.15 | Feldon et al. 2017 |
| Custom code | 0.20 | Feldon et al. 2017 |
| Debugging | 0.10 | Feldon et al. 2017 |
| Writing | 0.15 | Feldon et al. 2017 |
| Literature review | 0.10 | Feldon et al. 2017 |
| Data analysis | 0.15 | Feldon et al. 2017 |
| Simulation | 0.10 | Feldon et al. 2017 |
| Validation | 0.05 | Model assumption |
The average speedup is the time-fraction-weighted sum of per-task speedups:
$$\bar{s} = \sum_{k} f_k \cdot s_k \tag{1}$$
| Task | Symbol | Value | Source |
|---|---|---|---|
| Boilerplate code | s_boiler | 1.55x | Peng et al. 2023 — RCT: 55.8% faster on simple tasks |
| Writing drafts | s_write | 1.40x | Noy & Zhang 2023 — RCT: +40% for below-median writers |
| Debugging | s_debug | 1.30x | Estimate based on Peng et al. finding |
| Literature review | s_lit | 1.30x | Estimate (no published RCT) |
| Data analysis | s_data | 1.15x | Conservative estimate |
| Custom code | s_custom | 1.00x | METR study — 0% for experienced devs |
| Simulation | s_sim | 1.00x | Conceptual: no AI for physical experiments |
| Validation | s_valid | 0.90x | Modeled: hallucination checking overhead |
Critically, the METR study found zero speedup for experienced developers on real-world tasks, which motivates setting $s_{custom} = 1.00$x.
Computing $\bar{s}$:
$$\bar{s} = 0.15(1.55) + 0.20(1.00) + 0.10(1.30) + 0.15(1.40) + 0.10(1.30) + 0.15(1.15) + 0.10(1.00) + 0.05(0.90) = 1.220$$
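Equation (1) can be verified with a short script; the two dictionaries below simply mirror the task-fraction and speedup tables above:

```python
# Weighted average AI speedup (Eq. 1): s_bar = sum_k f_k * s_k.
# Values are taken directly from the task-fraction and speedup tables.
fractions = {
    "boilerplate": 0.15, "custom_code": 0.20, "debugging": 0.10,
    "writing": 0.15, "lit_review": 0.10, "data_analysis": 0.15,
    "simulation": 0.10, "validation": 0.05,
}
speedups = {
    "boilerplate": 1.55, "custom_code": 1.00, "debugging": 1.30,
    "writing": 1.40, "lit_review": 1.30, "data_analysis": 1.15,
    "simulation": 1.00, "validation": 0.90,
}
s_bar = sum(fractions[k] * speedups[k] for k in fractions)
print(f"s_bar = {s_bar:.3f}")  # 1.220
```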
AI productivity gains are skill-biased (Noy & Zhang 2023): lower-skilled researchers gain more. With researcher skill $\sigma \in [0, 1]$, the effective speedup is
$$s_{eff}(\sigma) = s \cdot (1 + \alpha \cdot (1 - \sigma)) \tag{2}$$
With $\sigma \sim N(\mu_s, 0.15^2)$, mean skill $\mu_s = 0.5$, and skill-bias parameter $\alpha = 0.4$, linearity of Eq. (2) gives
$$E[s_{eff}] = \bar{s} \cdot (1 + \alpha \cdot (1 - \mu_s)) = 1.220 \cdot 1.20 = 1.464 \tag{3}$$
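A quick numeric check of Eq. (3), assuming $\alpha = 0.4$ (the value implied by the 1.20 multiplier). Because $s_{eff}$ is linear in $\sigma$, the expectation depends only on the mean skill, which a Monte Carlo draw confirms:

```python
import random

# Skill-biased effective speedup (Eq. 2): s_eff = s_bar * (1 + alpha*(1 - sigma)).
# Assumed parameters: s_bar = 1.220, alpha = 0.4, skill sigma ~ N(0.5, 0.15^2).
random.seed(0)
S_BAR, ALPHA, MU, SD = 1.220, 0.4, 0.5, 0.15

def s_eff(sigma: float) -> float:
    return S_BAR * (1 + ALPHA * (1 - sigma))

# Linearity: E[s_eff] = s_bar * (1 + alpha * (1 - mu)).
analytic = S_BAR * (1 + ALPHA * (1 - MU))  # 1.464
mc = sum(s_eff(random.gauss(MU, SD)) for _ in range(100_000)) / 100_000
print(f"analytic = {analytic:.3f}, monte carlo ≈ {mc:.3f}")
```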
Cumulative output for the traditional lab grows with a modest learning effect:
$$E_{trad}(T) = \lambda_0 \cdot T \cdot n_{PhD} \cdot (1 + \eta_{trad} \cdot T/2) \tag{4}$$
where $\eta_{trad} = 0.15$/year. The augmented lab multiplies in the skill-biased speedup and assumes a slightly faster learning rate:
$$E_{aug}(T) = \lambda_0 \cdot T \cdot n_{PhD} \cdot (1 + \eta_{aug} \cdot T/2) \cdot E[s_{eff}] \tag{5}$$
where $\eta_{aug} = 0.20$/year.
$$R(T) = \frac{E_{aug}(T)}{E_{trad}(T)} = \frac{1 + \eta_{aug}T/2}{1 + \eta_{trad}T/2} \cdot E[s_{eff}] \tag{6}$$
At $T = 2$ years:
$$R(2) = \frac{1.20}{1.15} \cdot 1.464 = 1.528$$
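Equation (6) can be evaluated directly; note that the $\lambda_0 \cdot T \cdot n_{PhD}$ terms cancel, so the ratio needs only the learning rates and the effective speedup:

```python
# Output ratio R(T) (Eq. 6): the lambda_0 * T * n_PhD factors cancel,
# leaving the learning-rate ratio times the skill-biased effective speedup.
ETA_TRAD, ETA_AUG, E_S_EFF = 0.15, 0.20, 1.464

def output_ratio(T: float) -> float:
    return (1 + ETA_AUG * T / 2) / (1 + ETA_TRAD * T / 2) * E_S_EFF

print(f"R(2) = {output_ratio(2):.3f}")  # ≈ 1.528
```

The learning-rate term contributes only ~4% at $T = 2$; almost all of the ratio comes from $E[s_{eff}]$.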
The model predicts an output ratio of approximately 1.5x, not the 3-5x claimed by vendors.
Research directions progress through a pipeline of stages; a direction yields a publication only if it survives every stage, so:
$$P_{pub} = p_{idea} \cdot p_{method} \cdot p_{exp} \cdot p_{analysis} \cdot p_{write} \cdot p_{submit} \cdot p_{accept} \tag{7}$$
| Stage | Probability |
|---|---|
| p_idea | 0.80 |
| p_method | 0.70 |
| p_exp | 0.60 |
| p_analysis | 0.50 |
| p_write | 0.70 |
| p_submit | 0.80 |
| p_accept | 0.65 |
$P_{pub} \approx 0.06$ (about 6% of directions become publications)
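Multiplying out the stage probabilities confirms the pipeline survival rate:

```python
import math

# Publication pipeline (Eq. 7): a direction must survive every stage,
# so P_pub is the product of the stage probabilities in the table above.
stage_probs = [0.80, 0.70, 0.60, 0.50, 0.70, 0.80, 0.65]
p_pub = math.prod(stage_probs)
print(f"P_pub = {p_pub:.4f}")  # 0.0612
```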
The expected number of publications for the augmented lab discounts output by the hallucination rate:
$$E[P]_{aug} = E_{aug}(T) \cdot P_{pub} \cdot (1 - p_{halluc}) \tag{8}$$
where $p_{halluc} \approx 0.10-0.15$ is the fraction of AI-assisted work invalidated by hallucinated content. Dead-end rates shift in both directions: AI avoids some traditional dead ends but introduces new ones via hallucination:
$$p_{dead,aug} = p_{dead,trad} \cdot (1 - \beta) + \gamma \cdot p_{halluc} \tag{9}$$
where $\beta$ is the fraction of traditional dead ends the AI helps avoid and $\gamma$ scales hallucination-induced dead ends.
| Parameter | Range | Effect on R |
|---|---|---|
| Mean skill μ_s | 0.3-0.7 | Lower skill → higher R |
| Skill-bias α | 0.2-0.6 | Higher α → higher R |
| Hallucination rate | 0.05-0.25 | Higher → lower R |
| Time T | 1-5 years | Longer T → higher R |
| Metric | Traditional | Augmented | Delta |
|---|---|---|---|
| Avg speedup | 1.00x | 1.22x | +22% |
| Skill-biased effective | 1.00x | 1.46x | +46% |
| Output ratio (T=2yr) | 1.00x | 1.53x | +53% |
The mathematical model predicts ~1.5x output factor, not 3-5x. This emerges from task-specific speedups, skill bias, and validation overhead — consistent with METR's finding of 0% gain for experienced developers.