www.lesswrong.com/posts/Mxsy7wYvsCRv5dGrw/tastybench-toward-measuring-research-t...
1 correction found
We pulled 200 papers published between January and March 2024 with at least 20 citations from Semantic Scholar with the query “reinforcement learning large language model llm rl”, and measured the average number of citations they accrued per month over the first 15 months since publication. We removed all papers that were not strictly algorithmic-level RL improvements aimed at math, coding, and reasoning tasks. This gave us a “ground truth” ranking of 38 papers ranked in order of citation velocity.
The linked “ground truth” CSV does not match this description: it contains papers with publication dates outside January–March 2024, and the file has 38 total lines including a header, which means 37 paper rows rather than 38 papers.
Full reasoning
The post links an immutable GitHub CSV as the claimed “ground truth” ranking. That file contradicts the description in two ways:
- It is not limited to January–March 2024 papers. The linked CSV includes entries with `pub_date` values such as 2025-03-18 (DAPO: An Open-Source LLM Reinforcement Learning System at Scale), 2025-01-22 (Kimi k1.5: Scaling Reinforcement Learning with LLMs), 2024-09-19 (Training Language Models to Self-Correct via Reinforcement Learning), 2023-08-17 (Reinforced Self-Training (ReST) for Language Modeling), and 2023-03-30 (Language Models can Solve Computer Tasks). Those dates fall outside the stated January–March 2024 window.
- The linked file does not contain 38 papers. GitHub shows the file has 38 lines total. The raw file begins with a header row (`paperId,title,pub_date,...`), so 38 total lines implies 37 data rows, i.e. 37 papers, not 38.
So, as written, the post's description of the linked “ground truth” dataset is incorrect: the linked file both includes out-of-range publication dates and appears to have 37 papers rather than 38.
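Both discrepancies are mechanically checkable. A minimal sketch of such a check, using Python's standard `csv` and `datetime` modules: the inline `raw_csv` string is a stand-in for the raw file fetched from the pinned commit (only a few of the rows cited above are reproduced here), and the January–March 2024 window is the one stated in the post.

```python
import csv
import io
from datetime import date

# Stand-in for the raw CSV at the pinned commit; the header and the three
# example rows are taken from the entries cited above.
raw_csv = """paperId,title,pub_date
id1,"DAPO: An Open-Source LLM Reinforcement Learning System at Scale",2025-03-18
id2,"Kimi k1.5: Scaling Reinforcement Learning with LLMs",2025-01-22
id3,"Language Models can Solve Computer Tasks",2023-03-30
"""

# DictReader consumes the header row, so len(rows) counts data rows only.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
n_papers = len(rows)

# Papers published outside the post's stated January–March 2024 window.
window_start, window_end = date(2024, 1, 1), date(2024, 3, 31)
out_of_range = [
    r["title"] for r in rows
    if not (window_start <= date.fromisoformat(r["pub_date"]) <= window_end)
]

print(f"{n_papers} papers; {len(out_of_range)} outside the window")
```

Run against the full raw file, the same two quantities would give the paper count (total lines minus the header) and the list of out-of-window publication dates.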
2 sources
- tastybench/model-elicitation/data/llm_rl_yix_curate.csv at 8cbde7a757d064c59229e2ace042d74804dd55b7 · parviam/tastybench · GitHub
GitHub shows this file has “38 lines (38 loc) · 18.3 KB.”
- llm_rl_yix_curate.csv (raw file at commit 8cbde7a757d064c59229e2ace042d74804dd55b7)
The raw CSV starts with the header `paperId,title,pub_date,...` and includes entries such as `DAPO... ,2025-03-18`, `Kimi k1.5... ,2025-01-22`, `Training Language Models to Self-Correct... ,2024-09-19`, `Reinforced Self-Training (ReST)... ,2023-08-17`, and `Language Models can Solve Computer Tasks,2023-03-30`.