Academic Papers
Research
Landmark papers in prompt engineering, LLMs, agents, RAG, and AI safety — with plain-English summaries and links to related practical lessons.
Foundations
The architectural and scaling breakthroughs that made modern LLMs possible.
Attention Is All You Need
Vaswani et al. · 2017 · NeurIPS
Introduced the transformer architecture — the backbone of every modern LLM. Replaced recurrent networks with self-attention, enabling full parallelization during training and the ability to model long-range dependencies. Every model you use today (GPT-4o, Claude, Gemini) is a transformer.
Scaling Laws for Neural Language Models
Kaplan et al. (OpenAI) · 2020 · arXiv
Showed that model performance follows predictable power-law relationships with model size, dataset size, and compute budget. This paper is why the industry spent the next four years scaling up — it provided a mathematical basis for predicting that larger models would reliably be better.
Language Models are Few-Shot Learners (GPT-3)
Brown et al. (OpenAI) · 2020 · NeurIPS
Introduced GPT-3 and demonstrated that sufficiently large language models can perform new tasks from just a few examples in the prompt — without any fine-tuning. Established few-shot prompting as a first-class capability and showed that scale unlocks qualitatively new behaviors.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
Ouyang et al. (OpenAI) · 2022 · NeurIPS
Showed how RLHF (Reinforcement Learning from Human Feedback) dramatically improves the usability and safety of language models. A 1.3B InstructGPT model outperformed a 175B GPT-3 model on human preference evaluations. The foundational paper behind ChatGPT and modern instruction-tuned models.
Finetuned Language Models Are Zero-Shot Learners (FLAN)
Wei et al. (Google Brain) · 2021 · ICLR
Demonstrated that instruction-tuning — fine-tuning on a diverse set of tasks described in natural language — substantially improves zero-shot performance. Showed that the instruction format (how you describe a task in the prompt) matters as much as the task itself.
Prompting Techniques
Papers that defined the core techniques practiced in prompt engineering today.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. (Google Brain) · 2022 · NeurIPS
Introduced few-shot chain-of-thought prompting: providing worked-out reasoning examples in the prompt to elicit step-by-step problem-solving from the model. Showed dramatic accuracy improvements on math, commonsense, and symbolic reasoning benchmarks. The most-cited prompting technique paper.
Large Language Models are Zero-Shot Reasoners
Kojima et al. · 2022 · NeurIPS
Showed that simply appending 'Let's think step by step' to a prompt (zero-shot CoT) substantially improves LLM performance on reasoning tasks — no examples needed. One of the most impactful and immediately practical findings in prompting research.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Wang et al. (Google Research) · 2022 · ICLR
Introduced self-consistency: generating multiple reasoning chains via sampling, then taking a majority vote on the final answer. Improves over greedy CoT by averaging out reasoning errors. A simple, model-agnostic technique that consistently boosts accuracy on hard reasoning tasks.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Yao et al. · 2023 · NeurIPS
Generalized CoT by allowing models to explore multiple reasoning branches, evaluate intermediate thoughts, and backtrack — like a search tree rather than a linear chain. Significantly outperforms CoT on tasks requiring exploration, such as the 24-point card game and creative writing.
ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al. · 2022 · ICLR
Introduced the ReAct prompting pattern: alternating between Thought (reasoning), Action (tool call), and Observation (processing the result). Enabled LLMs to reliably use external tools like Wikipedia and calculators. The canonical pattern for modern AI agents.
Automatic Chain of Thought Prompting in Large Language Models
Zhang et al. · 2022 · ICLR
Showed that manually crafting CoT examples can be automated by clustering questions and using zero-shot CoT to generate demonstrations automatically. Removes the human effort from few-shot CoT construction without sacrificing accuracy.
Self-Refine: Iterative Refinement with Self-Feedback
Madaan et al. · 2023 · NeurIPS
A prompting framework where the model generates an initial output, then critiques it, then refines it — iteratively. No additional training required. Improves outputs across code generation, math, and text tasks by leveraging the model's own ability to identify and fix errors.
Agents & Tool Use
Research on building AI systems that reason, act, and use external tools.
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick et al. (Meta AI) · 2023 · NeurIPS
Showed that LLMs can be taught to use external APIs (calculator, search engine, calendar) by self-supervised fine-tuning on their own generated examples of successful tool use. Foundational for the tool-use capabilities in modern models.
HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in HuggingFace
Shen et al. · 2023 · NeurIPS
Demonstrated using an LLM as a task planner to coordinate specialized AI models (image generators, speech models, etc.) as tools. An early example of multi-modal, multi-model agent orchestration and the LLM-as-controller pattern.
Cognitive Architectures for Language Agents
Sumers et al. · 2023 · TMLR
A survey that taxonomizes AI agent architectures using the lens of cognitive science — memory, action, planning, and execution. Useful conceptual framework for understanding and designing production agent systems.
Retrieval & Context
Papers on RAG, long context, and how models use the information they're given.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al. (Meta AI / UCL) · 2020 · NeurIPS
Introduced RAG: combining a retrieval model (finding relevant documents) with a generation model (producing the answer). Showed that grounding generation in retrieved documents dramatically reduces hallucinations and improves factual accuracy. The paper that launched the modern RAG industry.
Lost in the Middle: How Language Models Use Long Contexts
Liu et al. · 2023 · TACL
Showed that LLMs perform significantly worse when relevant information is placed in the middle of a long context — they recall well from the beginning and end, but 'lose' information in the middle. Has major implications for RAG chunk ordering and prompt construction strategies.
In-Context Retrieval-Augmented Language Models
Shi et al. · 2023 · TACL
Analyzed how effectively LLMs actually use retrieved context vs. their parametric knowledge. Found that models sometimes ignore retrieved evidence and hallucinate from training data. Motivates careful prompt design to explicitly instruct models to rely on provided context.
Safety & Alignment
Research on making AI systems safe, honest, and resistant to misuse.
Constitutional AI: Harmlessness from AI Feedback
Bai et al. (Anthropic) · 2022 · arXiv
Introduced Constitutional AI (CAI): using a set of human-written principles (a 'constitution') and AI-generated critiques to train harmless models without needing as many human labels. The approach behind Claude's safety training.
Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Liu et al. · 2023 · IEEE S&P
Systematic study of prompt injection attacks in real-world LLM applications — how malicious content in user input or retrieved documents can override system instructions. Evaluated defenses and found that no existing defense is fully robust. Essential reading for anyone building production AI systems.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Lin et al. · 2022 · ACL
Introduced TruthfulQA, a benchmark of questions designed to elicit model hallucinations — questions where humans commonly hold false beliefs. Showed that larger models are more confidently wrong on some question types, challenging the assumption that scale improves truthfulness.
Jailbroken: How Does LLM Safety Training Fail?
Wei et al. · 2023 · NeurIPS
Analyzed why safety fine-tuning fails against jailbreaks. Identified two root causes: competing objectives (the model trained to be helpful and harmless has these objectives in tension) and mismatched generalization (safety training covers fewer scenarios than attack surface). Explains why no model is jailbreak-proof.
Evaluation
Benchmarks and frameworks for measuring LLM and prompting system performance.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)
Srivastava et al. (204-author collaboration) · 2022 · TMLR
A massive collaborative benchmark of 204 tasks covering reasoning, code, math, linguistics, and more — designed to test capabilities beyond existing benchmarks. Introduced BIG-Bench Hard (BBH), the subset that frontier models still struggle with.
Holistic Evaluation of Language Models (HELM)
Liang et al. (Stanford CRFM) · 2022 · NeurIPS
A framework for multidimensional evaluation of LLMs across accuracy, calibration, robustness, fairness, efficiency, and more. Emphasizes that no single metric is sufficient for comparing models — the prompt format used during evaluation substantially affects results.
Large Language Models are not Fair Evaluators
Wang et al. · 2023 · ACL
Showed that when using LLMs as judges (LLM-as-judge evaluation), the order in which candidates are presented significantly biases the result — models consistently favor the first or second option. Critical methodological reading for anyone building LLM-based evaluation systems.
Put the Theory into Practice
The Learn tracks cover every technique from these papers with practical examples and exercises.
Start Learning