Academic Papers

Research

Landmark papers in prompt engineering, LLMs, agents, RAG, and AI safety — with plain-English summaries and links to related practical lessons.

Foundations Prompting Techniques Agents & Tool Use Retrieval & Context Safety & Alignment Evaluation

Foundations

The architectural and scaling breakthroughs that made modern LLMs possible.

Attention Is All You Need

Vaswani et al. · 2017 · NeurIPS

arXiv:1706.03762

Introduced the transformer architecture — the backbone of every modern LLM. Replaced recurrent networks with self-attention, enabling full parallelization during training and the ability to model long-range dependencies. Every model you use today (GPT-4o, Claude, Gemini) is a transformer.

ArchitectureTransformerAttention→ How LLMs Work

Scaling Laws for Neural Language Models

Kaplan et al. (OpenAI) · 2020 · arXiv

arXiv:2001.08361

Showed that model performance follows predictable power-law relationships with model size, dataset size, and compute budget. This paper is why the industry spent the next four years scaling up — it provided a mathematical basis for predicting that larger models would reliably be better.

ScalingTrainingCompute

Language Models are Few-Shot Learners (GPT-3)

Brown et al. (OpenAI) · 2020 · NeurIPS

arXiv:2005.14165

Introduced GPT-3 and demonstrated that sufficiently large language models can perform new tasks from just a few examples in the prompt — without any fine-tuning. Established few-shot prompting as a first-class capability and showed that scale unlocks qualitatively new behaviors.

Few-ShotIn-Context LearningGPT-3→ Few-Shot Prompting Lesson

Training Language Models to Follow Instructions with Human Feedback (InstructGPT)

Ouyang et al. (OpenAI) · 2022 · NeurIPS

arXiv:2203.02155

Showed how RLHF (Reinforcement Learning from Human Feedback) dramatically improves the usability and safety of language models. A 1.3B InstructGPT model outperformed a 175B GPT-3 model on human preference evaluations. The foundational paper behind ChatGPT and modern instruction-tuned models.

RLHFInstruction TuningAlignment

Finetuned Language Models Are Zero-Shot Learners (FLAN)

Wei et al. (Google Brain) · 2021 · ICLR

arXiv:2109.01652

Demonstrated that instruction-tuning — fine-tuning on a diverse set of tasks described in natural language — substantially improves zero-shot performance. Showed that the instruction format (how you describe a task in the prompt) matters as much as the task itself.

Instruction TuningZero-ShotFine-Tuning

Prompting Techniques

Papers that defined the core techniques practiced in prompt engineering today.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al. (Google Brain) · 2022 · NeurIPS

arXiv:2201.11903

Introduced few-shot chain-of-thought prompting: providing worked-out reasoning examples in the prompt to elicit step-by-step problem-solving from the model. Showed dramatic accuracy improvements on math, commonsense, and symbolic reasoning benchmarks. The most-cited prompting technique paper.

Chain of ThoughtReasoningFew-Shot→ Chain of Thought Lesson

Large Language Models are Zero-Shot Reasoners

Kojima et al. · 2022 · NeurIPS

arXiv:2205.11916

Showed that simply appending 'Let's think step by step' to a prompt (zero-shot CoT) substantially improves LLM performance on reasoning tasks — no examples needed. One of the most impactful and immediately practical findings in prompting research.

Chain of ThoughtZero-ShotReasoning→ Chain of Thought Lesson

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang et al. (Google Research) · 2022 · ICLR

arXiv:2203.11171

Introduced self-consistency: generating multiple reasoning chains via sampling, then taking a majority vote on the final answer. Improves over greedy CoT by averaging out reasoning errors. A simple, model-agnostic technique that consistently boosts accuracy on hard reasoning tasks.

Self-ConsistencyChain of ThoughtSampling

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al. · 2023 · NeurIPS

arXiv:2305.10601

Generalized CoT by allowing models to explore multiple reasoning branches, evaluate intermediate thoughts, and backtrack — like a search tree rather than a linear chain. Significantly outperforms CoT on tasks requiring exploration, such as the 24-point card game and creative writing.

Tree of ThoughtSearchPlanning→ Tree of Thought Lesson

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al. · 2022 · ICLR

arXiv:2210.03629

Introduced the ReAct prompting pattern: alternating between Thought (reasoning), Action (tool call), and Observation (processing the result). Enabled LLMs to reliably use external tools like Wikipedia and calculators. The canonical pattern for modern AI agents.

AgentsTool UseReAct→ AI Agents Track

Automatic Chain of Thought Prompting in Large Language Models

Zhang et al. · 2022 · ICLR

arXiv:2210.11610

Showed that manually crafting CoT examples can be automated by clustering questions and using zero-shot CoT to generate demonstrations automatically. Removes the human effort from few-shot CoT construction without sacrificing accuracy.

Chain of ThoughtAutomationFew-Shot

Self-Refine: Iterative Refinement with Self-Feedback

Madaan et al. · 2023 · NeurIPS

arXiv:2303.17651

A prompting framework where the model generates an initial output, then critiques it, then refines it — iteratively. No additional training required. Improves outputs across code generation, math, and text tasks by leveraging the model's own ability to identify and fix errors.

RefinementSelf-FeedbackIterative

Agents & Tool Use

Research on building AI systems that reason, act, and use external tools.

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick et al. (Meta AI) · 2023 · NeurIPS

arXiv:2302.04761

Showed that LLMs can be taught to use external APIs (calculator, search engine, calendar) by self-supervised fine-tuning on their own generated examples of successful tool use. Foundational for the tool-use capabilities in modern models.

Tool UseAgentsFunction Calling→ Function Calling Lesson

HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in HuggingFace

Shen et al. · 2023 · NeurIPS

arXiv:2303.04671

Demonstrated using an LLM as a task planner to coordinate specialized AI models (image generators, speech models, etc.) as tools. An early example of multi-modal, multi-model agent orchestration and the LLM-as-controller pattern.

Multi-AgentOrchestrationPlanning→ Multi-Agent Systems

Cognitive Architectures for Language Agents

Sumers et al. · 2023 · TMLR

arXiv:2309.02427

A survey that taxonomizes AI agent architectures using the lens of cognitive science — memory, action, planning, and execution. Useful conceptual framework for understanding and designing production agent systems.

AgentsMemoryArchitecture→ AI Agents Track

Retrieval & Context

Papers on RAG, long context, and how models use the information they're given.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis et al. (Meta AI / UCL) · 2020 · NeurIPS

arXiv:2005.11401

Introduced RAG: combining a retrieval model (finding relevant documents) with a generation model (producing the answer). Showed that grounding generation in retrieved documents dramatically reduces hallucinations and improves factual accuracy. The paper that launched the modern RAG industry.

RAGRetrievalGrounding→ RAG Guide

Lost in the Middle: How Language Models Use Long Contexts

Liu et al. · 2023 · TACL

arXiv:2307.03172

Showed that LLMs perform significantly worse when relevant information is placed in the middle of a long context — they recall well from the beginning and end, but 'lose' information in the middle. Has major implications for RAG chunk ordering and prompt construction strategies.

Context WindowRAGLong Context→ Working with Long Documents

In-Context Retrieval-Augmented Language Models

Shi et al. · 2023 · TACL

arXiv:2302.00083

Analyzed how effectively LLMs actually use retrieved context vs. their parametric knowledge. Found that models sometimes ignore retrieved evidence and hallucinate from training data. Motivates careful prompt design to explicitly instruct models to rely on provided context.

RAGHallucinationContext Engineering

Safety & Alignment

Research on making AI systems safe, honest, and resistant to misuse.

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic) · 2022 · arXiv

arXiv:2212.08073

Introduced Constitutional AI (CAI): using a set of human-written principles (a 'constitution') and AI-generated critiques to train harmless models without needing as many human labels. The approach behind Claude's safety training.

AlignmentSafetyRLHFClaude→ Risks & Safety Track

Prompt Injection Attacks and Defenses in LLM-Integrated Applications

Liu et al. · 2023 · IEEE S&P

arXiv:2306.05499

Systematic study of prompt injection attacks in real-world LLM applications — how malicious content in user input or retrieved documents can override system instructions. Evaluated defenses and found that no existing defense is fully robust. Essential reading for anyone building production AI systems.

Prompt InjectionSecurityRed-Teaming→ Prompt Injection Lesson

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Lin et al. · 2022 · ACL

arXiv:2109.07958

Introduced TruthfulQA, a benchmark of questions designed to elicit model hallucinations — questions where humans commonly hold false beliefs. Showed that larger models are more confidently wrong on some question types, challenging the assumption that scale improves truthfulness.

HallucinationEvaluationTruthfulness→ Hallucinations Lesson

Jailbroken: How Does LLM Safety Training Fail?

Wei et al. · 2023 · NeurIPS

arXiv:2307.02483

Analyzed why safety fine-tuning fails against jailbreaks. Identified two root causes: competing objectives (the model trained to be helpful and harmless has these objectives in tension) and mismatched generalization (safety training covers fewer scenarios than attack surface). Explains why no model is jailbreak-proof.

JailbreakingSafetyRed-Teaming→ Jailbreaking Lesson

Evaluation

Benchmarks and frameworks for measuring LLM and prompting system performance.

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)

Srivastava et al. (204-author collaboration) · 2022 · TMLR

arXiv:2206.04615

A massive collaborative benchmark of 204 tasks covering reasoning, code, math, linguistics, and more — designed to test capabilities beyond existing benchmarks. Introduced BIG-Bench Hard (BBH), the subset that frontier models still struggle with.

EvaluationBenchmarksReasoning

Holistic Evaluation of Language Models (HELM)

Liang et al. (Stanford CRFM) · 2022 · NeurIPS

arXiv:2211.09110

A framework for multidimensional evaluation of LLMs across accuracy, calibration, robustness, fairness, efficiency, and more. Emphasizes that no single metric is sufficient for comparing models — the prompt format used during evaluation substantially affects results.

EvaluationBenchmarksFairness

Large Language Models are not Fair Evaluators

Wang et al. · 2023 · ACL

arXiv:2305.17926

Showed that when using LLMs as judges (LLM-as-judge evaluation), the order in which candidates are presented significantly biases the result — models consistently favor the first or second option. Critical methodological reading for anyone building LLM-based evaluation systems.

EvaluationLLM-as-JudgeBias→ Evaluation Frameworks Lesson

Put the Theory into Practice

The Learn tracks cover every technique from these papers with practical examples and exercises.

Start Learning