📖 About TwinPeaks Bench
🎯 What is TwinPeaks Bench?
TwinPeaks Bench is a specialized LLM evaluation benchmark that tests how well large language models know obscure details from the cult classic TV series Twin Peaks. It consists of 26 carefully curated questions covering characters, plot details, and memorable moments from the show.
Unlike traditional benchmarks that test reasoning or coding abilities, TwinPeaks Bench evaluates pure knowledge recall of specific factual information. This makes it an interesting test of how well models have encoded niche cultural knowledge in their training data.
🔬 Methodology
Question Set
The benchmark consists of 26 questions ranging from very easy to very hard. Each question has a specific expected answer and is rated on a 5-star difficulty scale:
- ⭐ Very Easy: General knowledge anyone familiar with the show would know
- ⭐⭐ Easy: Basic plot points and main character details
- ⭐⭐⭐ Medium: Specific details requiring careful viewing
- ⭐⭐⭐⭐ Hard: Obscure facts and minor character details
- ⭐⭐⭐⭐⭐ Very Hard: Extremely specific details only true fans would recall
Evaluation Modes
Each model is tested in two different modes:
NO SEARCH: Models answer based solely on their internal knowledge from training data. This tests what information they've memorized.
WITH SEARCH: Models can use web search capabilities to look up information. This tests their ability to find and extract correct information.
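As an illustration, the mode toggle might look like the sketch below for an OpenAI-served model; the function name, placeholder model string, and use of the Responses API web search tool are assumptions for illustration, not the repository's actual code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, use_search: bool) -> str:
    """Pose one benchmark question in either NO SEARCH or WITH SEARCH mode."""
    kwargs = {"model": "gpt-4.1", "input": question}  # placeholder model name
    if use_search:
        # WITH SEARCH: allow the model to call the hosted web search tool
        kwargs["tools"] = [{"type": "web_search_preview"}]
    response = client.responses.create(**kwargs)
    return response.output_text
```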
Scoring System
Each model provides 3 independent answers per question (for statistical robustness). Responses are evaluated by Claude Haiku acting as a judge, which:
- Compares the model's answer to the expected answer
- Determines if the answer is correct (even if phrased differently)
- Provides reasoning for the judgment
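A minimal sketch of that judging step, assuming a simple free-text verdict format (the prompt wording, helper name, and exact Haiku model string are guesses, not taken from the repository):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(question: str, expected: str, answer: str) -> tuple[bool, str]:
    """Ask a Haiku judge whether the candidate answer matches the expected one."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Candidate answer: {answer}\n\n"
        "Does the candidate answer state the same fact as the expected answer, "
        "even if phrased differently? Start your reply with CORRECT or INCORRECT, "
        "then give one sentence of reasoning."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed judge model string
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip()
    return verdict.upper().startswith("CORRECT"), verdict
```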
Metrics
We report three key metrics for each model:
Accuracy: Percentage of all attempts answered correctly across the 3 trials (out of 78 attempts per mode)
Pass@1: Percentage of questions where the model got it right on the first try
Pass@3: Percentage of questions where the model got it right at least once in 3 attempts
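Concretely, for a single mode, all three metrics fall out of the per-trial judge verdicts; a small sketch (the data layout and names are illustrative, not the pipeline's actual code):

```python
def compute_metrics(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps each question id to its per-trial verdicts, e.g. {"q01": [True, False, True], ...}."""
    n_questions = len(results)                                    # 26
    n_attempts = sum(len(trials) for trials in results.values())  # 26 questions x 3 trials = 78

    accuracy = sum(sum(trials) for trials in results.values()) / n_attempts
    pass_at_1 = sum(trials[0] for trials in results.values()) / n_questions
    pass_at_3 = sum(any(trials) for trials in results.values()) / n_questions

    return {"accuracy": accuracy, "pass@1": pass_at_1, "pass@3": pass_at_3}
```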
🤖 Models Tested
We evaluate the latest frontier models from leading AI labs:
- Claude Sonnet 4.5
- Claude Opus 4.5
- GPT-5.1
- GPT-5.2
- Gemini 3
- Gemini Flash
Each model therefore completes 26 questions × 2 modes (no search/with search) × 3 trials = 156 responses.
📊 Total Evaluations
The complete benchmark consists of 936 individual test records:
- 6 models tested
- 26 questions per model
- 2 modes (no search/with search)
- 3 trials per question
- Total: 6 × 26 × 2 × 3 = 936 evaluations
💡 Why Twin Peaks?
Twin Peaks makes an excellent benchmark subject for several reasons:
- Culturally significant: Well-known show that likely appears in training data
- Rich detail: Complex plot with many specific facts to test
- Difficulty spectrum: Easy to create questions ranging from trivial to extremely obscure
- Objective answers: Most questions have clear, verifiable correct answers
- Limited scope: Finite source material (2 seasons + movie) makes it a contained knowledge domain
🛠️ Technical Details
The evaluation pipeline is built with Python and uses:
- OpenAI API: For GPT models
- Anthropic API: For Claude models and judge evaluation
- Google Gemini API: For Gemini models
- SQLite: For storing evaluation history (a sketch of a plausible schema follows this list)
- GitHub Pages: For hosting this website
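For reference, the SQLite history store could be shaped roughly as follows; the table and column names are illustrative guesses, not the repository's actual schema:

```python
import sqlite3

conn = sqlite3.connect("results.db")  # illustrative filename
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS evaluations (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        model       TEXT NOT NULL,      -- e.g. "Claude Opus 4.5"
        question_id INTEGER NOT NULL,   -- 1..26
        mode        TEXT NOT NULL,      -- "no_search" or "with_search"
        trial       INTEGER NOT NULL,   -- 1..3
        answer      TEXT,               -- the model's raw response
        correct     INTEGER,            -- judge verdict: 1 = correct, 0 = incorrect
        reasoning   TEXT,               -- judge's explanation
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.commit()
```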
The full source code, question set, and evaluation results are available on GitHub.
🚀 Running Your Own Evaluation
You can run the benchmark yourself with your own API keys:
- Clone the repository
- Install dependencies: pip install -r requirements.txt
- Copy .env.example to .env and add your API keys (see the example after this list)
- Run: python run_full_benchmark.py
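For example, a filled-in .env typically uses the default variable names each provider's SDK reads; the exact names the pipeline expects are defined in .env.example, so treat these as placeholders:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```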
See the repository README for detailed instructions.
👤 About the Creator
TwinPeaks Bench was created by @frabert101010 as an experiment in specialized LLM evaluation.
Questions, feedback, or ideas? Open an issue on the GitHub repository!