📖 About TwinPeaks Bench
🎯 What is TwinPeaks Bench?
TwinPeaks Bench is a specialized LLM evaluation benchmark that tests how well large language models know obscure details from the cult classic TV series Twin Peaks. It consists of 26 carefully curated questions covering characters, plot details, and memorable moments from the show.
Unlike traditional benchmarks that test reasoning or coding abilities, TwinPeaks Bench evaluates pure knowledge recall of specific factual information. This makes it an interesting test of how well models have encoded niche cultural knowledge in their training data.
🔬 Methodology
Question Set
The benchmark consists of 26 questions ranging from very easy to very hard. Each question has a specific expected answer and is rated on a 5-star difficulty scale:
- ⭐ Very Easy: General knowledge anyone familiar with the show would know
- ⭐⭐ Easy: Basic plot points and main character details
- ⭐⭐⭐ Medium: Specific details requiring careful viewing
- ⭐⭐⭐⭐ Hard: Obscure facts and minor character details
- ⭐⭐⭐⭐⭐ Very Hard: Extremely specific details only true fans would recall
Evaluation Modes
Each model is tested in two different modes:
NO SEARCH: Models answer based solely on their internal knowledge from training data. This tests what information they've memorized.
WITH SEARCH: Models can use web search capabilities to look up information. This tests their ability to find and extract correct information.
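As an illustration, the mode toggle might look like the sketch below for an OpenAI-served model; the function name, placeholder model string, and use of the Responses API web search tool are assumptions for illustration, not the repository's actual code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, use_search: bool) -> str:
    """Pose one benchmark question in either NO SEARCH or WITH SEARCH mode."""
    kwargs = {"model": "gpt-4.1", "input": question}  # placeholder model name
    if use_search:
        # WITH SEARCH: allow the model to call the hosted web search tool
        kwargs["tools"] = [{"type": "web_search_preview"}]
    response = client.responses.create(**kwargs)
    return response.output_text
```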
Scoring System
Each model provides 3 independent answers per question (for statistical robustness). Responses are evaluated by Claude Haiku acting as a judge, which:
- Compares the model's answer to the expected answer
- Determines if the answer is correct (even if phrased differently)
- Provides reasoning for the judgment
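A minimal sketch of that judging step, assuming a simple free-text verdict format (the prompt wording, helper name, and exact Haiku model string are guesses, not taken from the repository):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(question: str, expected: str, answer: str) -> tuple[bool, str]:
    """Ask a Haiku judge whether the candidate answer matches the expected one."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Candidate answer: {answer}\n\n"
        "Does the candidate answer state the same fact as the expected answer, "
        "even if phrased differently? Start your reply with CORRECT or INCORRECT, "
        "then give one sentence of reasoning."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed judge model string
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip()
    return verdict.upper().startswith("CORRECT"), verdict
```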
Metrics
We report three key metrics for each model:
Accuracy: Percentage of all attempts answered correctly across the 3 trials (out of 78 attempts per mode)
Pass@1: Percentage of questions where the model got it right on the first try
Pass@3: Percentage of questions where the model got it right at least once in 3 attempts
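Concretely, for a single mode, all three metrics fall out of the per-trial judge verdicts; a small sketch (the data layout and names are illustrative, not the pipeline's actual code):

```python
def compute_metrics(results: dict[str, list[bool]]) -> dict[str, float]:
    """results maps each question id to its per-trial verdicts, e.g. {"q01": [True, False, True], ...}."""
    n_questions = len(results)                                    # 26
    n_attempts = sum(len(trials) for trials in results.values())  # 26 questions x 3 trials = 78

    accuracy = sum(sum(trials) for trials in results.values()) / n_attempts
    pass_at_1 = sum(trials[0] for trials in results.values()) / n_questions
    pass_at_3 = sum(any(trials) for trials in results.values()) / n_questions

    return {"accuracy": accuracy, "pass@1": pass_at_1, "pass@3": pass_at_3}
```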
🤖 Models Tested
We evaluate the latest frontier models from leading AI labs:
- Claude Sonnet 4.5
- Claude Opus 4.5
- GPT-5.1
- GPT-5.2
- Gemini 3
- Gemini Flash
Each model therefore completes 26 questions × 2 modes (no search/with search) × 3 trials = 156 responses.
📊 Total Evaluations
The complete benchmark consists of 936 individual test records:
- 6 models tested
- 26 questions per model
- 2 modes (no search/with search)
- 3 trials per question
- Total: 6 × 26 × 2 × 3 = 936 evaluations
💡 Why Twin Peaks?
Twin Peaks makes an excellent benchmark subject for several reasons:
- Culturally significant: Well-known show that likely appears in training data
- Rich detail: Complex plot with many specific facts to test
- Difficulty spectrum: Easy to create questions ranging from trivial to extremely obscure
- Objective answers: Most questions have clear, verifiable correct answers
- Limited scope: Finite source material (2 seasons + movie) makes it a contained knowledge domain
🛠️ Technical Details
The evaluation pipeline is built with Python and uses:
- OpenAI API: For GPT models
- Anthropic API: For Claude models and judge evaluation
- Google Gemini API: For Gemini models
- SQLite: For storing evaluation history (a sketch of a plausible schema follows this list)
- GitHub Pages: For hosting this website
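For reference, the SQLite history store could be shaped roughly as follows; the table and column names are illustrative guesses, not the repository's actual schema:

```python
import sqlite3

conn = sqlite3.connect("results.db")  # illustrative filename
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS evaluations (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        model       TEXT NOT NULL,      -- e.g. "Claude Opus 4.5"
        question_id INTEGER NOT NULL,   -- 1..26
        mode        TEXT NOT NULL,      -- "no_search" or "with_search"
        trial       INTEGER NOT NULL,   -- 1..3
        answer      TEXT,               -- the model's raw response
        correct     INTEGER,            -- judge verdict: 1 = correct, 0 = incorrect
        reasoning   TEXT,               -- judge's explanation
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.commit()
```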
The full source code, question set, and evaluation results are available on GitHub.
🚀 Running Your Own Evaluation
You can run the benchmark yourself with your own API keys:
- Clone the repository
- Install dependencies: pip install -r requirements.txt
- Copy .env.example to .env and add your API keys (see the example after this list)
- Run: python run_full_benchmark.py
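For example, a filled-in .env typically uses the default variable names each provider's SDK reads; the exact names the pipeline expects are defined in .env.example, so treat these as placeholders:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```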
See the repository README for detailed instructions.
👤 About the Creator
TwinPeaks Bench was created by @frabert101010 as an experiment in specialized LLM evaluation.
Questions, feedback, or ideas? Open an issue on the GitHub repository!