EP 02 Jul 30, 2024 1h 23m

Dhruv Singh on LLM evaluation at scale

With Dhruv Singh, Founder, Honeyhive.ai

A long conversation on LLM evaluation in production — vibe checks vs. principled metrics, Bradley–Terry scoring, chain-of-thought prompting, wrapper companies, and the uncomfortable question of P(doom).

Show notes

  • 00:00 — Highlights
  • 04:13 — Introduction and greetings
  • 07:57 — Exploring Honeyhive’s origins
  • 12:44 — Founding Honeyhive and early challenges
  • 15:08 — Honeyhive’s approach and product philosophy
  • 15:57 — Open source vs. closed source LLMOps
  • 21:00 — Llama’s latest release and evaluation
  • 24:56 — Challenges in AI evaluation and the future
  • 32:03 — What are vibe checks?
  • 36:44 — RLHF, Bradley–Terry (Elo), etc.
  • 44:47 — Custom-built eval models and their advantages
  • 45:53 — Challenges in using LLMs for evaluation
  • 46:48 — P(doom)
  • 48:02 — The future of AGI
  • 49:46 — Chatbots vs. agents
  • 55:52 — The role of wrapper companies
  • 58:39 — Chain-of-thought (CoT) prompting outperforms
  • 1:03:01 — Scaling LLM evaluation
  • 1:05:39 — Synthetic data and its impact
  • 1:11:00 — Advice for implementing LLM evaluations
  • 1:17:28 — Benchmarking metrics: Bradley–Terry, Elo
  • 1:22:44 — Concluding thoughts