EP 02 · Jul 30, 2024 · 1h 23m
Dhruv Singh on LLM evaluation at scale
With Dhruv Singh, Founder, Honeyhive.ai
A long conversation on LLM evaluation in production — vibe checks vs. principled metrics, Bradley–Terry scoring, chain-of-thought prompting, wrapper companies, and the uncomfortable question of P(doom).
Show notes
- 00:00 — Highlights
- 04:13 — Introduction and greetings
- 07:57 — Exploring Honeyhive’s origins
- 12:44 — Founding Honeyhive and early challenges
- 15:08 — Honeyhive’s approach and product philosophy
- 15:57 — Open source vs. closed source LLM Ops
- 21:00 — Llama’s latest release and evaluation
- 24:56 — Challenges in AI evaluation and the future
- 32:03 — What are vibe checks?
- 36:44 — RLHF, Bradley–Terry (Elo), etc.
- 44:47 — Custom-built eval models and their advantages
- 45:53 — Challenges in using LLMs for evaluation
- 46:48 — P(doom)
- 48:02 — The future of AGI
- 49:46 — Chatbots vs. agents
- 55:52 — The role of wrapper companies
- 58:39 — Chain-of-thought (CoT) prompting outperforms
- 1:03:01 — Scaling LLM evaluation
- 1:05:39 — Synthetic data and its impact
- 1:11:00 — Advice for implementing LLM evaluations
- 1:17:28 — Benchmarking metrics: Bradley–Terry, Elo
- 1:22:44 — Concluding thoughts