EP 02 Jul 30, 2024 1h 23m

Dhruv Singh on LLM evaluation at scale

With Dhruv Singh, Founder, Honeyhive.ai

A long conversation on LLM evaluation in production — vibe checks vs. principled metrics, Bradley–Terry scoring, chain-of-thought prompting, wrapper companies, and the uncomfortable question of P(doom).

Show notes

  • 00:00 — Highlights
  • 04:13 — Introduction and greetings
  • 07:57 — Exploring Honeyhive’s origins
  • 12:44 — Founding Honeyhive and early challenges
  • 15:08 — Honeyhive’s approach and product philosophy
  • 15:57 — Open source vs. closed source LLMOps
  • 21:00 — Llama’s latest release and evaluation
  • 24:56 — Challenges in AI evaluation and the future
  • 32:03 — What are vibe checks?
  • 36:44 — RLHF, Bradley–Terry (Elo), etc.
  • 44:47 — Custom-built eval models and their advantages
  • 45:53 — Challenges in using LLMs for evaluation
  • 46:48 — P(doom)
  • 48:02 — The future of AGI
  • 49:46 — Chatbots vs. agents
  • 55:52 — The role of wrapper companies
  • 58:39 — Chain-of-thought (CoT) prompting outperforms
  • 1:03:01 — Scaling LLM evaluation
  • 1:05:39 — Synthetic data and its impact
  • 1:11:00 — Advice for implementing LLM evaluations
  • 1:17:28 — Benchmarking metrics: Bradley–Terry, Elo
  • 1:22:44 — Concluding thoughts