Voice-of-customer agent

extracting on-demand insights from customer meeting transcripts

Why I made this

As the lead of product, I wanted to stay as close as possible to all my customers, even when the volume of interactions meant I could not be in the room for every conversation. I wanted a quick and accurate way to pull insights out of meetings, so I built this agent. I used it myself first and later shared it internally with my org. It ended up being used widely and regularly because it was flexible: each team could ask for what it needed to learn from our customers.

  • The GTM team could ask: Did we follow the GTM playbook and open with a discovery of the customer's end goal in the last 2 customer calls?
  • The product team could ask: What pain points did we uncover in the Tuesday meeting with Sarah from SynBio?

They would receive responses with reference links to specific parts of the meeting transcript that they could click on and watch.

The Impact

  1. Lengthy conversations were boiled down to key insights in a structured way
  2. Data-driven decision making became possible in the early days of product growth (we could ask questions about multiple customer calls in aggregate and get quantitative responses)
  3. Voice-of-customer was democratized across the company. Anyone could ask the agent their own questions and learn directly from the customer's voice.

Eval Harness and Metrics

I used an eval harness to set up tasks with deterministic inputs; each run of a task produced an output and a transcript. I ran many trials of each task and scored every trial with a grader. My graders were code graders, LLM-as-a-judge graders or, ultimately, human graders. I divided the work into two phases: (1) pre-release evals and (2) post-release ongoing evals.
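A minimal sketch of what such a harness could look like, in Python. The names `Task`, `Trial`, `run_trials`, the `[ref: ...]` marker and the stub agent are all hypothetical; the real agent and its trace format are not shown in this write-up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str  # fixed input, so every trial starts from the same state


@dataclass
class Trial:
    output: str            # the agent's final answer
    transcript: List[str]  # trace of the run (tool calls, messages); assumed shape


# A grader looks at one trial of one task and returns pass/fail.
Grader = Callable[[Task, Trial], bool]


def run_trials(
    task: Task,
    agent: Callable[[str], Trial],
    graders: Dict[str, Grader],
    n_trials: int = 5,
) -> List[Dict[str, bool]]:
    """Run the agent n_trials times on one task and grade every trial."""
    results = []
    for _ in range(n_trials):
        trial = agent(task.prompt)
        results.append({name: grade(task, trial) for name, grade in graders.items()})
    return results


def pass_rate(results: List[Dict[str, bool]], grader_name: str) -> float:
    """Fraction of trials that the named grader passed."""
    return sum(r[grader_name] for r in results) / len(results)


# Stub agent for illustration only; a real agent would call an LLM.
def stub_agent(prompt: str) -> Trial:
    return Trial(
        output="Onboarding time was the main pain point [ref: 00:12:30]",
        transcript=["tool_call:transcript_fetch"],
    )


task = Task(name="pain_points", prompt="What pain points came up in the last call?")
graders = {"has_reference": lambda task, trial: "[ref:" in trial.output}
results = run_trials(task, stub_agent, graders, n_trials=5)
print(pass_rate(results, "has_reference"))
```

Running several trials per task matters because agent outputs vary from run to run; a per-grader pass rate is easier to track over time than a single pass/fail.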

Pre-release evals

I started by asking two questions. What can a code grader check cheaply and reliably? And what needs judgement from an LLM or a human, and which rubric gets a high level of agreement (kappa > 0.6) between the LLM and the human expert (the benevolent dictator, aka myself)? Here is how I ended up using the different graders:

Code grader

  • Did the sub-agent use the transcript-fetch tool? (Yes/No)
  • Did the agent send a response schema to the sub-agent? (Yes/No)
  • Did the agent's response include a reference to a transcript section? (Yes/No)
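The three code-grader checks above could be sketched as plain Python predicates. This assumes a trace stored as a list of event dictionaries and a `[ref: ...]` marker for transcript references; both the event fields (`actor`, `type`, `tool`, `response_schema`) and the marker format are hypothetical, chosen only for illustration.

```python
import re
from typing import Any, Dict, List

Event = Dict[str, Any]

# Assumed reference format, e.g. "[ref: 00:12:30]"; the real format may differ.
REF_PATTERN = re.compile(r"\[ref:\s*[^\]]+\]")


def subagent_used_transcript_fetch(trace: List[Event]) -> bool:
    """Did any sub-agent event call the transcript-fetch tool?"""
    return any(
        e.get("actor") == "sub_agent"
        and e.get("type") == "tool_call"
        and e.get("tool") == "transcript_fetch"
        for e in trace
    )


def agent_sent_response_schema(trace: List[Event]) -> bool:
    """Did the agent attach a non-empty response schema when delegating?"""
    return any(
        e.get("actor") == "agent"
        and e.get("type") == "delegate"
        and bool(e.get("response_schema"))
        for e in trace
    )


def response_has_reference(response: str) -> bool:
    """Does the final answer contain at least one transcript reference?"""
    return REF_PATTERN.search(response) is not None


# Example trace and answer (made up for illustration).
trace = [
    {"actor": "agent", "type": "delegate", "response_schema": {"pain_points": "list"}},
    {"actor": "sub_agent", "type": "tool_call", "tool": "transcript_fetch"},
]
answer = "Onboarding time was the main pain point [ref: 00:12:30]"

checks = {
    "used_transcript_fetch": subagent_used_transcript_fetch(trace),
    "sent_response_schema": agent_sent_response_schema(trace),
    "has_reference": response_has_reference(answer),
}
print(checks)
```

Checks like these need no model call, which is what makes them cheap enough to run on every change.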

LLM grader

  • Is the reference relevant to the question asked? (Yes/No)
  • Was the agent's answer formatted according to the question? (Yes/No)
  • Did the agent's response address the user's question? (Yes/No)
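One way to wire up these Yes/No rubric questions is sketched below. The judge is passed in as a plain callable so that no particular LLM API is assumed; `build_prompt`, `parse_verdict`, `grade_with_llm` and the stub judge are hypothetical names, and the prompt wording is an illustration rather than the rubric actually used.

```python
from typing import Callable, Dict

# The three rubric questions from the list above, keyed by a short name.
RUBRIC = {
    "reference_relevant": "Is the reference relevant to the question asked?",
    "format_matches": "Was the agent's answer formatted according to the question?",
    "addresses_question": "Did the agent's response address the user's question?",
}


def build_prompt(criterion: str, question: str, answer: str) -> str:
    """Ask the judge one rubric question at a time, with a strict Yes/No reply."""
    return (
        f"{criterion}\n"
        "Reply with exactly one word: Yes or No.\n\n"
        f"User question: {question}\n"
        f"Agent answer: {answer}\n"
    )


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a boolean; reject anything that is not Yes/No."""
    verdict = raw.strip().strip(".!").lower()
    if verdict == "yes":
        return True
    if verdict == "no":
        return False
    raise ValueError(f"Unparseable judge reply: {raw!r}")


def grade_with_llm(
    question: str, answer: str, judge: Callable[[str], str]
) -> Dict[str, bool]:
    """Run every rubric question through the judge and collect the verdicts."""
    return {
        name: parse_verdict(judge(build_prompt(criterion, question, answer)))
        for name, criterion in RUBRIC.items()
    }


# Stub judge for illustration; a real one would send the prompt to an LLM.
def stub_judge(prompt: str) -> str:
    return "No" if "formatted" in prompt else "Yes"


verdicts = grade_with_llm(
    "What pain points did we uncover?",
    "Onboarding time [ref: 00:12:30]",
    stub_judge,
)
print(verdicts)
```

Asking one binary question per call keeps each verdict easy to compare against a human label, which is what an agreement score needs.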

Human grader

I worked through many cases by hand to identify failure modes, then improved the LLM grader's rubric accordingly. Some failure modes included:

  • Hallucination
  • Failure to ask clarifying questions
  • Failure to appropriately scope the question for sub-agents
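The agreement threshold used when tuning the rubric (kappa > 0.6 between the LLM grader and the human expert) can be checked with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A small sketch follows; the label lists are made-up illustration data, not results from this project.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each rater's own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up Yes/No verdicts on ten graded responses.
human = [True, True, False, True, False, False, True, True, False, True]
llm = [True, True, False, True, False, True, True, True, False, True]

kappa = cohens_kappa(human, llm)
print(round(kappa, 3), kappa > 0.6)
```

In this example the raters agree on 9 of 10 items, but kappa comes out lower than 0.9 because part of that agreement would be expected by chance.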

Post-release evals

  • Code graders became a gate for every PR merge
  • LLM graders provided feedback to the agentic loop so it could retry (3 retries max). If the answer still failed after 3 retries, the agent responded with a report explaining why the answer was uncertain
  • Sporadic human spot-checks
  • A human (myself) reviewed traces, with feedback from my team
  • (TODO) Add an explicit thumbs up/down feature on each response
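The retry behaviour driven by the LLM graders might look roughly like the loop below. It is a sketch under assumptions: "3 retries max" is read as one initial attempt plus up to three retries, and `generate`, `grade`, the stubs and the report wording are hypothetical stand-ins for the agent and its graders.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # from the text; counted here as retries after the first attempt


def answer_with_retries(
    question: str,
    generate: Callable[[str, List[str]], str],
    grade: Callable[[str, str], List[str]],
) -> Dict[str, object]:
    """Generate an answer, feeding grader failures back in until it passes."""
    feedback: List[str] = []
    answer = ""
    for attempt in range(1 + MAX_RETRIES):
        answer = generate(question, feedback)
        failures = grade(question, answer)  # names of the rubric checks that failed
        if not failures:
            return {"status": "ok", "answer": answer, "attempts": attempt + 1}
        feedback = failures
    # Still failing after the last retry: say so instead of answering confidently.
    return {
        "status": "uncertain",
        "answer": answer,
        "attempts": 1 + MAX_RETRIES,
        "report": "Answer is uncertain; checks still failing: " + ", ".join(feedback),
    }


# Stubs for illustration: the generator adds a reference once it is told one is missing.
def stub_generate(question: str, feedback: List[str]) -> str:
    if "missing_reference" in feedback:
        return "Pricing was raised [ref: 00:03:10]"
    return "Pricing was raised"


def stub_grade(question: str, answer: str) -> List[str]:
    return [] if "[ref:" in answer else ["missing_reference"]


result = answer_with_retries("What objections came up?", stub_generate, stub_grade)
print(result["status"], result["attempts"])
```

Capping the retries bounds latency and cost per task, and the explicit "uncertain" status lets the user see that the answer did not pass its checks.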

Metrics

  • Outcome metrics: task completion rate
  • Quality metrics: failure rate for code and LLM graders
  • Ops metrics: latency p95, cost-per-task
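As an illustration, the metrics listed above could be computed from per-task run records along these lines. The record fields (`completed`, `code_pass`, `llm_pass`, `latency_s`, `cost_usd`) and the sample values are invented for the example, and p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[bool, float]]


def p95(values: List[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def summarize(records: List[Record]) -> Dict[str, float]:
    """Outcome, quality and ops metrics over a batch of task runs."""
    n = len(records)
    return {
        "task_completion_rate": sum(r["completed"] for r in records) / n,
        "code_grader_failure_rate": sum(not r["code_pass"] for r in records) / n,
        "llm_grader_failure_rate": sum(not r["llm_pass"] for r in records) / n,
        "latency_p95_s": p95([r["latency_s"] for r in records]),
        "cost_per_task_usd": sum(r["cost_usd"] for r in records) / n,
    }


# Invented sample data for illustration.
records = [
    {"completed": True, "code_pass": True, "llm_pass": True, "latency_s": 2.0, "cost_usd": 0.04},
    {"completed": True, "code_pass": True, "llm_pass": False, "latency_s": 3.5, "cost_usd": 0.06},
    {"completed": False, "code_pass": False, "llm_pass": False, "latency_s": 9.0, "cost_usd": 0.10},
    {"completed": True, "code_pass": True, "llm_pass": True, "latency_s": 2.5, "cost_usd": 0.05},
    {"completed": True, "code_pass": True, "llm_pass": True, "latency_s": 4.0, "cost_usd": 0.05},
]

summary = summarize(records)
print(summary)
```

With only a handful of records, p95 is simply the slowest run; the figure becomes more meaningful as the number of tasks grows.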

ask me anything sequence

Detailed sequence

system diagram