LLM Evaluation UX Research Study - Team of 3

Tasked by an LLM evaluation software startup with interviewing project managers and software engineers to identify how teams currently evaluate their LLMs, which features are essential in an LLM evaluation tool, and what the current market of evaluation tools looks like.

Methods:

  • Competitive Teardown

  • In-Depth Interviews

Research areas included:

  • Current Landscape: How are TPMs and SWEs currently owning AI features?

  • Development Process: How do TPMs and SWEs determine the minimum requirements for an AI agent to launch?

  • LLM Evaluation + Iteration: How do software teams evaluate AI agents? What are unmet needs and untapped opportunities in this evaluation / iteration process?

  • Evaluation Tooling Expectations: What do users expect from the tools they use to evaluate their AI agents? What are barriers to adopting a new LLM eval tool? What are MVP requirements?

Conducted 2 of the 6 user interviews, with 1 technical project manager and 1 software engineer. Delivered and presented 4 key findings to the startup founder, detailing the current LLM evaluation landscape, the minimum requirements needed to launch, current LLM evaluation processes, and LLM tool expectations.

Completed the Executive Summary and the current LLM evaluation market landscape.

Executive Summary Key Findings:

  • Target users are TPMs, SWEs, and PMs: TPMs and SWEs drive AI feature implementation and testing, while PMs deploy AI agents as consultants.

  • AI evaluation is largely manual, leaving edge cases unaddressed and making quantitative metrics difficult for non-TPMs to define.

  • No standardized AI launch criteria: Most features are manually tested and iterated post-deployment.

    • This creates an opportunity to define and deliver minimum launch requirements for users.

  • AI evaluation tools need to save time, reduce token usage, and account for edge cases.

Final Report:

Collaborators on project: Luke Fitzpatrick, Vikki Wong
