At the Design and Partnership Lab, I worked on evaluating large language models in an educational context where outputs directly affect human judgment. My work focused on building and assessing an LLM-based rubric scoring pipeline used to support educators. One of my first responsibilities was curating a labeled dataset of over 120 ground-truth samples spanning a twelve-domain rubric. This process exposed me to the inherent ambiguity of qualitative evaluation and the challenge of establishing consistent labels across evaluators.
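One way to quantify that consistency is inter-rater agreement. The sketch below computes Cohen's kappa between two evaluators' labels on the same samples; the rater arrays and label values are hypothetical illustrations, not the lab's data.

```typescript
// Cohen's kappa: agreement between two raters, corrected for chance.
// The label set and example data below are hypothetical.
function cohensKappa(raterA: string[], raterB: string[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("Rater label lists must be non-empty and equal length");
  }
  const n = raterA.length;
  const labels = Array.from(new Set([...raterA, ...raterB]));

  // Observed agreement: fraction of samples where both raters match.
  const observed = raterA.filter((label, i) => label === raterB[i]).length / n;

  // Expected agreement under chance, from each rater's marginal distribution.
  let expected = 0;
  for (const label of labels) {
    const pA = raterA.filter((l) => l === label).length / n;
    const pB = raterB.filter((l) => l === label).length / n;
    expected += pA * pB;
  }
  if (expected === 1) return 1; // degenerate case: a single shared label

  return (observed - expected) / (1 - expected);
}

// Example: two evaluators labeling the same samples on one rubric domain.
const raterA = ["meets", "exceeds", "meets", "below", "meets"];
const raterB = ["meets", "meets", "meets", "below", "exceeds"];
console.log(cohensKappa(raterA, raterB).toFixed(2));
```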
Beyond dataset creation, I designed evaluation frameworks that measured not only accuracy but also consistency and failure patterns across repeated model runs. I developed domain-specific prompting strategies and structured evaluation criteria covering five competency dimensions, which let us compare the stability of qualitative feedback rather than relying solely on aggregate scores. This work highlighted how traditional machine learning metrics often fail to reflect user-facing reliability and fairness.
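As one illustration of what a consistency check across repeated runs can look like, the sketch below reports, per rubric domain, how often repeated scorer runs agree exactly on the same sample. The RunScore shape and function names are my own assumptions for this sketch, not the lab's pipeline.

```typescript
// One score assigned by the LLM scorer on a single run (hypothetical shape).
interface RunScore {
  sampleId: string;
  domain: string; // one of the twelve rubric domains
  score: number;
}

// Fraction of agreeing pairs among repeated-run scores, grouped by domain.
function perDomainConsistency(runs: RunScore[][]): Map<string, number> {
  const byKey = new Map<string, number[]>();
  for (const run of runs) {
    for (const r of run) {
      const key = `${r.sampleId}::${r.domain}`;
      const scores = byKey.get(key) ?? [];
      scores.push(r.score);
      byKey.set(key, scores);
    }
  }

  const agree = new Map<string, number>();
  const total = new Map<string, number>();
  for (const [key, scores] of byKey) {
    const domain = key.split("::")[1];
    // Count agreeing pairs among the repeated runs for this sample/domain.
    for (let i = 0; i < scores.length; i++) {
      for (let j = i + 1; j < scores.length; j++) {
        total.set(domain, (total.get(domain) ?? 0) + 1);
        if (scores[i] === scores[j]) {
          agree.set(domain, (agree.get(domain) ?? 0) + 1);
        }
      }
    }
  }

  const result = new Map<string, number>();
  for (const [domain, t] of total) {
    result.set(domain, (agree.get(domain) ?? 0) / t);
  }
  return result;
}

// Example: two repeated runs over one sample.
const runs: RunScore[][] = [
  [{ sampleId: "s1", domain: "collaboration", score: 3 }],
  [{ sampleId: "s1", domain: "collaboration", score: 3 }],
];
console.log(perDomainConsistency(runs)); // Map { "collaboration" => 1 }
```

A pairwise-agreement rate like this complements aggregate accuracy: a scorer can match ground truth on average while still flipping scores between runs, which is exactly the kind of instability that matters to end users.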
I also implemented a human-in-the-loop evaluation dashboard using React, Next.js, and Prisma. The goal was to support auditability, manual overrides, and traceability for every model-generated score. Knowing that the system would be used by over a thousand educators serving tens of thousands of students shaped many of my design decisions. This experience reinforced my interest in trustworthy machine learning systems and in building tooling that augments rather than replaces human judgment.
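To make the traceability goal concrete, here is a minimal sketch of how a manual override might be recorded alongside the original model score using the Prisma client. The model names and fields (modelScore, scoreOverride) are hypothetical, not the dashboard's actual schema.

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Record an educator's override without losing the original model output.
async function overrideScore(
  scoreId: string,
  educatorId: string,
  newScore: number,
  reason: string
): Promise<void> {
  // Wrap the audit record and the score update in one transaction so the
  // visible score never changes without a corresponding override entry.
  await prisma.$transaction(async (tx) => {
    const original = await tx.modelScore.findUniqueOrThrow({
      where: { id: scoreId },
    });

    await tx.scoreOverride.create({
      data: {
        scoreId,
        educatorId,
        previousScore: original.score,
        newScore,
        reason,
      },
    });

    await tx.modelScore.update({
      where: { id: scoreId },
      data: { score: newScore, overridden: true },
    });
  });
}
```

Keeping the override as a separate record, rather than mutating the score in place, is what makes every model-generated score auditable after the fact.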