
## evals

A small, hand-curated benchmark of questions about my CV. Click run and the agent loop executes entirely in your browser on Chrome's on-device Gemini Nano. Each answer is graded against expected keywords plus a hallucination-guard list. The judge is deliberately dumb (just substring checks), so the verdict is reproducible and free.
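
A substring judge of this kind can be sketched in a few lines. This is a minimal sketch, not the page's actual code: the `EvalCase` field names, the case-insensitive matching, and the rule that every expected keyword must appear are all assumptions, and the keyword lists in the usage lines are hypothetical.

```typescript
// Hypothetical shape of one benchmark case; the field names are assumptions.
interface EvalCase {
  id: string;
  question: string;
  expected: string[];  // keywords the answer should contain
  forbidden: string[]; // hallucination-guard terms the answer must not contain
}

interface Verdict {
  pass: boolean;
  missing: string[];      // expected keywords that were not found
  hallucinated: string[]; // guard terms that were found
}

// Plain case-insensitive substring checks: no model call and no network,
// so the same answer always produces the same verdict.
function judge(c: EvalCase, answer: string): Verdict {
  const text = answer.toLowerCase();
  const missing = c.expected.filter((k) => !text.includes(k.toLowerCase()));
  const hallucinated = c.forbidden.filter((k) => text.includes(k.toLowerCase()));
  return {
    pass: missing.length === 0 && hallucinated.length === 0,
    missing,
    hallucinated,
  };
}

// Usage with made-up lists for the hallucination-guard question.
const guardCase: EvalCase = {
  id: "hallucination-guard",
  question: "Have you worked at Google?",
  expected: [],
  forbidden: ["i worked at google"],
};

console.log(judge(guardCase, "No, I have not worked at Google.").pass);
console.log(judge(guardCase, "Yes, I worked at Google for two years.").pass);
```

The trade-off is precision: a substring check cannot tell "I worked at Google" from a quoted denial of it, which is why the lists have to be curated by hand.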

why it exists: because shipping an agentic feature without an eval harness is how you find out it's broken from a recruiter's tweet.

pass: 0 · fail: 0 · skip: 0 · total: 10
  1. [pending] wbg-role
    q: What's your current role?
  2. [pending] agentic
    q: Tell me about your agentic AI work.
  3. [pending] ifc-malena
    q: What did you build at IFC?
  4. [pending] kubernetes
    q: How have you used Kubernetes in production?
  5. [pending] publications
    q: Have you published research?
  6. [pending] speaker
    q: Have you spoken at conferences?
  7. [pending] aidevex
    q: What is AIDevEx?
  8. [pending] boa
    q: Tell me about your time at Bank of America.
  9. [pending] tcs
    q: Did you work on Java early in your career?
  10. [pending] hallucination-guard
    q: Have you worked at Google?
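
The counter line above the list can be derived from the per-case statuses. The following is a rough sketch under assumptions: the `Status` union and the `summarize` helper are hypothetical names, and the four status values are inferred from the labels shown on this page.

```typescript
// Status labels as they appear on this page; the type itself is an assumption.
type Status = "pending" | "pass" | "fail" | "skip";

// Build the summary line from the current status of every case.
// Pending cases count toward the total but toward none of the three buckets.
function summarize(statuses: Status[]): string {
  const count = (s: Status): number => statuses.filter((x) => x === s).length;
  return (
    `pass: ${count("pass")} · fail: ${count("fail")} · ` +
    `skip: ${count("skip")} · total: ${statuses.length}`
  );
}

// Before a run, all ten cases are pending.
const beforeRun: Status[] = new Array<Status>(10).fill("pending");
console.log(summarize(beforeRun)); // pass: 0 · fail: 0 · skip: 0 · total: 10
```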