## evals
A small, hand-curated benchmark of questions about my CV. Click run and the agent loop runs entirely in your browser on Chrome's on-device Gemini Nano. Each answer is graded against expected keywords plus a hallucination-guard list. The judge is deliberately dumb (just substring checks), so the verdict is reproducible and free.
why it exists: because shipping an agentic feature without an eval harness is how you find out it's broken from a recruiter's tweet.
pass: 0 · fail: 0 · skip: 0 · total: 10
- [pending] wbg-role · q: What's your current role?
- [pending] agentic · q: Tell me about your agentic AI work.
- [pending] ifc-malena · q: What did you build at IFC?
- [pending] kubernetes · q: How have you used Kubernetes in production?
- [pending] publications · q: Have you published research?
- [pending] speaker · q: Have you spoken at conferences?
- [pending] aidevex · q: What is AIDevEx?
- [pending] boa · q: Tell me about your time at Bank of America.
- [pending] tcs · q: Did you work on Java early in your career?
- [pending] hallucination-guard · q: Have you worked at Google?
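
For the curious, here is a rough sketch of what a substring judge like the one described at the top of this section could look like. It is not the code running on this page. The `EvalCase` shape, the `judge` and `tally` names, the example keyword list, the all-keywords-must-match rule, and the idea that a missing answer counts as a skip are all assumptions made for illustration.

```typescript
// Hypothetical case shape: field names are illustrative, not the site's real schema.
interface EvalCase {
  id: string;
  question: string;
  expected: string[];  // keywords the answer should contain
  forbidden: string[]; // hallucination-guard phrases the answer must not contain
}

type Verdict = "pass" | "fail" | "skip";

// Case-insensitive substring checks only: no model call, so the same answer
// always gets the same verdict.
function judge(testCase: EvalCase, answer: string | null): Verdict {
  // Assumption: null means the on-device model was unavailable, so nothing to grade.
  if (answer === null) return "skip";
  const text = answer.toLowerCase();
  // Assumption: every expected keyword must appear (an "any keyword" rule is equally plausible).
  const hasExpected = testCase.expected.every((k) => text.includes(k.toLowerCase()));
  const hasForbidden = testCase.forbidden.some((k) => text.includes(k.toLowerCase()));
  return hasExpected && !hasForbidden ? "pass" : "fail";
}

// Roll verdicts up into the pass / fail / skip / total counters.
function tally(verdicts: Verdict[]): Record<Verdict, number> & { total: number } {
  const counts = { pass: 0, fail: 0, skip: 0, total: verdicts.length };
  for (const v of verdicts) counts[v] += 1;
  return counts;
}

// Usage with a made-up guard list for the last question in the benchmark.
const guardCase: EvalCase = {
  id: "hallucination-guard",
  question: "Have you worked at Google?",
  expected: [],
  forbidden: ["worked at google"], // hypothetical phrase, not the real list
};

const verdicts: Verdict[] = [
  judge(guardCase, "No, that is not on my CV."),             // no forbidden phrase present
  judge(guardCase, "Yes, I worked at Google for two years."), // forbidden phrase present
  judge(guardCase, null),                                     // model unavailable
];
console.log(tally(verdicts));
```

The trade-off is the obvious one: substring checks miss paraphrases and can be fooled by a keyword appearing in the wrong context, but they cost nothing and never disagree with themselves between runs.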