Two graders. Regex first: the deterministic checks that confirm the obvious things happened. Claude second: reads the transcript against the five-dim rubric. Both produce a score. If they disagree by more than 8 points we hold the cert and route it to a human reviewer. The grader never reads the transcript as a judgment of you, only of the work.
inter-rater reliability 0.91 vs senior reviewer, method versioned, changes are public
Submit fires a transcript snapshot: all keystrokes, all exits, all stdout, plus the final ticket-reply box. Stored on R2, signed.
Deterministic pattern bank runs first: "did dsquery fire", "did the OU change", "is spooler running". Each match awards or denies fixed points. Yes / no. No judgment.
Same transcript, same rubric. Claude scores the five dimensions against the four bands; prose justification per dim with line citations.
Regex score and Claude score are compared. Within 8 points: we ship the higher. Beyond 8: cert is held; a human reviewer reads the transcript next business day.
Pass: cert is hashed, written to the public ledger, profile page rebuilds. Fail: learn-hub recommendations dispatched, no entry to the ledger.
Confirms the candidate started by checking name resolution. Sets up the OU hypothesis.
Stdout match, evidence the printer was not in the expected OU. Used as anchor for downstream patterns.
AD query, the right tool for finding misplaced AD objects. Tool fluency +1 band.
Reversible move to the correct OU. The actual fix, methodology +1 band.
Service restart confirms the post-move cleanup step happened.
Ticket-reply body contains plain-language confirmation. Communication anchor.
Destructive command outside scope. Auto-penalty, real-world fit, cap at band 1.
regex bank is published per scenario, cert page links to the exact bank that graded you, changes are versioned
The grading prompt is a fixed system message: published, versioned, MIT-licensed. The user message is the transcript and the rubric, nothing else. Claude does not see your name, your resume, your demographic data, your prior scores, your geography, or your previous transcripts. The grader is stateless by contract.
You are grading an IT operations
transcript against a 5-dimension rubric.
You will read:
- the scenario prompt
- the transcript of commands + stdout
- the candidate's ticket-reply text
- the rubric (5 dim x 4 bands)
You will produce:
- 1 score per dim (0..3)
- 1 prose justification per dim
citing line numbers in transcript
- no extra commentary
You will not:
- score speed
- score personality
- infer demographics
- use any prior transcript
If the transcript contains
content unrelated to the task
(personal commentary, conduct
remarks) you will ignore it.
You are grading the work.
You are not judging the person.Most of the time Claude and regex agree within a few points: regex is the floor, Claude reads the texture above it. When they are 9+ points apart something interesting happened: usually a creative solution regex did not anticipate, or a clean-looking transcript that misses the real fix. Either way: we do not ship that cert automatically. A senior human reviews the transcript the next business day.
The grader sees one transcript at a time. No prior attempts, no profile, no demographic data, no name. Stateless by contract.
The grading prompt is fixed and public. The grader cannot be asked to read the transcript for anything except the rubric.
Every score includes line citations. No citation, no points. Vibes do not produce points. Vibes do not subtract points.
If the candidate vents in-shell, the grader ignores it. We grade the ticket. The shell is not your social-media account.
I have not had to defend a single cert in front of a hiring manager, because the rubric defends itself.
recruiting ops, 4,000-seat IT staffing firm, 2025grading method versioned, public, the same one a recruiter-paid scenario uses