
surface 05, the method

How a scenario actually gets graded.

Two graders. Regex first: the deterministic checks that confirm the obvious things happened. Claude second: reads the transcript against the five-dim rubric. Both produce a score. If they disagree by more than 8 points we hold the cert and route it to a human reviewer. The grader never reads the transcript as a judgment of you, only of the work.


inter-rater reliability 0.91 vs senior reviewer, method versioned, changes are public

the pipeline

What happens between submit and the cert page.

step_01

Capture

Submit fires a transcript snapshot: every keystroke, every exit code, all stdout, plus the final ticket-reply box. Stored on R2, signed.

transcript, signed, 12 MB avg
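
The page says only that the snapshot is "signed" and does not name a scheme. As a rough sketch, assuming an HMAC-SHA256 tag over a canonical JSON encoding (the function names, the snapshot fields and the key handling are all hypothetical), signing and verification could look like this:

```python
import hashlib
import hmac
import json


def sign_snapshot(snapshot: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding.

    Assumption: HMAC is used here for illustration only; the page does
    not say which signing scheme protects the stored transcript.
    """
    canonical = json.dumps(
        snapshot, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def verify_snapshot(snapshot: dict, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_snapshot(snapshot, key), tag)
```

Sorting keys before encoding keeps the tag stable when the same snapshot is serialized in a different field order.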

step_02

Regex

Deterministic pattern bank runs first: "did dsquery fire", "did the OU change", "is spooler running". Each match awards or denies fixed points. Yes / no. No judgment.

objective, 80-120 patterns per scenario

step_03

Claude

Same transcript, same rubric. Claude scores the five dimensions against the four bands; prose justification per dim with line citations.

structured output, rubric-cited

step_04

Reconcile

Regex score and Claude score are compared. Within 8 points: we ship the higher. Beyond 8: cert is held; a human reviewer reads the transcript next business day.

delta <= 8 auto, else human
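
The reconcile rule is small enough to write down. A minimal sketch, assuming both scores are integers on the same point scale; the names `Decision` and `reconcile` are illustrative and not taken from the opsticket codebase:

```python
from dataclasses import dataclass
from typing import Optional

HOLD_THRESHOLD = 8  # points; the page states "delta <= 8 auto, else human"


@dataclass(frozen=True)
class Decision:
    action: str             # "auto" or "held"
    delta: int              # absolute gap between the two graders
    shipped: Optional[int]  # None while a human review is pending


def reconcile(regex_score: int, model_score: int) -> Decision:
    """Ship the higher score when the graders are close, else hold."""
    delta = abs(regex_score - model_score)
    if delta <= HOLD_THRESHOLD:
        return Decision("auto", delta, max(regex_score, model_score))
    return Decision("held", delta, None)
```

For example, `reconcile(78, 82)` ships at 82 with a delta of 4, while `reconcile(71, 84)` is held with a delta of 13.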

step_05

Mint

Pass: cert is hashed, written to the public ledger, profile page rebuilds. Fail: learn-hub recommendations dispatched, no entry to the ledger.

sha-256, ledger block, profile rebuild
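
One way to picture the mint step, as a sketch under assumptions: the page confirms SHA-256 and a public ledger, but the record fields, the canonical JSON encoding and the in-memory `ledger` list below are hypothetical stand-ins.

```python
import hashlib
import json
from typing import Optional


def cert_hash(cert: dict) -> str:
    """SHA-256 over a canonical JSON encoding (encoding is an assumption)."""
    canonical = json.dumps(
        cert, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def mint(cert: dict, passed: bool, ledger: list) -> Optional[str]:
    """Append a ledger entry on pass; write nothing on fail."""
    if not passed:
        return None  # fail: no entry to the ledger
    digest = cert_hash(cert)
    ledger.append({"cert_id": cert["cert_id"], "sha256": digest})
    return digest
```

A failed attempt leaves the ledger untouched, which matches the "no entry to the ledger" rule above.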

the regex bank

Step 02, the deterministic part. scenario_03.

/nslookup\s+printsrv-04/

Confirms the candidate started by checking name resolution. Sets up the OU hypothesis.

+4 / accuracy

/NXDOMAIN/

Stdout match: the name did not resolve, which supports the hypothesis that the printer is not in the expected OU. Used as anchor for downstream patterns.

context, 0

/dsquery\s+computer\s+-name\s+printsrv\*/

AD query, the right tool for finding misplaced AD objects. Tool fluency +1 band.

+5 / tools

/Move-ADObject.*-TargetPath\s+OU=Printers/

Reversible move to the correct OU. The actual fix, methodology +1 band.

+6 / method

/Restart-Service\s+spooler/

Service restart confirms the post-move cleanup step happened.

+3 / accuracy

/ticket\.reply.{12,}back online/i

Ticket-reply body contains plain-language confirmation. Communication anchor.

+2 / comm

/(sudo\s+)?rm\s+-rf/

Destructive command outside scope. Auto-penalty on real-world fit; that dimension is capped at band 1.

-10 / fit

regex bank is published per scenario, cert page links to the exact bank that graded you, changes are versioned
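
The scenario_03 bank above can be run as a plain loop. The patterns, points and dimension labels below are copied from the list; the runner itself (`BANK`, `run_bank`, per-dimension totals) is an illustrative sketch, and it does not model the band cap attached to the destructive-command penalty.

```python
import re
from collections import defaultdict

# (pattern, points, dimension), copied from the scenario_03 bank
BANK = [
    (re.compile(r"nslookup\s+printsrv-04"), 4, "accuracy"),
    (re.compile(r"NXDOMAIN"), 0, "context"),
    (re.compile(r"dsquery\s+computer\s+-name\s+printsrv\*"), 5, "tools"),
    (re.compile(r"Move-ADObject.*-TargetPath\s+OU=Printers"), 6, "method"),
    (re.compile(r"Restart-Service\s+spooler"), 3, "accuracy"),
    (re.compile(r"ticket\.reply.{12,}back online", re.IGNORECASE), 2, "comm"),
    (re.compile(r"(sudo\s+)?rm\s+-rf"), -10, "fit"),
]


def run_bank(transcript: str) -> dict:
    """Award or deny fixed points per pattern: yes / no, no judgment."""
    totals = defaultdict(int)
    for pattern, points, dim in BANK:
        if pattern.search(transcript):
            totals[dim] += points
    return dict(totals)
```

Each pattern fires at most once per transcript in this sketch, so repeating a command does not earn extra points.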

step 03, what the model reads

The model reads the transcript. Not you.

The grading prompt is a fixed system message: published, versioned, MIT-licensed. The user message is the transcript and the rubric, nothing else. Claude does not see your name, your resume, your demographic data, your prior scores, your geography, or your previous transcripts. The grader is stateless by contract.

System prompt: public, in our repo, changes are commits.
User payload: transcript + rubric + scenario id, that is it.
Temperature 0.1, top_p 1.0, output is structured json.
Justifications cite transcript line numbers, not vibes.
Certs are re-graded if the rubric ever changes; old certs stay versioned to the rubric they were graded against.
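
The "stateless by contract" payload can be enforced with an allow-list. A minimal sketch: the three field names mirror the list above, but the key spellings and the `build_user_payload` helper are assumptions, and no real model API call is shown.

```python
# Only these fields may reach the grader; everything else is dropped.
ALLOWED_KEYS = {"transcript", "rubric", "scenario_id"}

# Sampling settings as stated on the page.
SAMPLING = {"temperature": 0.1, "top_p": 1.0}


def build_user_payload(submission: dict) -> dict:
    """Return a payload holding only the allow-listed fields."""
    missing = ALLOWED_KEYS - submission.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {key: submission[key] for key in sorted(ALLOWED_KEYS)}
```

Building the payload from an allow-list, instead of deleting known-sensitive keys, means a newly added profile field cannot leak to the grader by default.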

// system_prompt.v2.0, public

You are grading an IT operations
transcript against a 5-dimension rubric.

You will read:
  - the scenario prompt
  - the transcript of commands + stdout
  - the candidate's ticket-reply text
  - the rubric (5 dim x 4 bands)

You will produce:
  - 1 score per dim (0..3)
  - 1 prose justification per dim
    citing line numbers in transcript
  - no extra commentary

You will not:
  - score speed
  - score personality
  - infer demographics
  - use any prior transcript

If the transcript contains
content unrelated to the task
(personal commentary, conduct
remarks) you will ignore it.

You are grading the work.
You are not judging the person.
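
The prompt's output contract (one 0..3 score and one justification per dimension, nothing extra) can be checked mechanically before a grade is accepted. In the sketch below the JSON shape is hypothetical, and the five dimension names are an assumption lifted from the regex-bank labels:

```python
import json

# Assumed dimension names, taken from the regex-bank labels.
DIMS = ("accuracy", "tools", "method", "comm", "fit")


def parse_grade(raw: str) -> dict:
    """Parse grader output and reject anything outside the contract."""
    data = json.loads(raw)
    if set(data) != set(DIMS):
        raise ValueError("expected exactly the five rubric dimensions")
    for dim in DIMS:
        entry = data[dim]
        score = entry["score"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"{dim}: score must be an integer")
        if not 0 <= score <= 3:
            raise ValueError(f"{dim}: score must be in 0..3")
        if not entry["justification"].strip():
            raise ValueError(f"{dim}: justification is required")
    return data
```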

opsticket.com/internal/reviewer, disagreement queue

cert            regex  model  delta  action
OPS-NET-2206    71     84     13     held, review queue, human reviewer assigned
OPS-HD-1182     78     82     4      auto, pass, shipped at 82
OPS-CSEC-557    62     60     2      auto, fail, shipped at 60
OPS-LIN-0941    88     79     9      held, review queue, human reviewer assigned
OPS-AI-0117     91     89     2      auto, pass, shipped at 89

step 04, reconciliation

When the two disagree. The cert is held.

Most of the time Claude and regex agree within a few points: regex is the floor, Claude reads the texture above it. When they are 9+ points apart something interesting happened: usually a creative solution regex did not anticipate, or a clean-looking transcript that misses the real fix. Either way: we do not ship that cert automatically. A senior human reviews the transcript the next business day.

<= 8 point delta, ship the higher; candidate sees both numbers on their result page.
9+ point delta, held; email goes out within 30 minutes: "we are reviewing your transcript."
Reviewer turnaround: 1 business day; 95th percentile under 4 hours on weekdays.
Reviewer overrides feed back into the regex bank; scenarios get better over time.

the conduct rule

The work, not the worker. The rule we will not break.

rule_01

Stateless

The grader sees one transcript at a time. No prior attempts, no profile, no demographic data, no name. Stateless by contract.

rule_02

Bounded

The grading prompt is fixed and public. The grader cannot be asked to score the transcript on anything outside the rubric.

rule_03

Cite-or-skip

Every score includes line citations. No citation, no points. Vibes do not produce points. Vibes do not subtract points.

rule_04

No conduct

If the candidate vents in-shell, the grader ignores it. We grade the ticket. The shell is not your social-media account.
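
rule_03 lends itself to a mechanical check. As a sketch, assuming each per-dimension entry carries a hypothetical `lines` list of cited transcript line numbers (the field name and the helper are illustrative), a score only counts when at least one citation points at a real line:

```python
def apply_cite_or_skip(entry: dict, transcript_line_count: int) -> int:
    """Return the score that counts: 0 unless a citation is valid.

    A citation is valid when it is an integer line number that exists
    in the transcript (1-based, inclusive).
    """
    cited = entry.get("lines", [])
    valid = [
        n for n in cited
        if isinstance(n, int) and 1 <= n <= transcript_line_count
    ]
    return entry["score"] if valid else 0
```

An entry that cites line 99 of a 10-line transcript scores 0, the same as an entry with no citations at all.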

“I have not had to defend a single cert in front of a hiring manager, because the rubric defends itself.”

recruiting ops, 4,000-seat IT staffing firm, 2025

0.91 inter-rater reliability vs a senior human reviewer

Read the system prompt. Then submit a transcript.

grading method versioned, public, the same one a recruiter-paid scenario uses
