Day 53 - June 23, 2026: M4 Serialize/Restore and the PR Review Feedback Loop
A Day 53 reflection on haomiantiao M4 serialize/restore work, canonical bracket state encoding, PR review skill evidence, and governance lessons from bot-authored handoff artifacts.
Day 53 was a small engine milestone with a much larger governance shadow.
The code work was intentionally narrow: teach the haomiantiao bracket engine
how to serialize and restore its state.
No browser storage. No share URLs. No base64 wrapper. No app-layer metadata. No DOM assumptions. No new dependencies.
Just a pure, framework-free encoding boundary for bracket state.
That was enough.
The more interesting part was what the work revealed around review evidence, bot-authored PR handoffs, and the difference between an assistant producing a useful review and a review process becoming trustworthy.
M4 moved the engine closer to future share-hash and localStorage support. It also gave the review-skill work another concrete case: a PR where the review appeared faithful to the reviewed spec, but still needed human verification before its code-level claims could be treated as true.
That is the shape of the lesson.
Canonical state is not just a serialization problem. Review output has state too. PR metadata has state. Governance artifacts have state. If those things are going to be trusted later, they need to be explicit, replayable, and checked against reality.
Moving The Spec To Decision Mode
The day started with the M4 spec.
At first, the spec still had a draft/open-question shape. It described the serialize/restore direction, but a few decisions still needed to stop being questions and become a reviewed contract.
That transition matters more than it sounds.
When a spec is still asking questions, it is a planning artifact. Once the questions are resolved and the status moves to reviewed, the spec becomes the implementation boundary. It tells the agent what to build, tells the reviewer what to check, and tells future maintainers what was intentionally left out.
The reviewed decisions were deliberately small and sharp.
M4 would encode replayable bracket inputs and decisions rather than the full materialized round tree. It would keep exactly three serialization error codes:
- MALFORMED
- UNSUPPORTED_VERSION
- INVALID_STATE
It would expose only:
- serializeBracket
- deserializeBracket
The proposed tryDeserializeBracket helper was deferred. That was the right
call for this milestone. A result-returning convenience wrapper might be
useful later, but M4 did not need to add another public API shape before the
core boundary existed.
The spec also clarified the most important contract edge: serializeBracket
is total only over reachable BracketState values created by
createBracketState and advanced through pickWinner.
Forged or hand-mutated in-memory states are out of contract.
That sentence does useful work. It avoids pretending the serializer has to be a universal validator for arbitrary objects. The validated boundary is deserialization. If something comes in from a string, it must be parsed, checked, rebuilt, and replayed. If code inside the same process forges an impossible state by bypassing the engine API, that is not M4’s responsibility.
The spec stayed equally narrow on identity.
The serialized form would store only engine-level contestant identity:
- id
- seed
Rich noodle metadata belongs to the later data or app layer. That keeps the engine from learning about presentation concepts before there is a real app surface asking for them.
Chronology was another important decision. Interaction chronology is
intentionally not serialized. The canonical pick order is derived from each
pick's parsed (roundIndex, matchIndex), not from the order in which the user
clicked through the bracket.
That is a quiet but important restraint. The engine needs to restore the same bracket decisions. It does not need to preserve the user’s interaction diary.
Finally, the spec avoided any TypeScript target or lib change for
Error.cause. The implementation could use native cause support if already
available, or a readonly field otherwise. That kept a small engine milestone
from turning into a toolchain milestone.
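The readonly-field option can be sketched roughly as follows. This is a minimal illustration, not the actual engine code: the post names BracketSerializationError and BracketSerializationErrorCode but does not show their shape, so the constructor signature, the field layout, and the parseOrThrow helper are all assumptions.

```typescript
// Hypothetical sketch; the real BracketSerializationError may be shaped differently.
type BracketSerializationErrorCode =
  | "MALFORMED"
  | "UNSUPPORTED_VERSION"
  | "INVALID_STATE";

class BracketSerializationError extends Error {
  readonly code: BracketSerializationErrorCode;
  // A plain readonly field, so the class does not rely on the configured
  // `lib` declaring Error.cause or on the two-argument Error constructor.
  readonly cause: unknown = undefined;

  constructor(code: BracketSerializationErrorCode, message: string, cause?: unknown) {
    super(message);
    this.name = "BracketSerializationError";
    this.code = code;
    this.cause = cause;
  }
}

// Hypothetical helper: wrap a JSON.parse failure without losing the original error.
function parseOrThrow(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new BracketSerializationError("MALFORMED", "payload is not valid JSON", err);
  }
}
```

Callers can branch on code and still inspect cause when debugging, and nothing here requires touching target or lib in tsconfig.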
By the end of that pass, the spec moved to status: reviewed.
That made it ready to become the prompt.
Implementing The Pure Serialize/Restore Boundary
The implementation happened on the bot branch:
feat/engine-serialize-restore
That kept the work off main and on its own branch, which is exactly where this
kind of milestone belongs.
The public surface added to the engine was small:
- serializeBracket(state) => string
- deserializeBracket(string) => BracketState
- BracketSerializationError
- BracketSerializationErrorCode
- SERIALIZED_BRACKET_VERSION
The serialized form is canonical JSON with top-level fields:
{ v, contestants, picks }
That shape is intentionally boring.
v gives the engine a version boundary. contestants records the bracket
inputs. picks records the decisions needed to replay the bracket. The
materialized round tree is not serialized.
That last part is the heart of the milestone.
If the engine can rebuild state by calling createBracketState and replaying
decisions through pickWinner, then restore stays attached to the same rules
as normal gameplay. It does not smuggle in a second state-construction path.
That is the reason replay-based serialization feels stronger than dumping the current tree. The tree is an output of the engine rules. The serialized data should be the minimal input needed to reproduce that output.
The canonical ordering rules also stayed clear:
- contestants are sorted by seed
- picks are sorted by parsed (roundIndex, matchIndex)
That means two equivalent bracket states should produce the same string. It also means the serialized form does not depend on incidental object ordering or click chronology.
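The two ordering rules above can be sketched as a pair of small sort helpers. The post does not show how a pick identifies its match, so the BracketPick shape and the "roundIndex:matchIndex" id format below are assumptions made only for illustration.

```typescript
interface Contestant {
  id: string;
  seed: number;
}

// Hypothetical pick shape; the real field names are not shown in the post.
interface BracketPick {
  matchId: string; // assumed format: "<roundIndex>:<matchIndex>"
  winnerId: string;
}

// Parse the assumed id format into numeric indexes.
function parseMatchId(matchId: string): [number, number] {
  const parts = matchId.split(":");
  return [Number(parts[0]), Number(parts[1])];
}

// Contestants: ascending seed. Copies first so the input is not mutated.
function canonicalContestants(contestants: readonly Contestant[]): Contestant[] {
  return [...contestants].sort((a, b) => a.seed - b.seed);
}

// Picks: ascending roundIndex, then ascending matchIndex, compared as numbers.
function canonicalPicks(picks: readonly BracketPick[]): BracketPick[] {
  return [...picks].sort((a, b) => {
    const [aRound, aMatch] = parseMatchId(a.matchId);
    const [bRound, bMatch] = parseMatchId(b.matchId);
    return aRound - bRound || aMatch - bMatch;
  });
}
```

Comparing parsed numbers rather than raw strings is the point of the word "parsed": as plain strings, "0:10" would sort before "0:2", and the canonical form would depend on how the ids happen to be spelled.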
The implementation was test-first in a very explicit way.
The first commit added failing tests and deliberately wrong stubs. The second commit implemented the behavior. That is the useful version of test-first discipline for this kind of code: make the contract executable before the implementation tries to satisfy it.
Validation passed with:
pnpm validate
Coverage remained above the project floor.
That gave M4 the technical result I wanted: a pure serialize/restore pair for engine state, with no dependencies and no app-layer concerns mixed in.
Why Canonical Replay Matters
Serialize/restore can look like plumbing.
In a small bracket engine, it is tempting to treat it as a convenience feature: turn the object into JSON, then turn JSON back into an object.
That would have been the wrong milestone.
The important question is not “can this object be stringified?” The important question is “what is the smallest stable contract that can reproduce a valid bracket state later?”
M4 answered that by choosing replay.
Store the contestants. Store the decisions. Rebuild the bracket through the same public engine functions that normal code uses.
That choice does a few useful things at once.
It keeps invalid serialized input from becoming trusted engine state. It makes versioning visible. It gives future app code a stable primitive for local storage or share links. It avoids locking the serialized format to whatever the in-memory round tree happens to look like today.
It also keeps the engine honest.
If deserializeBracket can only restore by using createBracketState and
pickWinner, then restore behavior is coupled to the same invariants as
ordinary gameplay. If a future change breaks replay, tests should catch that
as an engine contract problem rather than a UI bug.
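To make the replay idea concrete, here is a toy round trip under stated assumptions. The real haomiantiao types, seeding and pairing rules, pick fields, function signatures, and the precedence between error codes are not shown in this post, so everything below is a stand-in that only reuses the public names.

```typescript
// Toy stand-ins only; none of this is the actual engine implementation.
interface Contestant { id: string; seed: number }
interface Match { a: string | null; b: string | null; winner: string | null }
interface BracketState { contestants: Contestant[]; rounds: Match[][] }
type ErrorCode = "MALFORMED" | "UNSUPPORTED_VERSION" | "INVALID_STATE";
const SERIALIZED_BRACKET_VERSION = 1;

class BracketSerializationError extends Error {
  readonly code: ErrorCode;
  constructor(code: ErrorCode, message: string) {
    super(message);
    this.code = code;
  }
}

function createBracketState(contestants: Contestant[]): BracketState {
  const n = contestants.length;
  if (n < 2 || (n & (n - 1)) !== 0) throw new Error("toy engine needs a power-of-two field");
  const sorted = [...contestants].sort((x, y) => x.seed - y.seed);
  const rounds: Match[][] = [];
  for (let size = n / 2; size >= 1; size /= 2) {
    rounds.push(Array.from({ length: size }, (): Match => ({ a: null, b: null, winner: null })));
  }
  // Toy pairing: adjacent seeds meet in round 0.
  rounds[0].forEach((m, i) => { m.a = sorted[2 * i].id; m.b = sorted[2 * i + 1].id; });
  return { contestants: sorted, rounds };
}

function pickWinner(state: BracketState, round: number, match: number, winnerId: string): BracketState {
  const rounds = state.rounds.map((r) => r.map((m) => ({ ...m })));
  const m = rounds[round]?.[match];
  if (!m || m.a === null || m.b === null || m.winner !== null) throw new Error("match not pickable");
  if (winnerId !== m.a && winnerId !== m.b) throw new Error("winner is not in this match");
  m.winner = winnerId;
  // Advance the winner into the matching slot of the next round, if any.
  const next = rounds[round + 1]?.[match >> 1];
  if (next) {
    if (match % 2 === 0) next.a = winnerId; else next.b = winnerId;
  }
  return { contestants: state.contestants, rounds };
}

function serializeBracket(state: BracketState): string {
  const contestants = [...state.contestants]
    .sort((x, y) => x.seed - y.seed)
    .map(({ id, seed }) => ({ id, seed }));
  const picks: { round: number; match: number; winner: string }[] = [];
  // Walking rounds, then matches, emits picks in (roundIndex, matchIndex) order.
  state.rounds.forEach((r, round) =>
    r.forEach((m, match) => { if (m.winner !== null) picks.push({ round, match, winner: m.winner }); }));
  return JSON.stringify({ v: SERIALIZED_BRACKET_VERSION, contestants, picks });
}

function deserializeBracket(text: string): BracketState {
  let raw: any;
  try { raw = JSON.parse(text); } catch { throw new BracketSerializationError("MALFORMED", "not JSON"); }
  if (raw === null || typeof raw !== "object") throw new BracketSerializationError("MALFORMED", "not an object");
  if (raw.v !== SERIALIZED_BRACKET_VERSION) {
    throw new BracketSerializationError("UNSUPPORTED_VERSION", `unsupported v: ${String(raw.v)}`);
  }
  const okShape = Array.isArray(raw.contestants) && Array.isArray(raw.picks)
    && raw.contestants.every((c: any) => typeof c?.id === "string" && typeof c?.seed === "number")
    && raw.picks.every((p: any) =>
      Number.isInteger(p?.round) && Number.isInteger(p?.match) && typeof p?.winner === "string");
  if (!okShape) throw new BracketSerializationError("MALFORMED", "unexpected shape");
  try {
    // Same construction path as normal play: create, then replay each decision.
    let state = createBracketState(raw.contestants);
    for (const p of raw.picks) state = pickWinner(state, p.round, p.match, p.winner);
    return state;
  } catch {
    throw new BracketSerializationError("INVALID_STATE", "replay rejected the payload");
  }
}
```

In this sketch a tampered payload that names a winner who never played in that match is rejected by pickWinner itself, not by a second validator written just for restore. That is the coupling the replay model is meant to buy.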
This is why the milestone felt bigger than the amount of code.
M4 did not ship a visible feature, but it created the state boundary that future visible features will depend on.
Reviewing PR #33
After implementation, PR #33 became the next evaluation point for the pull-request-review skill.
The PR title was:
feat(engine): serialize and restore a bracket state (M4)
This mattered because the previous real review-skill run had produced a useful review while also fabricating a spec citation. That failure mode needed to stay in view. A review skill does not become trustworthy because one review sounds plausible. It becomes more trustworthy when repeated cases show that it stays grounded in the actual spec and diff.
On the spec axis, this review looked much stronger.
It correctly mapped the implementation back to the reviewed M4 spec. It
identified the expected spec scenarios. It described the three error codes:
MALFORMED, UNSUPPORTED_VERSION, and INVALID_STATE. It described
replay-based restore and canonical ordering. It also noted the Error.cause
decision without demanding a TypeScript target or lib change.
That is exactly the kind of review assistance I want from a draft skill.
Not “the tool approves this, therefore it is approved.”
More like: “the tool can help organize the comparison between spec, diff, and tests, then a human still verifies the claims.”
That human verification point remained necessary because the review also made
code-level claims. It cited details such as serialize.ts:99, described
double-sorting behavior, and referenced commit structure.
Those claims might be correct. They might be useful. But they are not proven just because the review text says them confidently.
The review appeared faithful on the spec axis. The code-level citations still needed a human diff check.
That distinction is the whole point.
Turning Review Behavior Into Evals
The review-skill work then moved from “what did this review say?” to “how do we preserve what this case teaches?”
I drafted a new evaluation case:
.claude/skills/reviewing-pull-requests/evals/cases/0002-faithful-findings-m4.md
The case captures PR #33 as a provisional positive example: faithful findings against a reviewed spec, pending verification of code-level citations.
The important part is that the case is not merely “the review said LGTM.”
That would be too shallow.
The real signal is that this review avoided the earlier failure mode. It did
not appear to invent a spec requirement. It aligned the review around the
actual M4 decisions: replay encoding, three error codes, no
tryDeserializeBracket, engine-level identity only, canonical ordering, and
no toolchain bump for Error.cause.
That makes it useful evidence.
Not final proof. Evidence.
The case should only graduate to a clean pass after the human diff check confirms that the code-level details are real. Until then, it remains a provisional positive case.
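The graduation rule can be written down as a small status function. The eval cases themselves are markdown files whose format is not shown here, so the record shape, field names, and the three status values below are hypothetical and only model the rule described above.

```typescript
// Hypothetical model of an eval case's evidence; not the real case format.
type ClaimCheck = "unchecked" | "confirmed" | "refuted";

interface CodeClaim {
  text: string;      // a code-level claim made by the review, e.g. a file:line citation
  check: ClaimCheck; // updated only after a human compares the claim with the diff
}

type CaseStatus = "provisional" | "pass" | "fail";

function caseStatus(specFaithful: boolean, claims: readonly CodeClaim[]): CaseStatus {
  // A review that misstates the spec, or any refuted claim, fails outright.
  if (!specFaithful || claims.some((c) => c.check === "refuted")) return "fail";
  // Anything still unchecked keeps the case provisional.
  if (claims.some((c) => c.check === "unchecked")) return "provisional";
  return "pass";
}
```

Under this model, case 0002 would sit at "provisional" for as long as a citation such as serialize.ts:99 is still "unchecked", however confident the review text sounds.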
That is the posture I want for review tooling: evidence-building instead of victory laps.
The Empty PR Body Problem
PR #33 also exposed a separate governance issue.
The title was good. The body was not.
The actual PR body was effectively empty or malformed:
@-
That would be a problem on its own, but the more interesting failure was how the review process handled it.
The review-skill output synthesized its own “Description” from the diff. That summary was useful as a reviewer aid, but it also masked the fact that the actual PR description was missing.
Those are not the same thing.
A PR’s stated description is part of the handoff artifact. It tells reviewers what the author claims changed, why it changed, and how to evaluate it. A reviewer’s synthesized summary is a separate artifact. It can help orient the review, but it should not quietly replace a missing PR body.
That led to another evaluation case draft:
.claude/skills/reviewing-pull-requests/evals/cases/0003-empty-pr-body-masked.md
The lesson is direct: review tooling should distinguish between the PR’s actual stated description and the reviewer’s synthesized description.
If the PR body is missing or malformed, the review should say so.
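One way a review tool could keep the two descriptions apart is sketched below. It is an illustration under assumptions, not the review skill's actual behavior: the type names, the function names, and the length threshold are all invented for the example.

```typescript
type BodyStatus = "present" | "empty" | "suspect";

// Arbitrary illustrative threshold, not a project rule.
const MIN_BODY_LENGTH = 20;

function classifyPrBody(body: string | null | undefined): BodyStatus {
  const trimmed = (body ?? "").trim();
  if (trimmed.length === 0) return "empty";
  return trimmed.length < MIN_BODY_LENGTH ? "suspect" : "present";
}

interface ReviewHeader {
  statedDescription: string | null; // verbatim PR body, or null when unusable
  bodyStatus: BodyStatus;
  synthesizedSummary: string;       // reviewer-generated, never a substitute for the body
  warnings: string[];
}

function buildReviewHeader(
  body: string | null | undefined,
  synthesizedSummary: string,
): ReviewHeader {
  const bodyStatus = classifyPrBody(body);
  // Surface the problem explicitly instead of letting the summary cover for it.
  const warnings = bodyStatus === "present"
    ? []
    : [`PR body is ${bodyStatus} (${JSON.stringify(body ?? "")}); the summary below is reviewer-generated.`];
  return {
    statedDescription: bodyStatus === "present" ? (body ?? null) : null,
    bodyStatus,
    synthesizedSummary,
    warnings,
  };
}
```

With a body of @-, this sketch reports the status as "suspect", leaves statedDescription as null, and quotes the raw value in a warning, so the synthesized summary cannot be mistaken for the author's handoff.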
This is especially important for bot-authored PRs. The neibaur-ai-bot-1
setup correctly handled attribution and labels. PR #33 had the expected
ai-assisted label and bot attribution. That part of the governance path
worked.
But the malformed body showed that metadata still needs verification.
Bot identity and labels are not enough. The handoff has to be readable too.
Governance Artifacts Are Products
The day kept circling back to the same idea: governance artifacts are not just notes around the work. They are part of the product surface for future work.
The M4 spec was a product. It changed from draft questions into reviewed decisions.
The serializer was a product. It created a canonical state boundary.
The PR review was a product. It produced structured claims that needed to be checked.
The eval cases were products. They converted review behavior into testable evidence.
The PR body was a product. Its malformed content created a handoff failure.
That framing may sound heavy for a small bracket-engine change, but it is practical. Agent-assisted development creates more artifacts, faster. Specs, diffs, review summaries, labels, branch names, PR bodies, generated evals, and audit notes can all look polished even when some piece of them is wrong or missing.
The answer is not to distrust everything forever.
The answer is to make trust incremental.
A reviewed spec earns more authority than a chat prompt. A passing validation
run earns more confidence than an untested patch. A review skill with evals
earns more confidence than a review skill with only a nice SKILL.md. A
bot-authored PR with correct labels and a real description earns more trust
than one with labels but a malformed body.
Trust should come from artifacts that survive checking.
Why The Day Mattered
Day 53 mattered because M4 put a durable state boundary underneath future features.
The bracket engine can now express its state as canonical JSON and rebuild it through the same public rules that normal gameplay uses. That is the right foundation for later share-hash and localStorage work.
But the day also mattered because the surrounding process became more testable.
The spec moved to reviewed status. The implementation followed the spec without widening the milestone. Validation passed. The review skill produced a stronger spec-faithful review than the prior run. The eval suite gained a provisional positive case and a new failure-mode case around empty PR bodies.
That is a useful pattern.
Code passes checks. Review output becomes evidence. Governance gaps become eval cases. Bot metadata gets verified instead of assumed.
The human control point stays in the loop.
That is the part I want to keep repeating. The goal is not to make review skills sound more authoritative. The goal is to make the whole system better at showing what has actually been checked and what still needs human judgment.
Outcome
Day 53 moved haomiantiao through the M4 serialize/restore milestone.
I started by finalizing the M4 spec, converting it from draft/open-question
mode into reviewed decision mode. The reviewed contract accepted the
replay/decisions encoding model, kept exactly three serialization error codes,
deferred tryDeserializeBracket, clarified the reachable-state boundary for
serializeBracket, treated forged in-memory states as out of contract, stored
only engine-level contestant id and seed, left rich noodle metadata to a
later layer, made interaction chronology intentionally non-serialized, and
avoided any TypeScript target or lib change for Error.cause.
The bot implementation on feat/engine-serialize-restore added a pure
serialize/restore pair for bracket state: serializeBracket,
deserializeBracket, BracketSerializationError,
BracketSerializationErrorCode, and SERIALIZED_BRACKET_VERSION. The
serialized form is canonical JSON with { v, contestants, picks }.
Contestants are sorted by seed, picks are sorted by parsed
(roundIndex, matchIndex), and restore rebuilds through createBracketState
before replaying decisions through pickWinner.
The implementation stayed test-first. The first commit added failing tests and
deliberately wrong stubs. The second commit implemented the behavior.
pnpm validate passed, coverage stayed above the project floor, and no
dependencies, storage behavior, URL behavior, base64 encoding, DOM access, or
app-layer metadata concerns were added.
I then reviewed PR #33,
feat(engine): serialize and restore a bracket state (M4), and used it as a
second evaluation point for the draft PR-review skill. The review appeared
faithful on the spec axis, correctly describing the expected scenarios, three
error codes, replay-based restore, canonical ordering, and Error.cause
decision. I still treated code-level claims as needing human verification.
Finally, I drafted two review-skill eval cases. Case 0002 captures PR #33 as a
provisional positive example for faithful findings against a reviewed spec.
Case 0003 captures the empty or malformed PR body problem, where the actual PR
body was @- but the review output synthesized its own description and risked
masking the missing handoff artifact.
The day ended with M4 technically complete and the review/governance loop more explicit about what it can and cannot prove.
Definition Of Done
Day 53 reached a serialize/restore and review-evidence checkpoint:
- finalized the M4 serialize/restore spec
- moved the M4 spec from draft/open-question mode to status: reviewed
- accepted the replay/decisions encoding model
- kept exactly three serialization error codes: MALFORMED, UNSUPPORTED_VERSION, and INVALID_STATE
- deferred tryDeserializeBracket
- kept the M4 public API to serializeBracket and deserializeBracket
- clarified that serializeBracket is total only over reachable BracketState values created by createBracketState and advanced with pickWinner
- treated forged or hand-mutated in-memory states as out of contract
- serialized only engine-level contestant identity: id and seed
- left rich noodle metadata to the later data or app layer
- made interaction chronology intentionally non-serialized
- used canonical pick order by parsed (roundIndex, matchIndex)
- avoided any TypeScript target or lib change for Error.cause
- implemented M4 on the bot branch feat/engine-serialize-restore
- added serializeBracket(state) => string
- added deserializeBracket(string) => BracketState
- added BracketSerializationError
- added BracketSerializationErrorCode
- added SERIALIZED_BRACKET_VERSION
- used canonical JSON with top-level { v, contestants, picks }
- sorted contestants by seed
- sorted picks by parsed (roundIndex, matchIndex)
- stored bracket inputs and decisions instead of the materialized round tree
- rebuilt restored state through createBracketState
- replayed restored decisions through pickWinner
- kept deserialization as the validated boundary
- followed a test-first implementation sequence
- added failing tests and deliberately wrong stubs before implementation
- passed pnpm validate
- kept coverage above the project floor
- avoided dependencies, storage, URL, base64, DOM, and app-layer concerns
- confirmed PR #33 had the expected ai-assisted label and bot attribution to neibaur-ai-bot-1
- reviewed PR #33, feat(engine): serialize and restore a bracket state (M4)
- observed that the PR-review skill correctly mapped the implementation back to the reviewed M4 spec
- treated the review as useful assistance rather than human-review discharge
- noted that code-level review claims still required human diff verification
- drafted eval case 0002 for faithful M4 findings against a reviewed spec
- kept eval case 0002 provisional until cited code details are verified
- discovered the malformed PR body value @-
- identified that the review-skill output masked the missing PR description by synthesizing its own description
- drafted eval case 0003 for the empty PR body masking failure mode
- reinforced that review tooling should distinguish between a PR’s stated description and a reviewer-generated summary
- kept human review as the control point for code claims, PR metadata, and governance artifacts