In May 2026 the SEC issued a proposal that would allow public companies to switch from quarterly (Form 10-Q) to semi-annual financial reporting (a new Form 10-S). From the date of publication, the public has 60 days to comment on the proposal.
This site was produced by Professor Tzachi Zach at The Ohio State University Fisher College of Business as a public service to track the comment letters as they arrive, classify their positions, and surface patterns in the docket. A number of methodological decisions had to be made along the way: how to classify stance, how to bucket commenters by entity type, and how to handle hedged or conditional letters. Those decisions are explained below.
Beyond the tracker itself, the project has two other goals. First, I hope it will encourage discussion among accounting academics about the proposal. The economic-analysis questions the proposal raises (short-termism, fraud risk, compliance burden, retail-investor protection) are central to what we study, and the comment period is a good moment to bring our expertise to bear. Second, I want to test how well language-model classifiers handle regulatory text. With a few hundred letters and fast iteration, can we converge on classifier design choices that scale to future research? The site will be most relevant to accounting academics, auditors, preparers, financial analysts, IR professionals, and anyone who follows the SEC docket professionally.
Comments, suggestions, and corrections are welcome: send feedback here or email zach.7@osu.edu.
Each letter card in the table below carries an agreement badge (Unanimous / 2-of-3) and shows all three rater calls on hover.
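The badge logic can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function name `majority_label` is hypothetical, and the handling of a three-way split (no majority) is an assumption, since the page only describes the Unanimous and 2-of-3 cases.

```python
from collections import Counter


def majority_label(calls):
    """Return (label, badge) for exactly three rater calls.

    Hypothetical helper: the three-way-split branch is an assumption,
    because the page only mentions Unanimous and 2-of-3 badges.
    """
    if len(calls) != 3:
        raise ValueError("expected exactly three rater calls")
    # most_common(1) gives the label with the highest vote count.
    label, votes = Counter(calls).most_common(1)[0]
    if votes == 3:
        return label, "Unanimous"
    if votes == 2:
        return label, "2-of-3"
    # All three raters disagree: no majority label exists.
    return None, "No majority"


if __name__ == "__main__":
    print(majority_label(["Support", "Support", "Conditional"]))
```

Returning the badge alongside the label keeps the agreement level attached to each letter, which is what the hover display needs.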
I then ran the same 3 rubric prompts through ChatGPT-5.5 as an independent second ensemble. The question I wanted to answer: would the stance calls hold up under a different model family? Carlson & Burbano (SMJ 2026) recommend this kind of cross-model check where feasible.
GPT-Majority vs Claude-Majority: 132 / 137 = 96.4% raw, Cohen's κ = 0.886. Substantial cross-model agreement on the aggregated label.
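For readers who want to reproduce this kind of comparison, here is a minimal sketch of raw agreement and Cohen's κ for two label sequences. The function names and the toy labels are illustrative assumptions; the letter-level labels behind the 132 / 137 figure are not reproduced here.

```python
from collections import Counter


def raw_agreement(a, b):
    """Share of items on which the two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two marginal shares, summed over labels.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels only (S = Support, O = Oppose, C = Conditional).
    claude = ["S", "S", "O", "C"]
    gpt = ["S", "O", "O", "C"]
    print(raw_agreement(claude, gpt), cohens_kappa(claude, gpt))
```

The correction for chance matters here because one stance can dominate the docket, which inflates raw agreement even when two classifiers are only loosely aligned.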
Per-rubric agreement varies:
The Skeptic divergence reflects a rubric-conditioning effect. The same "default to Conditional unless unambiguous" instruction yielded 83 Conditional calls in GPT-5.5 versus 36 in Claude Opus 4.7. Same prompt, different operationalization across model families. Aggregate agreement on the majority vote holds; per-rubric agreement is more model-dependent.
5 letters fall outside the cross-model majority match: #2 Fardeen Irani, #13 Skyler Mathis, #43 Steven A. Collazo, #80 Bayo Olabisi, #122 Tal Madison. All 5 move from Claude's Support or Oppose call to GPT's Conditional call. 4 of the 5 already had at least one Claude rater calling Conditional, so the cross-model disagreement concentrates on the hedge-boundary letters that Claude's own ensemble was already split on.
Substantially higher agreement than the stance ensemble (κ = 0.514). Majority headline distribution: Individual 134 / Accountant (CPA) 10 / Investment professional 10 / Issuer-current 9 / Industry practitioner 8 / Issuer-former 5 / Academic researcher 4 / Student 2.
The same 3 rubric prompts ran through ChatGPT-5.5 as an independent second ensemble.
GPT-Majority vs Claude-Majority: 114 / 137 = 83.2%, Cohen's κ = 0.621. Moderate cross-model agreement on the aggregated label.
Per-rubric agreement:
The pattern is systematic. GPT-5.5 has a stronger "Individual" prior than Claude Opus 4.7 across all three rubrics. The biggest split is on writers who sign as "CFO, ACME Corp" or similar institutional role but write in a personal register engaging investor-protection concerns rather than issuer-specific concerns: GPT-Primary classifies as Individual; Claude-Primary classifies as Issuer-current. Both readings are defensible under the rubric. The rubric requires a "speaking-as" judgment, and the two model families weight surface role vs register differently.
Intra-model Fleiss κ is 0.880 for Claude and 0.775 for GPT. Within-model agreement holds for both ensembles; the divergence is across model families.
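Fleiss' κ generalizes chance-corrected agreement to more than two raters. The sketch below is a plain-Python version under the assumption that every letter receives the same number of ratings (three per model here). The function name and toy data are illustrative, not the project's code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of rater labels.

    Assumes every item was rated by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_sum += agreeing / (n_raters * (n_raters - 1))

    p_bar = per_item_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled category shares.
    p_e = sum((c / total) ** 2 for c in category_totals.values())
    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    toy = [["A", "A", "A"], ["A", "A", "B"], ["B", "B", "B"], ["A", "B", "B"]]
    print(fleiss_kappa(toy))
```

Running it once per model family on that model's three rater columns gives the within-model figure. The cross-model comparison instead uses Cohen's κ on the two majority labels.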
23 letters fall outside the cross-model majority match. 18 of 23 flow into GPT-Individual from a more specific Claude bucket. 6 of these 23 are already on the contested-letters list internal to Claude's own three-rater ensemble.
Anti-proposal codes use a red shade and pro-proposal codes a green shade, so the directional balance reads at a glance. Every SEC-engaged code carries a verbatim quote from the proposing release. The quote shows how the SEC itself frames the argument.
SG (signaling) and OV (option value) sit at floor κ. Virtually no commenter cites these codes, so the floor reflects how rare they are in the data.
The same 3 rubric prompts ran through ChatGPT-5.5 as an independent second ensemble.
GPT-Majority vs Claude-Majority: mean per-code Cohen's κ = 0.477. Moderate cross-model agreement, lower than the stance ensemble (κ = 0.886) and the entity ensemble (κ = 0.621). The ranking is consistent with rationale being the most inferential and multi-label of the three ensembles. Set-level exact match (GPT majority code set identical to Claude majority code set): 34 of 137 letters, 24.8%. Mean Jaccard similarity between the two majority sets: 0.52.
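The two set-level measures can be sketched as follows. The helper names are hypothetical, the code sets are toy examples built from code labels used on this page, and treating two empty sets as a perfect match (Jaccard = 1) is an assumption, since the page does not say how letters with no majority codes are scored.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two code sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty code sets count as identical
    return len(a & b) / len(a | b)


def compare_code_sets(pairs):
    """Return (exact-match rate, mean Jaccard) over (claude, gpt) set pairs."""
    if not pairs:
        raise ValueError("need at least one pair of code sets")
    exact = sum(set(a) == set(b) for a, b in pairs)
    mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return exact / len(pairs), mean_jaccard


if __name__ == "__main__":
    toy_pairs = [
        ({"MF", "CMP"}, {"MF", "CMP"}),  # identical sets
        ({"FR"}, {"FR", "EX", "IP"}),    # GPT adds two extra codes
        ({"LE"}, {"MF"}),                # no overlap
        (set(), set()),                  # neither majority fires a code
    ]
    print(compare_code_sets(toy_pairs))
```

Exact match is strict: one extra code on either side counts as a miss, while Jaccard gives partial credit. That difference is why the exact-match rate can be much lower than the mean Jaccard for the same letters.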
Per-rubric mean κ across the 20 codes:
Surface-readable codes converge across model families: MF (κ = 0.79), CMP (κ = 0.76), FR (κ = 0.70), LE (κ = 0.70), NR (κ = 0.60). Inferential or umbrella codes diverge: EX (κ = 0.20), IP (κ = 0.32), OP (κ = 0.32), AU (κ = 0.39).
The rubric-conditioning pattern visible in the stance and entity ensembles shows up again. GPT-Inclusive fires 4.50 codes per letter; Claude-Inclusive fires 2.84. Same "code whenever plausibly invoked" instruction, very different operationalization. Claude's three rationale raters stay within a 25% spread of each other (2.25 to 2.84 codes per letter); GPT's three span nearly 2x (2.44 to 4.50). The 3-rater majority κ (0.477) is higher than two of the three matched-rubric κs, which shows that aggregation dampens cross-model variance just as it does within-model.
Letters grouped by who submitted them, color-coded by stance.
Letters grouped by word count, color-coded by stance.
For each entity type, how its letters distribute across word-count buckets.
Same predictors across all three models (7 entity dummies with Individual as reference, plus log(words+1)). The Logit and LPM share a binary outcome (Support=1 / Oppose=0, Conditional dropped). The ordinal logit uses the full 3-class outcome (Oppose < Conditional < Support). Each cell shows coefficient on top, SE in parentheses below, p-value in italics underneath.
| Variable | Logit Support vs Oppose | Ordinal logit Oppose < Cond. < Support | LPM (OLS) Support vs Oppose, HC1 |
|---|---|---|---|
| Constant | −5.64 *** (1.57) p < 0.001 | — (cutpoints below) | −0.032 (0.035) p = 0.359 |
| Accountant CPA | +1.39 (1.15) p = 0.229 | +1.39 * (0.72) p = 0.055 | +0.065 (0.088) p = 0.458 |
| Issuer-current | +1.85 (1.25) p = 0.140 | +2.02 *** (0.67) p = 0.002 | +0.160 (0.191) p = 0.403 |
| Issuer-former | separated | +0.67 (1.13) p = 0.553 | separated |
| Investment prof. | separated | separated | separated |
| Academic | +2.17 (1.84) p = 0.239 | +2.16 ** (0.93) p = 0.020 | +0.425 (0.349) p = 0.224 |
| Industry pract. | separated | −0.56 (1.14) p = 0.624 | separated |
| Student | +3.49 ** (1.54) p = 0.023 | +3.21 * (1.92) p = 0.094 | +0.466 (0.373) p = 0.211 |
| log(words+1) | +0.46 (0.33) p = 0.166 | +0.61 *** (0.18) p = 0.001 | +0.014 (0.011) p = 0.174 |
| N | 249 | 273 | 249 |
| Fit | McFadden R² = 0.196 | McFadden R² = 0.212 | R² = 0.132 (adj. 0.104) |
| Log-likelihood | −31.11 | −94.52 | (OLS) |
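The predictor coding described above, and a rough check on the reported p-values, can be sketched in plain Python. Several things here are assumptions rather than the author's code: the names `design_row` and `wald_p_value`, the order of the dummies, the use of the natural logarithm, and the reading of the logit p-values as two-sided Wald z-tests.

```python
import math

REFERENCE = "Individual"  # omitted category, absorbed into the constant
ENTITY_DUMMIES = [        # assumed order; labels follow the entity buckets
    "Accountant (CPA)",
    "Issuer-current",
    "Issuer-former",
    "Investment professional",
    "Academic researcher",
    "Industry practitioner",
    "Student",
]


def design_row(entity, words):
    """Build [constant, 7 entity dummies, log(words + 1)] for one letter.

    The ordinal logit would drop the leading constant, because its
    cutpoints play that role.
    """
    if entity != REFERENCE and entity not in ENTITY_DUMMIES:
        raise ValueError(f"unknown entity type: {entity!r}")
    dummies = [1.0 if entity == name else 0.0 for name in ENTITY_DUMMIES]
    return [1.0] + dummies + [math.log(words + 1)]


def wald_p_value(coef, se):
    """Two-sided p-value for z = coef / se under a standard normal."""
    z = coef / se
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    print(design_row("Student", 99))
    # Student row, logit column: coefficient +3.49, SE 1.54.
    print(round(wald_p_value(3.49, 1.54), 3))
```

Under that Wald reading, the Student coefficient in the logit column gives a p-value close to the tabulated 0.023. The LPM column may rest on a t distribution with HC1 standard errors, so small differences there would be expected.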
Each letter can invoke zero or more argument families (multi-label). The 20-code taxonomy is anchored on the SEC's proposing release; see the argument taxonomy for code definitions and verbatim SEC quotes.
Same rationales, stacked by the stance of the letter that cited them. Hover any code pill on the y-axis for a short explanation.
| # | Date | Commenter | Role | Stance | Words | Rationales |
|---|---|---|---|---|---|---|