SEC Semi-Annual Reporting Proposal Tracker (S7-2026-15)

In May 2026 the SEC issued a proposal that would allow public companies to switch from quarterly (Form 10-Q) to semi-annual financial reporting (a new Form 10-S). From the date of publication, the public has 60 days to comment on the proposal.

This site was produced by Professor Tzachi Zach at The Ohio State University Fisher College of Business as a public service to track the comment letters as they arrive, classify their positions, and surface patterns in the docket. A number of methodological decisions had to be made along the way: how to classify stance, how to bucket commenters by entity type, and how to handle hedged or conditional letters. Those decisions are explained below.

Beyond the tracker itself, the project has two other goals. First, I hope it will encourage discussion among accounting academics about the proposal. The economic-analysis questions the proposal raises (short-termism, fraud risk, compliance burden, retail-investor protection) are central to what we study, and the comment period is a good moment to bring our expertise to bear. Second, the docket offers a test of how well language-model classifiers handle regulatory text. With a few hundred letters and fast iteration, can we converge on classifier design choices that scale to future research? The site will be most relevant to accounting academics, auditors, preparers, financial analysts, IR professionals, and anyone who follows the SEC docket professionally.

Comments, suggestions, or corrections are welcome: send feedback here or email zach.7@osu.edu.


Thanks
Thanks to Mert Erinc and Brian Monsen for helpful comments and suggestions on the classification methodology.
How stances are classified (three-rater LLM ensemble)
I had three Claude raters classify each letter independently. The headline stance shown here is the majority vote across the three. The LLM-annotation literature (Carlson & Burbano, SMJ 2026; Liu, CHB 2026) recommends multi-prompt validation over a single classifier call, and that is what this design does.
Rater 1 — Primary
Balanced read of the whole letter. Assigns Support / Oppose / Conditional based on the dominant position.
Rater 2 — Literalist
Defaults to Oppose. Flips to Support only on explicit, unconditional endorsement language ("I support…", "I urge the Commission to adopt…"). Conditional fires only on an explicit "if X then yes" structure.
Rater 3 — Skeptic
Defaults to Conditional unless the letter is unambiguous. Any qualification, concession, or alternative proposal flips the call to Conditional.
Agreement across the three raters (N = 273):
  • Unanimous: 235 (86.1%)
  • 2-of-3 majority: 38 (13.9%)
  • Fleiss' κ: 0.600 · pairwise Cohen's κ: Primary–Literalist 0.68, Primary–Skeptic 0.67, Literalist–Skeptic 0.48
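For readers who want to reproduce this kind of statistic, here is a minimal sketch of Fleiss' κ for a fixed number of raters per letter. The function name and the four toy rows are hypothetical; they are not the site's code or data.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-letter label lists.

    Each inner list holds one label per rater, and every letter must
    have the same number of raters. Undefined (division by zero) when
    every rating in the data is the same label.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    per_item_agreement = 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Share of rater pairs that agree on this letter.
        per_item_agreement += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))
    p_bar = per_item_agreement / n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical calls in rater order Primary, Literalist, Skeptic.
toy = [
    ["Oppose", "Oppose", "Oppose"],
    ["Oppose", "Oppose", "Conditional"],
    ["Support", "Support", "Support"],
    ["Conditional", "Conditional", "Oppose"],
]
print(round(fleiss_kappa(toy), 3))
```

Because chance agreement is computed from the pooled label shares, a docket dominated by one stance can show high raw agreement and still a moderate κ.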

Each letter card in the table below carries an agreement badge (Unanimous / 2-of-3) and shows all three rater calls on hover.
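The majority vote and the agreement badge can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: the function name is hypothetical, and the handling of a three-way split (no headline, badge "Split") is an assumption, since the stance counts above show only unanimous and 2-of-3 outcomes.

```python
from collections import Counter

def majority_stance(calls):
    """Return (headline, badge) for exactly three rater calls.

    calls: labels drawn from Support / Oppose / Conditional.
    A three-way split has no majority; returning (None, "Split")
    for it is an assumption of this sketch.
    """
    if len(calls) != 3:
        raise ValueError("expected exactly three rater calls")
    label, votes = Counter(calls).most_common(1)[0]
    if votes == 3:
        return label, "Unanimous"
    if votes == 2:
        return label, "2-of-3"
    return None, "Split"

# Illustrative calls in rater order Primary, Literalist, Skeptic.
print(majority_stance(["Oppose", "Oppose", "Oppose"]))
print(majority_stance(["Oppose", "Oppose", "Conditional"]))
```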

Cross-model validation (ChatGPT-5.5, n = 137 overlap).

I then ran the same 3 rubric prompts through ChatGPT-5.5 as an independent second ensemble. The question I wanted to answer: would the stance calls hold up under a different model family? Carlson & Burbano (SMJ 2026) recommend this kind of cross-model check where feasible.

GPT-Majority vs Claude-Majority: 132 / 137 = 96.4% raw, Cohen's κ = 0.886. Substantial cross-model agreement on the aggregated label.
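Raw agreement and Cohen's κ between two label sequences can be computed as in the sketch below. The function name and the six toy label pairs are hypothetical; only the 132 / 137 ratio comes from the text above.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. both sequences use one and the same label throughout.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("sequences must have equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's own label distribution.
    p_e = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical majority labels for six letters.
claude = ["Oppose", "Oppose", "Oppose", "Conditional", "Support", "Oppose"]
gpt = ["Oppose", "Oppose", "Conditional", "Conditional", "Support", "Oppose"]
print(round(cohens_kappa(claude, gpt), 3))

# Raw agreement share reported above.
print(round(132 / 137 * 100, 1))
```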

Per-rubric agreement varies:

  • GPT-Primary vs Claude-Primary: κ = 0.816 (substantial)
  • GPT-Literalist vs Claude-Literalist: κ = 0.635 (moderate-substantial)
  • GPT-Skeptic vs Claude-Skeptic: κ = 0.400 (moderate)

The Skeptic divergence reflects a rubric-conditioning effect. The same "default to Conditional unless unambiguous" instruction yielded 83 Conditional calls in GPT-5.5 versus 36 in Claude Opus 4.7. Same prompt, different operationalization across model families. Aggregate agreement on the majority vote holds; per-rubric agreement is more model-dependent.

Five letters fall outside the cross-model majority match: #2 Fardeen Irani, #13 Skyler Mathis, #43 Steven A. Collazo, #80 Bayo Olabisi, #122 Tal Madison. In all five, Claude's majority call is Support or Oppose and GPT's is Conditional. Four of the five already had at least one Claude rater calling Conditional, so the cross-model disagreement concentrates on the hedge-boundary letters on which Claude's own ensemble was already split.

Stance label conventions:
Oppose
Author argues against adoption.
Support
Author explicitly endorses the proposal.
Conditional (in-between / mixed)
Author wants modifications or alternatives (e.g. enhanced auditor assurance, monthly revenue disclosure, every-4-months cadence, qualifying-criteria framework). Would not vote yes on the rule as written.
How commenters are classified by entity (three-rater LLM ensemble)
Letters fall into one of eight buckets by who is writing. As with stance, three rubrics classify each letter independently, and the headline bucket is the majority of three. A colleague with FASB comment-letter experience helped refine the taxonomy.
The eight buckets:
  1. Individual — default dump bucket. We use "Individual" (not "Individual investor").
  2. Accountant (CPA) — CPA or chartered accountant credential, speaking from that professional lens.
  3. Issuer / Corporate — current — active corporate role (CFO, audit chair, financial reporting manager, etc.).
  4. Issuer / Corporate — former — retired or former executives writing personally. Plausibly different incentives from current insiders.
  5. Investment professional — active asset managers, hedge fund principals, RIAs, financial advisors.
  6. Academic researcher — university faculty appointment.
  7. Industry practitioner / technologist — non-academic professional roles outside corporate-issuer / investment-firm worlds (CISSP, software developer, IT auditor, compliance professional, etc.).
  8. Student — currently enrolled student.
Rater 1 — Primary
Balanced read. Uses the "follow the letterhead" principle as a tiebreaker: classify by the affiliation under which the writer is speaking. The institutional-vs-personal call comes from register, length, substance, and whether the institution is named for credibility or for attribution.
Rater 2 — Self-described
Takes the literal first identifier the writer offers. "CPA and retail investor" → Accountant; "Individual investor and former CFO" → Individual. No override based on context.
Rater 3 — Letterhead / functional
Overrides self-description with the strongest functional credential. Priority: current institutional role > former institutional role > formal professional credential > sector descriptor > self-id.
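The contrast between the Self-described and Letterhead / functional rubrics can be illustrated with a small sketch. In the actual ensemble these are LLM prompts, not rule code; the data structure below (a list of tier and bucket pairs in the order the writer offers them) and both function names are hypothetical. The Rater 3 result in the example is what the stated priority order implies, not a call taken from the data.

```python
# Priority tiers from the Rater 3 rubric, strongest first.
PRIORITY = [
    "current institutional role",
    "former institutional role",
    "formal professional credential",
    "sector descriptor",
    "self-id",
]

def self_described(identifiers):
    """Rater 2 style: take the first identifier the writer offers."""
    _tier, bucket = identifiers[0]
    return bucket

def functional(identifiers):
    """Rater 3 style: take the identifier in the strongest tier."""
    _tier, bucket = min(identifiers, key=lambda item: PRIORITY.index(item[0]))
    return bucket

# "Individual investor and former CFO", tagged by hand for illustration.
letter = [
    ("self-id", "Individual"),
    ("former institutional role", "Issuer-former"),
]
print(self_described(letter))
print(functional(letter))
```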
Agreement across the three raters (N = 182):
  • Unanimous: 168 (92.3%)
  • 2-of-3 majority: 14 (7.7%)
  • Fleiss' κ: 0.878

Substantially higher agreement than the stance ensemble (κ = 0.514). Majority headline distribution: Individual 134 / Accountant (CPA) 10 / Investment professional 10 / Issuer-current 9 / Industry practitioner 8 / Issuer-former 5 / Academic researcher 4 / Student 2.

Cross-model validation (ChatGPT-5.5, n = 137 overlap).

The same 3 rubric prompts ran through ChatGPT-5.5 as an independent second ensemble.

GPT-Majority vs Claude-Majority: 114 / 137 = 83.2%, Cohen's κ = 0.621. Moderate cross-model agreement on the aggregated label.

Per-rubric agreement:

  • GPT-Primary vs Claude-Primary: κ = 0.657 (substantial)
  • GPT-Self-described vs Claude-Self-described: κ = 0.593 (moderate-substantial)
  • GPT-Letterhead vs Claude-Letterhead: κ = 0.584 (moderate-substantial)

The pattern is systematic. GPT-5.5 has a stronger "Individual" prior than Claude Opus 4.7 across all three rubrics. The biggest split is on writers who sign as "CFO, ACME Corp" or similar institutional role but write in a personal register engaging investor-protection concerns rather than issuer-specific concerns: GPT-Primary classifies as Individual; Claude-Primary classifies as Issuer-current. Both readings are defensible under the rubric. The rubric requires a "speaking-as" judgment, and the two model families weight surface role vs register differently.

Intra-model Fleiss κ is 0.880 for Claude and 0.775 for GPT. Within-model agreement holds for both ensembles; the divergence is across model families.

23 letters fall outside the cross-model majority match. 18 of 23 flow into GPT-Individual from a more specific Claude bucket. 6 of these 23 are already on the contested-letters list internal to Claude's own three-rater ensemble.

How rationales are classified (argument taxonomy anchored on the SEC release; three-rater LLM ensemble)
Each letter can invoke one or more argument families. The taxonomy starts from the SEC's framing in the proposing release (Release Nos. 33-11414; 34-105368; File No. S7-2026-15). Three commenter-distinctive codes cover arguments the SEC does not engage as standalone justifications. 20 codes total: 16 SEC-engaged (9 anti-proposal, 5 pro-proposal, 1 conditional, 1 procedural), 3 commenter-distinctive (IP investor protection; US capital-market leadership; RI investor reliance interests), and 1 "no substantive rationale" for letters that state a position without engaging an argument.

Anti-proposal codes use a red shade and pro-proposal codes a green shade, so the directional balance reads at a glance. Every SEC-engaged code carries a verbatim quote from the proposing release. The quote shows how the SEC itself frames the argument.

Three-rater LLM ensemble. Rationale coding is multi-label: a letter can invoke zero or more codes. Three rubrics classify each letter independently, and the public-facing code list is the per-(letter, code) majority vote across the three.
Rater 1 — Primary
Balanced read. Codes the rationale families the writer substantively argues, even when not framed in the SEC's exact language.
Rater 2 — Literalist
Strict. Codes a family only when the letter explicitly invokes that framing in surface text.
Rater 3 — Inclusive
Expansive. Codes a family whenever plausibly invoked, including allusive references and arguments the writer relies on without fully developing.
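The per-(letter, code) majority vote can be sketched as below. Function names are hypothetical and the three code sets are invented for illustration, using code abbreviations from the taxonomy. The example shows that a per-code majority still exists when the three raters return three different code sets.

```python
from collections import Counter

def majority_codes(rater_sets):
    """Keep each code that at least two of the three raters assigned."""
    votes = Counter(code for codes in rater_sets for code in set(codes))
    return {code for code, n in votes.items() if n >= 2}

def set_agreement(rater_sets):
    """Classify set-level agreement by the number of distinct code sets."""
    distinct = len({frozenset(codes) for codes in rater_sets})
    return {1: "Unanimous", 2: "2-of-3 majority", 3: "Split"}[distinct]

# Hypothetical code sets for one letter: Primary, Literalist, Inclusive.
raters = [{"FR", "IP"}, {"FR"}, {"FR", "IP", "CB"}]
print(sorted(majority_codes(raters)))
print(set_agreement(raters))
```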
Agreement across the three raters (N = 182):
  • Unanimous (all three raters produced identical code sets): 99 (54.4%)
  • 2-of-3 majority: 70 (38.5%)
  • Split (three different code sets): 13 (7.1%)
  • Mean per-code Cohen's κ across pairs: 0.804. Substantial agreement, in line with multi-prompt LLM-annotation benchmarks.
Per-code Cohen's κ (binary code-present vs absent, mean across the three pairwise comparisons). Surface-readable codes have high κ; inferential codes have lower κ. The per-code ranking therefore separates codes that can be read off a letter's surface text from codes that require inference.
ICc 1.00 · ICs 1.00 · US 0.95 · FR 0.92 · AL 0.90 · ST 0.89 · EX 0.89 · IA 0.89 · LE 0.89 · OP 0.88 · NR 0.85 · MF 0.83 · IP 0.82 · CB 0.81 · AU 0.79 · RI 0.74 · CMP 0.69 · SG 0.33 · OV 0.22

SG (signaling) and OV (option value) sit at floor κ. Virtually no commenter cites these codes, so the floor reflects how rare they are in the data.
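The reported mean can be checked against the per-code values printed above. The sketch below averages the 19 values exactly as listed; because they are rounded to two decimals, the result may differ from the reported 0.804 in the third decimal. It also shows the mean with the two floor codes left out, which is a calculation added here for illustration and not a figure from the site.

```python
# Per-code Cohen's kappa values as listed above (rounded to two decimals).
per_code_kappa = {
    "ICc": 1.00, "ICs": 1.00, "US": 0.95, "FR": 0.92, "AL": 0.90,
    "ST": 0.89, "EX": 0.89, "IA": 0.89, "LE": 0.89, "OP": 0.88,
    "NR": 0.85, "MF": 0.83, "IP": 0.82, "CB": 0.81, "AU": 0.79,
    "RI": 0.74, "CMP": 0.69, "SG": 0.33, "OV": 0.22,
}

mean_all = sum(per_code_kappa.values()) / len(per_code_kappa)

# Same mean without the two rarely cited floor codes.
common = {c: k for c, k in per_code_kappa.items() if c not in {"SG", "OV"}}
mean_common = sum(common.values()) / len(common)

print(len(per_code_kappa), round(mean_all, 2), round(mean_common, 2))
```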

Cross-model validation (ChatGPT-5.5, n = 137 overlap).

The same 3 rubric prompts ran through ChatGPT-5.5 as an independent second ensemble.

GPT-Majority vs Claude-Majority: mean per-code Cohen's κ = 0.477. Moderate cross-model agreement, lower than the stance ensemble (κ = 0.886) and the entity ensemble (κ = 0.621). The ranking is consistent with rationale being the most inferential and multi-label of the three ensembles. Set-level exact match (GPT majority code set identical to Claude majority code set): 34 of 137 letters, 24.8%. Mean Jaccard similarity between the two majority sets: 0.52.
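Set-level exact match and Jaccard similarity can be computed as in this sketch. The three letter pairs are invented for illustration, and treating two empty code sets as identical (similarity 1.0) is an assumption; the site does not say how that case is handled.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union of two code sets.

    Two empty sets are scored 1.0 here, which is an assumption.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical (Claude majority set, GPT majority set) pairs.
pairs = [
    ({"FR", "IP"}, {"FR", "IP"}),
    ({"FR", "IP"}, {"FR", "CB", "MF"}),
    ({"NR"}, {"NR", "LE"}),
]

exact_share = sum(a == b for a, b in pairs) / len(pairs)
mean_jaccard = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
print(round(exact_share, 3), round(mean_jaccard, 3))
```

Exact match is the stricter measure: one extra code on either side turns a match into a miss, while Jaccard gives partial credit for the overlap.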

Per-rubric mean κ across the 20 codes:

  • GPT-Primary vs Claude-Primary: κ = 0.483
  • GPT-Literalist vs Claude-Literalist: κ = 0.365
  • GPT-Inclusive vs Claude-Inclusive: κ = 0.361

Surface-readable codes converge across model families: MF (κ = 0.79), CMP (κ = 0.76), FR (κ = 0.70), LE (κ = 0.70), NR (κ = 0.60). Inferential or umbrella codes diverge: EX (κ = 0.20), IP (κ = 0.32), OP (κ = 0.32), AU (κ = 0.39).

The rubric-conditioning pattern visible in the stance and entity ensembles shows up again. GPT-Inclusive fires 4.50 codes per letter; Claude-Inclusive fires 2.84. Same "code whenever plausibly invoked" instruction, very different operationalization. Claude's three rationale raters stay within a 25% spread of each other (2.25 to 2.84 codes per letter); GPT's three span nearly 2x (2.44 to 4.50). The 3-rater majority κ (0.477) is higher than two of the three matched-rubric κs, which shows that aggregation dampens cross-model variance just as it does within-model.

Full argument taxonomy with SEC quotes →

Letters per day

Stance totals

Stance by entity type

Letters grouped by who submitted them, color-coded by stance.

Stance by letter length

Letters grouped by word count, color-coded by stance.

Letter length by entity type

For each entity type, how its letters distribute across word-count buckets.

Word-count buckets: 1–50w · 51–150w · 151–300w · 301–600w · 600+w
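Assigning a letter to a word-count bucket might look like the sketch below. The function name is hypothetical, and reading "600+" as "more than 600 words" (so that a 600-word letter falls in 301-600) is an assumption, since the labels overlap at 600.

```python
import bisect

# Inclusive upper bounds of the first four buckets.
UPPER_BOUNDS = [50, 150, 300, 600]
LABELS = ["1-50", "51-150", "151-300", "301-600", "600+"]

def word_bucket(n_words):
    """Map a word count to its bucket label.

    Assumption: "600+" means strictly more than 600 words.
    """
    if n_words < 1:
        raise ValueError("word count must be at least 1")
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, n_words)]

for n in (50, 51, 600, 601):
    print(n, word_bucket(n))
```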

Regression: predictors of stance — three specifications

All three models use the same predictors (7 entity dummies with Individual as reference, plus log(words+1)). The Logit and LPM share a binary outcome (Support=1 / Oppose=0, Conditional dropped). The ordinal logit uses the full 3-class outcome (Oppose < Conditional < Support). Each cell shows the coefficient, the SE in parentheses, and the p-value.

Variable          Logit                         Ordinal logit                 LPM (OLS)
                  Support vs Oppose             Oppose < Cond. < Support      Support vs Oppose, HC1
Constant          −5.64 *** (1.57), p = 0.000   (cutpoints below)             −0.032 (0.035), p = 0.359
Accountant CPA    +1.39 (1.15), p = 0.229       +1.39 * (0.72), p = 0.055     +0.065 (0.088), p = 0.458
Issuer-current    +1.85 (1.25), p = 0.140       +2.02 *** (0.67), p = 0.002   +0.160 (0.191), p = 0.403
Issuer-former     separated                     +0.67 (1.13), p = 0.553       separated
Investment prof.  separated                     separated                     separated
Academic          +2.17 (1.84), p = 0.239       +2.16 ** (0.93), p = 0.020    +0.425 (0.349), p = 0.224
Industry pract.   separated                     −0.56 (1.14), p = 0.624       separated
Student           +3.49 ** (1.54), p = 0.023    +3.21 * (1.92), p = 0.094     +0.466 (0.373), p = 0.211
log(words+1)      +0.46 (0.33), p = 0.166       +0.61 *** (0.18), p = 0.001   +0.014 (0.011), p = 0.174
N                 249                           273                           249
Fit               McFadden R² = 0.196           McFadden R² = 0.212           R² = 0.132 (adj. 0.104)
Log-likelihood    −31.11                        −94.52                        (OLS)
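The shared predictor set and the two outcome codings can be sketched as follows. This only illustrates the variable construction described for the three specifications; the function names and the ordering of the dummy columns are hypothetical, and the model fitting itself is not shown.

```python
import math

# Seven entity dummies; "Individual" is the omitted reference category.
ENTITY_DUMMIES = [
    "Accountant (CPA)", "Issuer-current", "Issuer-former",
    "Investment professional", "Academic researcher",
    "Industry practitioner", "Student",
]

def design_row(entity, words):
    """Constant, seven entity dummies, and log(words + 1).

    The constant applies to the logit and LPM; an ordinal logit
    would drop it and estimate cutpoints instead.
    """
    if entity != "Individual" and entity not in ENTITY_DUMMIES:
        raise ValueError(f"unknown entity type: {entity}")
    dummies = [1.0 if entity == name else 0.0 for name in ENTITY_DUMMIES]
    return [1.0] + dummies + [math.log(words + 1)]

def binary_outcome(stance):
    """Support=1, Oppose=0; Conditional returns None and is dropped."""
    return {"Support": 1, "Oppose": 0}.get(stance)

def ordinal_outcome(stance):
    """Oppose < Conditional < Support, coded 0, 1, 2."""
    return {"Oppose": 0, "Conditional": 1, "Support": 2}[stance]

print(design_row("Student", 99))
print(binary_outcome("Conditional"), ordinal_outcome("Conditional"))
```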
Proportional-odds assumption (LR test of ordinal vs. multinomial logit)
I compared the ordinal logit (restricted, single slope vector across both cuts) against an unrestricted multinomial logit on the same predictors. LR = 2 × (−89.18 − (−94.52)) = 10.68, df = 8, p = 0.220. The test does not reject the proportional-odds assumption. Caveat: power is low, with only 9 Support letters in the sample.
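The likelihood-ratio arithmetic can be reproduced from the two log-likelihoods reported above. The sketch uses only the standard library: for an even number of degrees of freedom the chi-square upper tail has a closed form, so no statistics package is needed. The helper name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )

ll_ordinal = -94.52      # restricted model (ordinal logit)
ll_multinomial = -89.18  # unrestricted model (multinomial logit)

lr = 2 * (ll_multinomial - ll_ordinal)
p_value = chi2_sf_even_df(lr, df=8)
print(round(lr, 2), round(p_value, 3))
```

The 8 degrees of freedom correspond to one extra slope per predictor in the multinomial model (7 entity dummies plus log(words+1)).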

Rationales cited

Each letter can invoke zero or more argument families (multi-label). The 20-code taxonomy is anchored on the SEC's proposing release; see the argument taxonomy for code definitions and verbatim SEC quotes.

Legend: Anti-proposal (red scale) · Pro-proposal (green scale) · Conditional · Procedural / legal · No rationale

Rationales by stance

Same rationales, stacked by the stance of the letter that cited them. Hover any code pill on the y-axis for a short explanation.

Longest letters

All letters

# Date Commenter Role Stance Words Rationales