Post-codec · deterministic noise basis · strict MDL

8Z Post-Codec Algorithmic Noise Scout

Can deterministic mathematical noise help describe compressed, codec-internal, or 8Z-internal streams as generator + residual?

The v0.7.1 tone fix keeps the v0.3.3 balanced result and the reporting cleanup: candidate-level positives are separated from unique/promoted discoveries, final-byte wins remain at zero, and the next run plan is a staged deep search, not blind brute force.

M Hunter branch · not production encoder yet · forecast ledger · R1 reviewed · strict controls required · v0.3.3 balanced pass

Strong version

Search for the shortest exact description across representation layers. A deterministic generator only matters if generator + parameters + transform + residual + overhead + search cost beats a fair baseline and survives same-budget controls.

Claim boundary

The claim is not that ZIP files are hidden in π, and it is neither a shortcut nor a guarantee. The stronger claim is testable: deterministic generators may sometimes reduce residual cost on selected representation layers, but only when MDL and the controls confirm it.

Public stance: This page treats the idea as a serious 8Z research seed. Skeptical wording is only about claim level and evidence gates, not about dismissing the idea.

1. Clear verdict

Good branch. Not proven. Build the scout carefully.

The proposal is worth pursuing as an 8Z / M Hunter research scout. It should not be sold as proof that arbitrary ZIP, 7Z, PNG, FLAC, JPG, or MP3 files can already be compressed 5–15% further. The useful claim is narrower and stronger: deterministic algorithmic-noise generators may reduce residual cost on selected representation layers, especially codec-internal token/residual streams and 8Z-internal residuals.

R1 synthesis: the reviewers did not kill the idea. They hardened it. The scout must be designed to lose honestly before it is allowed to win.

2. Core model

Generator + residual under strict MDL.

X = original file
B = classical_or_codec_layer(X)
G = deterministic_noise_generator(theta, len(B))
R = residual_transform(B, G)

Accept iff:
L(generator_family + theta + transform + residual_codec(R)
  + overhead + phi_cost + search_budget_cost)
< L_fair_baseline(B) - safety_margin
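
The acceptance rule can be written as a small check. The following is a minimal Python sketch, assuming every term is measured in bits; the class name DescriptionCost and the function accept are hypothetical illustrations, not part of the arena.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptionCost:
    """Cost terms of one candidate description, all in bits (assumed unit)."""
    generator_family: float
    theta: float
    transform: float
    residual: float            # size of residual_codec(R)
    overhead: float
    phi_cost: float
    search_budget_cost: float

    def total(self) -> float:
        """Sum of every term on the left-hand side of the acceptance rule."""
        return (self.generator_family + self.theta + self.transform
                + self.residual + self.overhead + self.phi_cost
                + self.search_budget_cost)


def accept(cost: DescriptionCost, fair_baseline_bits: float,
           safety_margin_bits: float) -> bool:
    """Strict MDL test: the total must be below baseline minus margin."""
    return cost.total() < fair_baseline_bits - safety_margin_bits
```

The comparison is strict, so a candidate that only ties the margin-adjusted baseline is rejected.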

Generator families

Constants, small PRNGs, cellular automata, later LFSR/GF and formula synthesis.

Residual transforms

XOR, submod256, bitplane XOR, nibble transforms, blockwise transforms.
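
Two of these transforms are simple enough to sketch. The Python below assumes that submod256 means byte-wise subtraction modulo 256 (an interpretation of the name, not confirmed by the source); all function names are hypothetical. Bitplane, nibble, and blockwise transforms are not shown.

```python
def _check_lengths(a: bytes, b: bytes) -> None:
    """Both streams must align byte for byte, or the transform is not invertible."""
    if len(a) != len(b):
        raise ValueError("streams must have equal length")


def xor_residual(layer: bytes, generator: bytes) -> bytes:
    """R = B xor G. XOR is its own inverse, so the same call restores B from R."""
    _check_lengths(layer, generator)
    return bytes(x ^ y for x, y in zip(layer, generator))


def submod256_residual(layer: bytes, generator: bytes) -> bytes:
    """R = (B - G) mod 256, byte by byte (assumed meaning of submod256)."""
    _check_lengths(layer, generator)
    return bytes((x - y) & 0xFF for x, y in zip(layer, generator))


def submod256_restore(residual: bytes, generator: bytes) -> bytes:
    """Inverse of submod256_residual: B = (R + G) mod 256."""
    _check_lengths(residual, generator)
    return bytes((x + y) & 0xFF for x, y in zip(residual, generator))
```

Both transforms are lossless, so any saving has to come from the residual codec finding R cheaper to code than B.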

Truth gate

MDL, fair baseline, controls, exact SHA3 reconstruction, and replication.
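
The exact-reconstruction part of the gate can be illustrated with the standard library. This is a sketch under two assumptions: the SHA3 variant is SHA3-256, and the residual transform is XOR. The function names are hypothetical.

```python
import hashlib


def sha3_digest(data: bytes) -> str:
    """Hex SHA3-256 digest (the exact SHA3 variant is an assumption)."""
    return hashlib.sha3_256(data).hexdigest()


def reconstruct_xor(residual: bytes, generator_stream: bytes) -> bytes:
    """Invert an XOR residual transform: B = R xor G."""
    if len(residual) != len(generator_stream):
        raise ValueError("residual and generator stream must have equal length")
    return bytes(r ^ g for r, g in zip(residual, generator_stream))


def verify_exact(expected_digest: str, residual: bytes,
                 generator_stream: bytes) -> bool:
    """True only if the rebuilt layer hashes to the digest stored at encode time."""
    rebuilt = reconstruct_xor(residual, generator_stream)
    return sha3_digest(rebuilt) == expected_digest
```

The check compares against a digest recorded at encode time, because a decoder would not have the original layer available.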

3. Target layer priority

Final compressed bytes are the hardest layer.

Priority	Layer	R1-adjusted stance
1	8Z-internal residual/token layers	Best practical target because 8Z controls the representation and can expose structure before final entropy whitening.
2	Codec-internal residual/token layers	Strong external-codec target: PNG filters, FLAC LPC residuals, DEFLATE literals, LZ match lengths/distances, JPEG/MP3 coefficient domains.
3	Synthetic / procedural streams	Positive-control and hero-file layer. Good for proving the harness works.
4	Selected weak/structured final compressed files	Possible isolated wins, but must beat stronger-codec baselines.
5	Ordinary final ZIP/7Z/PNG/FLAC/JPG/MP3 bytes	Stress test and likely negative-control layer, not the main expected source of wins.

4. R1 LLM review synthesis

Ten reviewers, one main correction.

The strongest R1 message: descriptor cost and phi_cost are not enough. Empirical same-budget nulls and generator-ablation controls must be mandatory.
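
One way to picture a same-budget empirical null is to rerun the identical candidate search on shuffled copies of the data. The sketch below is an illustration only: it uses zlib as the residual codec, XOR as the transform, and a byte shuffle as the null, and it assumes every candidate stream has the same length as the data. The function names are hypothetical, and generator ablation is not shown.

```python
import random
import zlib
from typing import Sequence, Tuple


def best_saving(data: bytes, candidates: Sequence[bytes]) -> int:
    """Largest byte saving of zlib(data xor g) over zlib(data), floored at 0."""
    baseline = len(zlib.compress(data, 9))
    best = 0
    for g in candidates:
        residual = bytes(x ^ y for x, y in zip(data, g))
        best = max(best, baseline - len(zlib.compress(residual, 9)))
    return best


def same_budget_null(data: bytes, candidates: Sequence[bytes],
                     n_null: int = 20, seed: int = 0) -> Tuple[int, float]:
    """Return (observed saving, empirical p-value) under a shuffle null.

    Each null run searches the same candidates, so it spends the same budget.
    """
    rng = random.Random(seed)
    observed = best_saving(data, candidates)
    hits = 0
    for _ in range(n_null):
        shuffled = bytearray(data)
        rng.shuffle(shuffled)              # destroys byte order, keeps histogram
        if best_saving(bytes(shuffled), candidates) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_null + 1)
```

A candidate would only count as exceeding the null when the empirical p-value falls below a threshold fixed before the run.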

Rank	LLM	Paper score	Contribution score	Best contribution
1	Claude Opus 4.8	84	94	Found the key evaluation flaw; required empirical null + generator ablation.
2	ChatGPT 5.5 Pro	87	90	Broad synthesis, v0.1 falsification harness, T1–T5 forecast split.
3	Gemini 3.1 Pro	88	89	Engineering guardrails: chunk floors, strict seeds, log2 search charge.
4	MiniMax M3 Thinking	78	88	Same-codec trap, evaluated trial count, chunk-map and layer costs.
5	Qwen 3.7 Plus	88	86	Approve-with-gates and build sequencing.
6	Grok 4.3 Expert	81	84	Final bytes as stress/negative target.
7	Kimi 2.6 Thinking	78	84	Build-less discipline and codec-internal emphasis.
8	MetaAI Thinking	76	80	Lock constants, hard search caps, phi calibration.
9	GLM 4.7 DeepThink	75	78	Clean module split and search-budget cost framing.
10	DeepSeek V4 DeepThink	72	76	Caution on over-optimism and naming/tightening.

5. Evaluation gates

v0.6 changes the acceptance rule from OR to AND.

Old dangerous gate

strict MDL win
OR survives same-budget controls

This is too weak. It lets lucky search survive.

v0.6 gate

strict MDL win
AND beats fair baseline
AND beats generator ablation
AND exceeds same-budget empirical null
AND exact reconstruction verifies

This is the new minimum gate.
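
The difference between the two gates is easy to see in code. This is a minimal Python sketch; the class GateChecks and both function names are hypothetical, and the old gate's "survives same-budget controls" is mapped to the same-budget null flag as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateChecks:
    """Outcome of each individual check for one candidate."""
    strict_mdl_win: bool
    beats_fair_baseline: bool
    beats_generator_ablation: bool
    exceeds_same_budget_null: bool
    exact_reconstruction: bool


def old_gate(c: GateChecks) -> bool:
    """Old rule: either condition alone is enough."""
    return c.strict_mdl_win or c.exceeds_same_budget_null


def v06_gate(c: GateChecks) -> bool:
    """v0.6 rule: every check must pass."""
    return all((c.strict_mdl_win, c.beats_fair_baseline,
                c.beats_generator_ablation, c.exceeds_same_budget_null,
                c.exact_reconstruction))
```

A candidate with an MDL win that fails the ablation and the null passes old_gate but is rejected by v06_gate.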

Fair baseline: compare against min(len(B), strong_codec(B), strong_codec(layer)), not just the source container size.
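
A fair baseline of this kind can be approximated with standard-library codecs. The sketch below is an assumption-laden stand-in: it uses zlib, bz2, and lzma as the "strong codecs" and leaves out zstd, which needs a third-party package. The function name is hypothetical.

```python
import bz2
import lzma
import zlib
from typing import Tuple


def fair_baseline(layer: bytes) -> Tuple[str, int]:
    """Return (name, size in bytes) of the smallest encoding of the layer.

    The raw length is included so a codec that expands the data never wins.
    """
    sizes = {
        "raw": len(layer),
        "zlib-9": len(zlib.compress(layer, 9)),
        "bz2-9": len(bz2.compress(layer, 9)),
        "lzma-9e": len(lzma.compress(layer, preset=9 | lzma.PRESET_EXTREME)),
    }
    return min(sizes.items(), key=lambda item: item[1])
```

To follow the rule above, the function would be applied to both B and the extracted layer, and the smaller of the two results taken as the baseline.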

6. Forecast ledger

Bold seed, stricter probabilities.

Target	BD forecast	GPT/R1-adjusted stance
T1 ordinary final compressed files	80% for 5–15%; 5% for 50%	Too high for average arbitrary final bytes. Likely low single digits for broad average wins.
T2 selected final compressed files	Implicitly included	Possible isolated 5–15% wins on weak/structured/procedural cases.
T3 codec-internal token/residual layers	Not separated in original seed	Material chance of useful wins; probably best external-codec target.
T4 8Z-internal residual/token layers	Not separated in original seed	Strongest practical 8Z target.
T5 synthetic/procedural hero files	High explorer prior	High chance on raw/weak layers; lower, but still important, after strong final compression.

7. v0.1 scout contract

Build a falsification harness first.

Include

CONST_BYTE_STREAM
PRNG_SMALL
CA_NOISE: rules 30, 90, 110, 184
NO_OP_ABLATION permanent baseline
xor, submod256, bitplane-xor
zlib/lzma/zstd residual codecs

Delay

CONST_COMBO until null calibration is stable
LFSR/GF solvers
formula synthesis
neural/texture models
production 8Z encoder integration

Synthetic CA positive control

Known CA raw stream should be recovered as STRUCTURED_POSITIVE.

Known CA → zstd/lzma → recovery attempt

This tests whether final-byte compression destroys recoverable generator signal.
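
The two synthetic checks can be sketched end to end. The Python below is a toy stand-in, not the arena's generator: it assumes an elementary cellular automaton on a ring of 248 cells seeded with one cell, rows packed into bytes, XOR as the transform, and zlib in place of zstd/lzma. All names are hypothetical, and descriptor and search costs are ignored.

```python
import zlib


def ca_step(state: int, rule: int, width: int) -> int:
    """One elementary-CA step on a ring; bit i+1 is the left neighbour of bit i."""
    mask = (1 << width) - 1
    left = ((state >> 1) | (state << (width - 1))) & mask
    right = ((state << 1) | (state >> (width - 1))) & mask
    new = 0
    for pattern in range(8):               # pattern = 4*left + 2*centre + right
        if (rule >> pattern) & 1:
            l = left if pattern & 4 else ~left & mask
            c = state if pattern & 2 else ~state & mask
            r = right if pattern & 1 else ~right & mask
            new |= l & c & r
    return new


def ca_stream(rule: int, n_bytes: int, width: int = 248) -> bytes:
    """Concatenate successive CA rows into a stream of exactly n_bytes bytes."""
    if width % 8 != 0:
        raise ValueError("width must be a multiple of 8")
    state = 1 << (width // 2)              # single seed cell in the middle
    out = bytearray()
    while len(out) < n_bytes:
        out += state.to_bytes(width // 8, "big")
        state = ca_step(state, rule, width)
    return bytes(out[:n_bytes])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def positive_control(rule: int = 90, n_bytes: int = 4096) -> dict:
    """Compare zlib sizes with and without the generator on raw and final bytes."""
    raw = ca_stream(rule, n_bytes)
    raw_residual = xor_bytes(raw, ca_stream(rule, n_bytes))
    final = zlib.compress(raw, 9)          # stand-in for the final-byte layer
    final_residual = xor_bytes(final, ca_stream(rule, len(final)))
    return {
        "raw_baseline": len(zlib.compress(raw, 9)),
        "raw_with_generator": len(zlib.compress(raw_residual, 9)),
        "final_baseline": len(final),
        "final_with_generator": len(zlib.compress(final_residual, 9)),
    }
```

On the raw stream the residual is all zeros, so the generator trivially wins before costs are charged. The final-byte comparison is only reported here; no outcome is assumed.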

8Z_Mosaic / Appendix-Q-style residual reproduction

Confirm known structured residual wins under the new harder gates.

Codec-internal layers

Move to PNG/FLAC/DEFLATE/LZ token and residual layers.

8. Python arena status

v0.3.3 balanced master passed: harness yes, production compression no

This is the website-facing result layer. The arena writes local diagnostic reports inside each run folder, but public claims belong here only after the live extractor / updater has summarized the run and the results have been deduplicated.

Current interpretation: the scout recovered known generated structure, found representation-layer signals, and kept reported random FPR at zero. This validates the harness and the representation-layer branch. It does not prove that ordinary ZIP/7Z/PNG/FLAC/JPG/MP3 final files can already be made 5–15% smaller.

Runs found

6

6 finished · 0 ongoing or partial.

Candidates tested

2,254,006

Total logged candidate trials visible in summaries / JSONL logs.

Candidate positives

2,616

Candidate-level structured positives. These are not yet unique file-level discoveries.

Synthetic positives

8

Positive controls proving the harness can recover known generated structure.

Final-byte positives

0

Ordinary final compressed byte wins remain unproven.

Max random FPR

0

Reported random false-positive rate in the public summaries.

Reporting cleanup: older visible tables repeated many rows from the same target, especially realish_s12345_numeric_payload_zlib.bin. The page now separates candidate positives from unique/promoted discoveries. The next arena/reporting patch should produce unique_positive_summary.json.

What v0.3.3 proves vs does not prove

Item	Status	Public interpretation
Selftest + smoke	PASS	Basic engine, controls, and synthetic recovery path are working.
Balanced real-layer panel	64/64 complete	The arena can run a real representation-layer panel with heartbeats and bounded logging.
Candidate positives	2,616	Interesting search signal, but must be deduplicated and replicated before becoming a claim.
Control failures visible	30/176	Good reason to preserve controls and audit tables; do not hide rejected rows.
Final-byte wins	0	The hard final-compressed-file claim remains unproven and should stay separate.
Random FPR	0 reported	Promising, but keep same-budget, shuffle, block-bootstrap, Markov, and generator-ablation controls.

Run summary

Run	Mode	Verdict	Progress	Candidates	Candidate +	Best visible hit
standard_balanced_4w	panel	REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED	64/64 (100.0%)	2,250,000	2,600	realish numeric zlib payload · PRED_LINEAR_U8 · xor-prev · 93.95%
smoke	panel	CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED	7/7 (100.0%)	42	2	synthetic CA rule90 PNG IDAT · 99.76%
selftest	panel	SYNTHETIC_RECOVERY_ONLY	2/2 (100.0%)	41	1	synthetic CA rule90 raw stream · 99.88%
quick_panel_4w	panel	REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED	32/32 (100.0%)	3,840	10	synthetic CA seeded PNG IDAT · 99.76%
quick_smoke	panel	CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED	7/7 (100.0%)	42	2	synthetic CA rule90 PNG IDAT · 99.76%
selftest duplicate	panel	SYNTHETIC_RECOVERY_ONLY	2/2 (100.0%)	41	1	synthetic CA rule90 raw stream · 99.88%

Deduped visible hit families

Family	Layer	Best visible file / case	Best visible savings	Generator	Interpretation
Synthetic CA raw	raw	synthetic_ca_rule90_64k.bin	+32,728 B · 99.88%	CA_RULE_90 + xor	Positive control. The harness can recover known generated structure.
Synthetic CA inside PNG/IDAT	codec-internal	synthetic_ca_rule90_png_idat.png	+16,344 B · 99.76%	CA_RULE_90 + xor	Positive control through a codec-internal representation.
Synthetic CA then zlib	codec-internal	synthetic_ca_rule90_then_zlib.bin	+16,344 B · 99.76%	CA_RULE_90 + xor	Positive control showing recoverable generated structure before/inside compressed representation.
Numeric structured payloads	codec-internal	realish_s12345_numeric_payload_zlib.bin	+3,448 B · 93.95% in balanced; +12,936 B · 96.97% in quick panel	PRED_LINEAR_U8 + xor-prev	Interesting representation-layer signal. Needs dedupe, holdout, stronger baselines, and replication.

Public claim level: harness validation + representation-layer signal. Not production compression, not final-byte victory, not proof of BD’s 5–15% forecast yet.

Generated from v0.3.3 balanced results. Max reported random FPR: 0. Control-failure rows visible in prior public summary: 30/176. For forensic analysis, use the live bundle ZIP, not only this public page.

9. Failure modes

The scout must not become a coincidence miner.

False positives

Huge offset/rule/transform searches can find lucky local wins.
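
To make the scale of the problem concrete, here is a sketch of a log2 search charge of the kind mentioned in the R1 review table. The exact form used by the arena is not stated on this page, so ceil(log2(trials)) is an assumption and the function name is hypothetical.

```python
def search_charge_bits(n_trials: int) -> int:
    """ceil(log2(n_trials)): bits needed to index one winner among n_trials."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    # Integer arithmetic avoids floating-point rounding at exact powers of two.
    return (n_trials - 1).bit_length()
```

Under this assumed charge, the 2,250,000 candidates of the balanced run would cost 22 bits, under three bytes. A charge that small fits the R1 point that cost terms alone are not enough and that empirical nulls are needed as well.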

Weak baseline trap

If a stronger codec alone wins, do not credit math.

Tiny chunk overfit

Small chunks can create fake victories. Report them only as debug.

π-only overfocus

π is a deterministic shared source, but the tested signal may come from CA rules, mappings, geometry, or layer choice. The arena must measure which part actually carries compression value.

10. Next action

Next: run longer, but in controlled waves.

Yes, this arena should eventually run for days or even weeks. But the right move is not blind brute force across every family, chunk size, transform, and target. The balanced pass already showed the branch is alive; now the goal is to turn candidate-level signal into unique, replicated, control-surviving discoveries.

Do longer runs

Use multi-day compute where it matters: real representation layers, codec-internal residual/token streams, 8Z-internal payloads, and families that survived the balanced panel.

Do not just scale noise

Do not spend weeks mining final compressed byte soup or tiny chunks until dedupe, controls, and reporting are strong enough to avoid fake discoveries.

Seed → bridge → test → result

Seed: v0.3.3 found clean synthetic recovery and representation-layer candidate positives.
Bridge: dedupe + controls + targeted deep search turn candidate positives into evidence.
Test: run staged deep panels with live extracts, unique-positive summaries, and holdout replication.
Result: promote only real layer families that beat fair baselines after search cost and controls.

Recommended run ladder

v0.3.4 reporting polish

Add unique_positive_summary.json, dedupe by file/layer/generator/transform/seed, and split candidate positives vs unique positives vs replicated positives. This should happen before week-scale runs.
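
The dedupe step could look roughly like the following. This is a sketch only: the row fields (file, layer, generator, transform, seed, saving_bytes) are hypothetical names chosen to match the dedupe key above, not the arena's actual log schema.

```python
import json
from typing import Dict, Iterable, List

# Assumed dedupe key, taken from the list above.
KEY_FIELDS = ("file", "layer", "generator", "transform", "seed")


def unique_positive_summary(rows: Iterable[Dict]) -> List[Dict]:
    """Collapse candidate-level positives into one entry per dedupe key."""
    groups: Dict[tuple, Dict] = {}
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        entry = groups.get(key)
        if entry is None:
            entry = dict(zip(KEY_FIELDS, key))
            entry["candidate_hits"] = 0
            entry["best_saving_bytes"] = row["saving_bytes"]
            groups[key] = entry
        entry["candidate_hits"] += 1
        entry["best_saving_bytes"] = max(entry["best_saving_bytes"],
                                         row["saving_bytes"])
    # Largest saving first, so the most interesting families lead the report.
    return sorted(groups.values(),
                  key=lambda e: e["best_saving_bytes"], reverse=True)


def to_json(summary: List[Dict]) -> str:
    """Serialise the summary, e.g. for writing to unique_positive_summary.json."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

Keeping candidate_hits next to each unique entry preserves the candidate-level count while making clear how many distinct discoveries it represents. Replication status would need a further field that this sketch does not model.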

12–24h confirmation run

Run real-layer panel at 250k candidates per target. Keep all controls and live-extract every few hours.

RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000

2–3 day deep representation-layer run

Use only real/codec-internal representation layers. Keep top-k logging, same-budget nulls, shuffle, block-bootstrap, Markov, and generator ablation. Avoid final-byte stress except as a separate negative control.

RUN_50_WEEKEND_DEEP_REAL_LAYER_NO_PAUSE.bat 4 500000

Weekend / week Atlas-candidate run

Spend the large budget only on families that survive Stage 1–2. Require replicated positives across seeds and holdouts before calling anything an Atlas candidate.

RUN_00_MASTER_ALL_TESTS_NO_PAUSE.bat 4 1000000 250000 weekend 0

True-final stress check

Run final compressed bytes separately as a hard negative/stress layer. A zero result is useful because it confirms the paper’s strongest caution: final bytes are the hardest layer.

RUN_40_TRUE_FINAL_STRESS_NO_PAUSE.bat 4 50000 balanced

My recommendation: do one reporting hotfix first, then run RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000. If the same families survive, then launch the weekend/deep run. Longer compute is justified, but only after the arena can tell unique discovery from repeated candidate hits.