8Z Post-Codec Algorithmic Noise Scout
Can deterministic mathematical noise help describe compressed, codec-internal, or 8Z-internal streams as generator + residual?
The v0.7.1 tone fix keeps the v0.3.3 balanced result and the reporting cleanup: candidate-level positives are separated from unique/promoted discoveries, final-byte wins remain zero, and the next run plan is a staged deep search, not blind brute force.
Strong version
Search for the shortest exact description across representation layers. A deterministic generator only matters if generator + parameters + transform + residual + overhead + search cost beats a fair baseline and survives same-budget controls.
Claim boundary
The claim is neither a shortcut nor a guarantee that ZIP files are hidden in π. The stronger claim is testable: deterministic generators may sometimes reduce residual cost on selected representation layers, but only when MDL and the controls say so.
Good branch. Not proven. Build the scout carefully.
The proposal is worth pursuing as an 8Z / M Hunter research scout. It should not be sold as proof that arbitrary ZIP, 7Z, PNG, FLAC, JPG, or MP3 files can already be compressed 5–15% further. The useful claim is narrower and stronger: deterministic algorithmic-noise generators may reduce residual cost on selected representation layers, especially codec-internal token/residual streams and 8Z-internal residuals.
R1 synthesis: the reviewers did not kill the idea. They hardened it. The scout must be designed to lose honestly before it is allowed to win.
Generator + residual under strict MDL.
X = original file
B = classical_or_codec_layer(X)
G = deterministic_noise_generator(theta, len(B))
R = residual_transform(B, G)

Accept iff:
  L(generator_family + theta + transform + residual_codec(R) + overhead + phi_cost + search_budget_cost)
    < L_fair_baseline(B) - safety_margin
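The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the arena's implementation: zlib stands in for both the residual codec and the baseline codec, XOR stands in for the residual transform, the search charge is a plain log2 of the trial count, and `phi_cost` is left out. The names `description_bits`, `accept`, `descriptor_bytes`, and the 64-bit default margin are all hypothetical.

```python
import math
import zlib


def description_bits(residual: bytes, descriptor_bytes: int, trials: int) -> float:
    """Total cost in bits: coded residual + descriptor + log2 search charge.

    Assumption: zlib level 9 is the residual codec and the descriptor
    (generator family, theta, transform, overhead) is a flat byte count.
    """
    residual_bits = 8 * len(zlib.compress(residual, 9))
    search_bits = math.log2(max(trials, 1))
    return residual_bits + 8 * descriptor_bytes + search_bits


def accept(layer: bytes, generated: bytes, descriptor_bytes: int,
           trials: int, safety_margin_bits: float = 64.0) -> bool:
    """Accept iff generator + residual beats the baseline by the safety margin."""
    if len(generated) < len(layer):
        raise ValueError("generator output must cover the whole layer")
    # R = residual_transform(B, G), here a plain XOR.
    residual = bytes(b ^ g for b, g in zip(layer, generated))
    # Baseline: the cheaper of storing B raw or zlib-coding it.
    baseline_bits = 8 * min(len(layer), len(zlib.compress(layer, 9)))
    candidate_bits = description_bits(residual, descriptor_bytes, trials)
    return candidate_bits < baseline_bits - safety_margin_bits
```

A generator that reproduces the layer exactly leaves an all-zero residual and passes; an unrelated generator leaves the residual as incompressible as the layer and fails, because the descriptor and search charges are added on top.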
Constants, small PRNGs, cellular automata, later LFSR/GF and formula synthesis.
XOR, submod256, bitplane XOR, nibble transforms, blockwise transforms.
MDL, fair baseline, controls, exact SHA3 reconstruction, and replication.
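As a hedged sketch of the transform and verification pieces named above, the following Python shows XOR, submod256 with its inverse, one possible reading of bitplane XOR, and a SHA3-based round-trip check. The exact definition of "bitplane XOR" is not given on this page, so the single-plane masking used here is an assumption, as are all function names.

```python
import hashlib


def xor(b: bytes, g: bytes) -> bytes:
    """Byte-wise XOR; applying it twice with the same g restores b."""
    return bytes(x ^ y for x, y in zip(b, g))


def submod256(b: bytes, g: bytes) -> bytes:
    """Residual r = (b - g) mod 256."""
    return bytes((x - y) & 0xFF for x, y in zip(b, g))


def addmod256(r: bytes, g: bytes) -> bytes:
    """Inverse of submod256: b = (r + g) mod 256."""
    return bytes((x + y) & 0xFF for x, y in zip(r, g))


def bitplane_xor(b: bytes, g: bytes, plane: int) -> bytes:
    """Assumed definition: XOR only bit `plane` of g into b (self-inverse)."""
    mask = 1 << plane
    return bytes(x ^ (y & mask) for x, y in zip(b, g))


def round_trip_ok(layer: bytes, g: bytes) -> bool:
    """Check each transform reconstructs the layer exactly, compared by SHA3-256."""
    want = hashlib.sha3_256(layer).digest()
    rebuilt = [
        xor(xor(layer, g), g),
        addmod256(submod256(layer, g), g),
        bitplane_xor(bitplane_xor(layer, g, 3), g, 3),
    ]
    return all(hashlib.sha3_256(r).digest() == want for r in rebuilt)
```

Comparing digests rather than raw buffers mirrors the "exact SHA3 reconstruction" gate: a candidate that cannot rebuild the layer bit for bit is rejected regardless of how small its residual looks.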
Final compressed bytes are the hardest layer.
| Priority | Layer | R1-adjusted stance |
|---|---|---|
| 1 | 8Z-internal residual/token layers | Best practical target because 8Z controls the representation and can expose structure before final entropy whitening. |
| 2 | Codec-internal residual/token layers | Strong external-codec target: PNG filters, FLAC LPC residuals, DEFLATE literals, LZ match lengths/distances, JPEG/MP3 coefficient domains. |
| 3 | Synthetic / procedural streams | Positive-control and hero-file layer. Good for proving the harness works. |
| 4 | Selected weak/structured final compressed files | Possible isolated wins, but must beat stronger-codec baselines. |
| 5 | Ordinary final ZIP/7Z/PNG/FLAC/JPG/MP3 bytes | Stress test and likely negative-control layer, not the main expected source of wins. |
Ten reviewers, one main correction.
The strongest R1 message: descriptor cost and phi_cost are not enough. Empirical same-budget nulls and generator-ablation controls must be mandatory.
| Rank | LLM | Paper score | Contribution score | Best contribution |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 84 | 94 | Found the key evaluation flaw; required empirical null + generator ablation. |
| 2 | ChatGPT 5.5 Pro | 87 | 90 | Broad synthesis, v0.1 falsification harness, T1–T5 forecast split. |
| 3 | Gemini 3.1 Pro | 88 | 89 | Engineering guardrails: chunk floors, strict seeds, log2 search charge. |
| 4 | MiniMax M3 Thinking | 78 | 88 | Same-codec trap, evaluated trial count, chunk-map and layer costs. |
| 5 | Qwen 3.7 Plus | 88 | 86 | Approve-with-gates and build sequencing. |
| 6 | Grok 4.3 Expert | 81 | 84 | Final bytes as stress/negative target. |
| 7 | Kimi 2.6 Thinking | 78 | 84 | Build-less discipline and codec-internal emphasis. |
| 8 | MetaAI Thinking | 76 | 80 | Lock constants, hard search caps, phi calibration. |
| 9 | GLM 4.7 DeepThink | 75 | 78 | Clean module split and search-budget cost framing. |
| 10 | DeepSeek V4 DeepThink | 72 | 76 | Caution on over-optimism and naming/tightening. |
v0.6 changes the acceptance rule from OR to AND.
strict MDL win OR survives same-budget controls
This is too weak. It lets lucky search survive.
strict MDL win AND beats fair baseline AND beats generator ablation AND exceeds same-budget empirical null AND exact reconstruction verifies
This is the new minimum gate.
Fair baseline: compare against min(len(B), strong_codec(B), strong_codec(layer)), not just the source container size.
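A minimal Python sketch of the fair baseline and the AND gate follows. It assumes zlib and lzma as the "strong codecs" and a single layer buffer, whereas the rule above also allows a separately coded layer. `GateEvidence` and its field names are hypothetical. Because `candidate_bytes` is taken to be the full description length, the "strict MDL win" and "beats fair baseline" conditions collapse into one comparison here.

```python
import lzma
import zlib
from dataclasses import dataclass


def fair_baseline_bytes(layer: bytes) -> int:
    """Smallest of: raw length, zlib level 9, lzma preset 9 (assumed strong codecs)."""
    return min(
        len(layer),
        len(zlib.compress(layer, 9)),
        len(lzma.compress(layer, preset=9)),
    )


@dataclass
class GateEvidence:
    candidate_bytes: int      # full description: descriptor + residual + search charge
    baseline_bytes: int       # output of fair_baseline_bytes()
    ablation_bytes: int       # same pipeline with the generator replaced by a no-op
    null_best_bytes: int      # best cost found under the same-budget empirical null
    reconstruction_ok: bool   # exact reconstruction verified
    safety_margin: int = 0


def passes_gate(e: GateEvidence) -> bool:
    """Every condition must hold; any single failure rejects the candidate."""
    return (
        e.candidate_bytes < e.baseline_bytes - e.safety_margin
        and e.candidate_bytes < e.ablation_bytes
        and e.candidate_bytes < e.null_best_bytes
        and e.reconstruction_ok
    )
```

Under the old OR rule, a candidate with a lucky null comparison could pass without beating the baseline; with the conjunction, a single failed control is enough to reject it.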
Bold seed, stricter probabilities.
| Target | BD forecast | GPT/R1-adjusted stance |
|---|---|---|
| T1 ordinary final compressed files | 80% for 5–15%; 5% for 50% | Too high for average arbitrary final bytes. Likely low single digits for broad average wins. |
| T2 selected final compressed files | Implicitly included | Possible isolated 5–15% wins on weak/structured/procedural cases. |
| T3 codec-internal token/residual layers | Not separated in original seed | Material chance of useful wins; probably best external-codec target. |
| T4 8Z-internal residual/token layers | Not separated in original seed | Strongest practical 8Z target. |
| T5 synthetic/procedural hero files | High explorer prior | High chance on raw/weak layers; lower but important after strong final compression. |
Build a falsification harness first.
Build first:
- CONST_BYTE_STREAM
- PRNG_SMALL
- CA_NOISE: rules 30, 90, 110, 184
- NO_OP_ABLATION permanent baseline
- xor, submod256, bitplane-xor
- zlib/lzma/zstd residual codecs

Defer:
- CONST_COMBO until null calibration is stable
- LFSR/GF solvers
- formula synthesis
- neural/texture models
- production 8Z encoder integration
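For the CA_NOISE entry in the list above, here is a hedged sketch of an elementary cellular automaton byte stream. The page does not specify how the arena seeds the automaton or packs cells into bytes, so the choices here are assumptions for illustration: a single live cell in the middle, a wrap-around row of 64 cells, and rows packed most-significant bit first.

```python
def ca_noise(rule: int, n_bytes: int, width: int = 64) -> bytes:
    """Deterministic byte stream from an elementary CA (e.g. rule 30, 90, 110, 184).

    Assumptions: single live seed cell at width // 2, periodic boundary,
    each generation packed into width // 8 bytes, MSB first.
    """
    if not 0 <= rule <= 255:
        raise ValueError("rule must be in 0..255")
    if width <= 0 or width % 8:
        raise ValueError("width must be a positive multiple of 8")

    cells = [0] * width
    cells[width // 2] = 1
    out = bytearray()

    while len(out) < n_bytes:
        # Pack the current generation into bytes.
        for i in range(0, width, 8):
            byte = 0
            for bit in cells[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        # Next generation: the (left, centre, right) triple indexes a bit of `rule`.
        cells = [
            (rule >> ((cells[(i - 1) % width] << 2)
                      | (cells[i] << 1)
                      | cells[(i + 1) % width])) & 1
            for i in range(width)
        ]

    return bytes(out[:n_bytes])
```

The descriptor for such a stream is tiny (rule, width, seed position, length), which is why a CA-generated file makes a natural positive control: if the harness cannot recover it, the harness is broken.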
Synthetic CA positive control
Known CA raw stream should be recovered as STRUCTURED_POSITIVE.
Known CA → zstd/lzma → recovery attempt
This tests whether final-byte compression destroys recoverable generator signal.
8Z_Mosaic / Appendix-Q-style residual reproduction
Confirm known structured residual wins under the new harder gates.
Codec-internal layers
Move to PNG/FLAC/DEFLATE/LZ token and residual layers.
v0.3.3 balanced master passed: harness yes, production compression no
This is the website-facing result layer. The arena writes local diagnostic reports inside each run folder, but public claims belong here only after the live extractor / updater has summarized the run and the results have been deduplicated.
- Runs: 6 (6 finished · 0 ongoing or partial).
- Candidate trials: 2,254,006. Total logged candidate trials visible in summaries / JSONL logs.
- Candidate-level structured positives: 2,616. These are not yet unique file-level discoveries.
- Positive controls: 8. These prove the harness can recover known generated structure.
- Final-byte wins: 0. Ordinary final compressed byte wins remain unproven.
- Random false-positive rate: 0, as reported in the public summaries.
realish_s12345_numeric_payload_zlib.bin. The page now separates candidate positives from unique/promoted discoveries. The next arena/reporting patch should produce unique_positive_summary.json.
What v0.3.3 proves vs does not prove
| Item | Status | Public interpretation |
|---|---|---|
| Selftest + smoke | PASS | Basic engine, controls, and synthetic recovery path are working. |
| Balanced real-layer panel | 64/64 complete | The arena can run a real representation-layer panel with heartbeats and bounded logging. |
| Candidate positives | 2,616 | Interesting search signal, but must be deduplicated and replicated before becoming a claim. |
| Control failures visible | 30/176 | Good reason to preserve controls and audit tables; do not hide rejected rows. |
| Final-byte wins | 0 | The hard final-compressed-file claim remains unproven and should stay separate. |
| Random FPR | 0 reported | Promising, but keep same-budget, shuffle, block-bootstrap, Markov, and generator-ablation controls. |
Run summary
| Run | Mode | Verdict | Progress | Candidates | Random FPR | Candidate positives | Best visible hit |
|---|---|---|---|---|---|---|---|
| standard_balanced_4w | panel | REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED | 64/64 (100.0%) | 2,250,000 | 0.0 | 2,600 | realish numeric zlib payload · PRED_LINEAR_U8 · xor-prev · 93.95% |
| smoke | panel | CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED | 7/7 (100.0%) | 42 | 0.0 | 2 | synthetic CA rule90 PNG IDAT · 99.76% |
| selftest | panel | SYNTHETIC_RECOVERY_ONLY | 2/2 (100.0%) | 41 | 0.0 | 1 | synthetic CA rule90 raw stream · 99.88% |
| quick_panel_4w | panel | REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED | 32/32 (100.0%) | 3,840 | 0.0 | 10 | synthetic CA seeded PNG IDAT · 99.76% |
| quick_smoke | panel | CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED | 7/7 (100.0%) | 42 | 0.0 | 2 | synthetic CA rule90 PNG IDAT · 99.76% |
| selftest duplicate | panel | SYNTHETIC_RECOVERY_ONLY | 2/2 (100.0%) | 41 | 0.0 | 1 | synthetic CA rule90 raw stream · 99.88% |
Deduped visible hit families
| Family | Layer | Best visible file / case | Best visible savings | Generator | Interpretation |
|---|---|---|---|---|---|
| Synthetic CA raw | raw | synthetic_ca_rule90_64k.bin | +32,728 B · 99.88% | CA_RULE_90 + xor | Positive control. The harness can recover known generated structure. |
| Synthetic CA inside PNG/IDAT | codec-internal | synthetic_ca_rule90_png_idat.png | +16,344 B · 99.76% | CA_RULE_90 + xor | Positive control through a codec-internal representation. |
| Synthetic CA then zlib | codec-internal | synthetic_ca_rule90_then_zlib.bin | +16,344 B · 99.76% | CA_RULE_90 + xor | Positive control showing recoverable generated structure before/inside compressed representation. |
| Numeric structured payloads | codec-internal | realish_s12345_numeric_payload_zlib.bin | +3,448 B · 93.95% in balanced; +12,936 B · 96.97% in quick panel | PRED_LINEAR_U8 + xor-prev | Interesting representation-layer signal. Needs dedupe, holdout, stronger baselines, and replication. |
Generated from v0.3.3 balanced results. Max reported random FPR: 0. Control-failure rows visible in prior public summary: 30/176. For forensic analysis, use the live bundle ZIP, not only this public page.
The scout must not become a coincidence miner.
Huge offset/rule/transform searches can find lucky local wins.
If a stronger codec alone wins, do not credit math.
Small chunks can create fake victories. Report them only as debug.
π is a deterministic shared source, but the tested signal may come from CA rules, mappings, geometry, or layer choice. The arena must measure which part actually carries compression value.
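One way to guard against coincidence mining is a same-budget empirical null: rerun the identical candidate search on shuffled copies of the layer and require the real result to beat every null result. The sketch below is a simplified illustration, assuming XOR as the transform, zlib as the residual codec, and a byte shuffle as the only null model. The block-bootstrap and Markov nulls mentioned elsewhere on this page are not shown, and all function names are hypothetical.

```python
import random
import zlib
from typing import List, Sequence


def best_cost(layer: bytes, streams: Sequence[bytes]) -> int:
    """Smallest zlib-coded XOR residual over a fixed set of candidate streams."""
    return min(
        len(zlib.compress(bytes(a ^ b for a, b in zip(layer, s)), 9))
        for s in streams
    )


def same_budget_null(layer: bytes, streams: Sequence[bytes],
                     n_null: int = 20, seed: int = 0) -> List[int]:
    """Best cost on shuffled copies of the layer, searching the same candidates."""
    rng = random.Random(seed)
    costs = []
    for _ in range(n_null):
        shuffled = bytearray(layer)
        rng.shuffle(shuffled)  # keeps the byte histogram, destroys ordering
        costs.append(best_cost(bytes(shuffled), streams))
    return costs


def exceeds_null(layer: bytes, streams: Sequence[bytes],
                 n_null: int = 20, seed: int = 0) -> bool:
    """True only if the real search beats every same-budget null search."""
    return best_cost(layer, streams) < min(same_budget_null(layer, streams, n_null, seed))
```

Because the null runs spend exactly the same candidate budget, a larger search raises the bar it has to clear; a win that only appears because many candidates were tried should show up in the null runs as well.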
Next: run longer, but in controlled waves.
Yes, this arena should eventually run for days or even weeks. But the right move is not blind brute force across every family, chunk size, transform, and target. The balanced pass already showed the branch is alive; now the goal is to turn candidate-level signal into unique, replicated, control-surviving discoveries.
Use multi-day compute where it matters: real representation layers, codec-internal residual/token streams, 8Z-internal payloads, and families that survived the balanced panel.
Do not spend weeks mining final compressed byte soup or tiny chunks until dedupe, controls, and reporting are strong enough to avoid fake discoveries.
Seed → bridge → test → result
- Seed: v0.3.3 found clean synthetic recovery and representation-layer candidate positives.
- Bridge: dedupe + controls + targeted deep search turn candidate positives into evidence.
- Test: run staged deep panels with live extracts, unique-positive summaries, and holdout replication.
- Result: promote only real layer families that beat fair baselines after search cost and controls.
Recommended run ladder
v0.3.4 reporting polish
Add unique_positive_summary.json, dedupe by file/layer/generator/transform/seed, and split candidate positives vs unique positives vs replicated positives. This should happen before week-scale runs.
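The dedupe step could look roughly like the following Python sketch. The arena's actual JSONL schema is not shown on this page, so the row fields (`file`, `layer`, `generator`, `transform`, `seed`, `saved_bytes`) and the output layout are hypothetical. Replicated positives are not computed here because they need holdout information that this sketch does not model.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Dedupe key taken from the text: file / layer / generator / transform / seed.
KEY_FIELDS = ("file", "layer", "generator", "transform", "seed")


def unique_positive_summary(candidates: List[dict]) -> Dict[str, object]:
    """Collapse candidate-level positives into one entry per dedupe key."""
    groups = defaultdict(list)
    for row in candidates:
        groups[tuple(row[k] for k in KEY_FIELDS)].append(row)

    unique = []
    # Sort on stringified keys so the output order is stable across runs.
    for key, rows in sorted(groups.items(), key=lambda kv: tuple(map(str, kv[0]))):
        entry = dict(zip(KEY_FIELDS, key))
        entry["candidate_hits"] = len(rows)
        entry["best_saved_bytes"] = max(r["saved_bytes"] for r in rows)
        unique.append(entry)

    return {
        "candidate_positives": len(candidates),
        "unique_positives": len(unique),
        "unique": unique,
    }


def to_json(summary: Dict[str, object]) -> str:
    """Serialize for a file such as unique_positive_summary.json."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

Reporting both counts side by side makes the gap visible: thousands of candidate hits may collapse to a handful of unique keys, and only those unique keys are worth carrying into holdout replication.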
12–24h confirmation run
Run real-layer panel at 250k candidates per target. Keep all controls and live-extract every few hours.
RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000
2–3 day deep representation-layer run
Use only real/codec-internal representation layers. Keep top-k logging, same-budget nulls, shuffle, block-bootstrap, Markov, and generator ablation. Avoid final-byte stress except as a separate negative control.
RUN_50_WEEKEND_DEEP_REAL_LAYER_NO_PAUSE.bat 4 500000
Weekend / week Atlas-candidate run
Spend the large budget only on families that survive Stage 1–2. Require replicated positives across seeds and holdouts before calling anything an Atlas candidate.
RUN_00_MASTER_ALL_TESTS_NO_PAUSE.bat 4 1000000 250000 weekend 0
True-final stress check
Run final compressed bytes separately as a hard negative/stress layer. A zero result is useful because it confirms the paper’s strongest caution: final bytes are the hardest layer.
RUN_40_TRUE_FINAL_STRESS_NO_PAUSE.bat 4 50000 balanced
RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000. If the same families survive, then launch the weekend/deep run. Longer compute is justified, but only after the arena can tell unique discovery from repeated candidate hits.