Post-codec · deterministic noise basis · strict MDL

8Z Post-Codec Algorithmic Noise Scout

Can deterministic mathematical noise help describe compressed, codec-internal, or 8Z-internal streams as generator + residual?

v0.7.1 tone fix keeps the v0.3.3 balanced result and reporting cleanup: candidate-level positives are separated from unique/promoted discoveries, final-byte wins remain zero, and the next run plan is staged deep search, not blind brute force.

M Hunter branch · not production encoder yet · forecast ledger · R1 reviewed · strict controls required · v0.3.3 balanced pass

Strong version

Search for the shortest exact description across representation layers. A deterministic generator only matters if generator + parameters + transform + residual + overhead + search cost beats a fair baseline and survives same-budget controls.
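The comparison above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the arena's code: `prng_stream`, `description_bits`, `accept`, the 8-bit family and transform tags, the 32-bit seed cost, and the 64-bit safety margin are all hypothetical choices, and zlib stands in for the residual codec.

```python
import random
import zlib

def prng_stream(seed: int, n: int) -> bytes:
    """Deterministic byte stream from a seeded PRNG (hypothetical stand-in for a generator)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

def description_bits(residual: bytes, theta_bits: int, family_bits: int = 8,
                     transform_bits: int = 8, overhead_bits: int = 0,
                     phi_bits: int = 0, search_bits: int = 0) -> int:
    """Total cost in bits: generator family + parameters + transform + coded residual + charges."""
    residual_bits = 8 * len(zlib.compress(residual, 9))
    return (family_bits + theta_bits + transform_bits + residual_bits
            + overhead_bits + phi_bits + search_bits)

def accept(candidate_bits: int, baseline_bits: int, safety_margin_bits: int = 64) -> bool:
    """Accept only if the full description beats the baseline by more than the margin."""
    return candidate_bits < baseline_bits - safety_margin_bits

# Toy layer: a PRNG stream with a sparse structured perturbation on top.
n = 4096
sparse = bytes(1 if i % 64 == 0 else 0 for i in range(n))
layer = bytes(g ^ s for g, s in zip(prng_stream(1234, n), sparse))
baseline_bits = 8 * min(len(layer), len(zlib.compress(layer, 9)))

def try_seed(seed: int) -> bool:
    """Score one candidate seed with an XOR residual and a 32-bit parameter cost."""
    residual = bytes(b ^ g for b, g in zip(layer, prng_stream(seed, n)))
    return accept(description_bits(residual, theta_bits=32, search_bits=1), baseline_bits)
```

With the matching seed the residual collapses to the sparse pattern and the candidate is accepted; with any other seed the residual looks random, the coded residual alone exceeds the baseline, and the candidate is rejected.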

Claim boundary

The claim is not that ZIP files are hidden in π, and it is neither a shortcut nor a guarantee. The stronger claim is testable: deterministic generators may sometimes reduce residual cost on selected representation layers when MDL and controls say yes.

Public stance: This page treats the idea as a serious 8Z research seed. Skeptical wording is only about claim level and evidence gates, not about dismissing the idea.
1. Clear verdict

Good branch. Not proven. Build the scout carefully.

The proposal is worth pursuing as an 8Z / M Hunter research scout. It should not be sold as proof that arbitrary ZIP, 7Z, PNG, FLAC, JPG, or MP3 files can already be compressed 5–15% further. The useful claim is narrower and stronger: deterministic algorithmic-noise generators may reduce residual cost on selected representation layers, especially codec-internal token/residual streams and 8Z-internal residuals.

R1 synthesis: the reviewers did not kill the idea. They hardened it. The scout must be designed to lose honestly before it is allowed to win.

2. Core model

Generator + residual under strict MDL.

X = original file
B = classical_or_codec_layer(X)
G = deterministic_noise_generator(theta, len(B))
R = residual_transform(B, G)

Accept iff:
L(generator_family + theta + transform + residual_codec(R)
  + overhead + phi_cost + search_budget_cost)
< L_fair_baseline(B) - safety_margin
Generator families

Constants, small PRNGs, cellular automata, later LFSR/GF and formula synthesis.
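As one example of a generator family, an elementary cellular automaton can be turned into a deterministic byte stream. The sketch below is a minimal version under assumptions that are not taken from the arena: a circular row of `width` cells, a single live cell in the middle as the start state, and rows packed most-significant-bit first. The name `ca_bytes` is hypothetical.

```python
def ca_bytes(rule: int, n_bytes: int, width: int = 64) -> bytes:
    """Byte stream from an elementary cellular automaton on a ring of cells."""
    if not 0 <= rule <= 255:
        raise ValueError("rule must be in 0..255")
    if width <= 0 or width % 8 != 0:
        raise ValueError("width must be a positive multiple of 8")
    cells = [0] * width
    cells[width // 2] = 1  # assumed start state: one live cell in the middle
    out = bytearray()
    while len(out) < n_bytes:
        # Pack the current row into bytes, most significant bit first.
        for i in range(0, width, 8):
            byte = 0
            for bit in cells[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        # Next row: look up (left, center, right) in the 8-bit rule table.
        cells = [
            (rule >> (cells[i - 1] * 4 + cells[i] * 2 + cells[(i + 1) % width])) & 1
            for i in range(width)
        ]
    return bytes(out[:n_bytes])

first_rows = ca_bytes(90, 3, width=8)  # rows 00001000, 00010100, 00100010
```

The whole stream is described by three small parameters (rule, width, start state), which is what makes such a generator cheap to encode when it does fit a layer.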

Residual transforms

XOR, submod256, bitplane XOR, nibble transforms, blockwise transforms.
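Each transform has to be exactly invertible, otherwise the residual cannot reconstruct the layer. The sketch below shows three of the listed transforms with their inverses. The function names are hypothetical, and the reading of "bitplane XOR" as "XOR only the generator bits selected by a mask" is an assumption, not a confirmed definition.

```python
def _check(b: bytes, g: bytes) -> None:
    if len(b) != len(g):
        raise ValueError("layer and generator stream must have equal length")

def xor_residual(b: bytes, g: bytes) -> bytes:
    """R = B xor G. The same call with (R, G) recovers B."""
    _check(b, g)
    return bytes(x ^ y for x, y in zip(b, g))

def submod256_residual(b: bytes, g: bytes) -> bytes:
    """R = (B - G) mod 256, bytewise."""
    _check(b, g)
    return bytes((x - y) & 0xFF for x, y in zip(b, g))

def submod256_inverse(r: bytes, g: bytes) -> bytes:
    """B = (R + G) mod 256, bytewise."""
    _check(r, g)
    return bytes((x + y) & 0xFF for x, y in zip(r, g))

def bitplane_xor_residual(b: bytes, g: bytes, plane_mask: int) -> bytes:
    """XOR only the bit planes selected by plane_mask (assumed meaning). Self-inverse."""
    _check(b, g)
    return bytes(x ^ (y & plane_mask) for x, y in zip(b, g))

layer = bytes(range(0, 256, 3))
gen = bytes((7 * i + 1) & 0xFF for i in range(len(layer)))
```

Whichever transform is chosen, its identifier (and the mask, for the bitplane variant) is part of the description and has to be charged in the MDL total.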

Truth gate

MDL, fair baseline, controls, exact SHA3 reconstruction, and replication.
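The exact-reconstruction part of the gate can be sketched with the standard library. This assumes SHA3-256 and an XOR transform purely for illustration; the page only says "exact SHA3 reconstruction", so the digest size, the helper names `sha3_hex`, `decode_xor`, and `verify_exact` are hypothetical.

```python
import hashlib

def sha3_hex(data: bytes) -> str:
    """SHA3-256 digest of the layer, stored alongside the description."""
    return hashlib.sha3_256(data).hexdigest()

def decode_xor(generator_stream: bytes, residual: bytes) -> bytes:
    """Rebuild the layer from generator stream and residual (XOR transform assumed)."""
    return bytes(g ^ r for g, r in zip(generator_stream, residual))

def verify_exact(expected_sha3_hex: str, generator_stream: bytes, residual: bytes) -> bool:
    """True only if the rebuilt layer hashes to the recorded digest."""
    if len(generator_stream) != len(residual):
        return False
    return sha3_hex(decode_xor(generator_stream, residual)) == expected_sha3_hex

layer = b"example layer bytes"
gen = bytes((i * 31 + 5) & 0xFF for i in range(len(layer)))
residual = bytes(b ^ g for b, g in zip(layer, gen))
digest = sha3_hex(layer)
```

A candidate that saves bytes but fails this check is a bug or a lossy description, not a win.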

3. Target layer priority

Final compressed bytes are the hardest layer.

Priority | Layer | R1-adjusted stance
1 | 8Z-internal residual/token layers | Best practical target because 8Z controls the representation and can expose structure before final entropy whitening.
2 | Codec-internal residual/token layers | Strong external-codec target: PNG filters, FLAC LPC residuals, DEFLATE literals, LZ match lengths/distances, JPEG/MP3 coefficient domains.
3 | Synthetic / procedural streams | Positive-control and hero-file layer. Good for proving the harness works.
4 | Selected weak/structured final compressed files | Possible isolated wins, but must beat stronger-codec baselines.
5 | Ordinary final ZIP/7Z/PNG/FLAC/JPG/MP3 bytes | Stress test and likely negative-control layer, not the main expected source of wins.

4. R1 LLM review synthesis

Ten reviewers, one main correction.

The strongest R1 message: descriptor cost and phi_cost are not enough. Empirical same-budget nulls and generator-ablation controls must be mandatory.

Rank | LLM | Paper score | Contribution score | Best contribution
1 | Claude Opus 4.8 | 84 | 94 | Found the key evaluation flaw; required empirical null + generator ablation.
2 | ChatGPT 5.5 Pro | 87 | 90 | Broad synthesis, v0.1 falsification harness, T1–T5 forecast split.
3 | Gemini 3.1 Pro | 88 | 89 | Engineering guardrails: chunk floors, strict seeds, log2 search charge.
4 | MiniMax M3 Thinking | 78 | 88 | Same-codec trap, evaluated trial count, chunk-map and layer costs.
5 | Qwen 3.7 Plus | 88 | 86 | Approve-with-gates and build sequencing.
6 | Grok 4.3 Expert | 81 | 84 | Final bytes as stress/negative target.
7 | Kimi 2.6 Thinking | 78 | 84 | Build-less discipline and codec-internal emphasis.
8 | MetaAI Thinking | 76 | 80 | Lock constants, hard search caps, phi calibration.
9 | GLM 4.7 DeepThink | 75 | 78 | Clean module split and search-budget cost framing.
10 | DeepSeek V4 DeepThink | 72 | 76 | Caution on over-optimism and naming/tightening.

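The "log2 search charge" and "evaluated trial count" points from the review can be illustrated with a few lines of arithmetic. The exact charging rule used by the arena is not stated on this page, so the sketch assumes the simplest version: charge ceil(log2(trials)) bits for having picked the best of `trials` evaluated candidates. The names `search_charge_bits` and `net_saved_bits` are hypothetical.

```python
def search_charge_bits(trials: int) -> int:
    """ceil(log2(trials)) computed with integer arithmetic (assumed charging rule)."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    return (trials - 1).bit_length()

def net_saved_bits(raw_saved_bits: int, trials: int) -> int:
    """Savings left after paying for the search that found the candidate."""
    return raw_saved_bits - search_charge_bits(trials)

charge_250k = search_charge_bits(250_000)      # 18 bits
charge_2_25m = search_charge_bits(2_250_000)   # 22 bits
```

Under this rule the charge grows slowly, which is why the review also asks for empirical same-budget nulls: a log2 charge alone does not catch every lucky hit in a large search.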
5. Evaluation gates

v0.6 changes the acceptance rule from OR to AND.

Old dangerous gate
strict MDL win
OR survives same-budget controls

This is too weak. It lets lucky search survive.

v0.6 gate
strict MDL win
AND beats fair baseline
AND beats generator ablation
AND exceeds same-budget empirical null
AND exact reconstruction verifies

This is the new minimum gate.
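The difference between the two gates is easy to see in code. The sketch below is illustrative only; the class name `GateResult` and its field names are hypothetical labels for the five conditions listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    strict_mdl_win: bool
    beats_fair_baseline: bool
    beats_generator_ablation: bool
    exceeds_empirical_null: bool
    exact_reconstruction: bool

    def old_gate(self) -> bool:
        """Old rule: MDL win OR surviving same-budget controls."""
        return self.strict_mdl_win or self.exceeds_empirical_null

    def v06_gate(self) -> bool:
        """v0.6 rule: every condition must hold."""
        return all((
            self.strict_mdl_win,
            self.beats_fair_baseline,
            self.beats_generator_ablation,
            self.exceeds_empirical_null,
            self.exact_reconstruction,
        ))

# A lucky search hit: shows an MDL win, but the no-op ablation and the null do just as well.
lucky = GateResult(True, True, False, False, True)
solid = GateResult(True, True, True, True, True)
```

The lucky candidate passes the old gate and fails the v0.6 gate, which is the behavior the change is meant to enforce.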

Fair baseline: compare against min(len(B), strong_codec(B), strong_codec(layer)), where strong_codec(·) means the compressed size under a strong codec, not just the source container size.
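A minimal sketch of such a baseline, using only standard-library codecs. The choice of zlib, bz2, and lzma as the "strong codec" set is an assumption for illustration (zstd is left out because it needs a third-party package), and the function names are hypothetical.

```python
import bz2
import lzma
import random
import zlib
from typing import Optional

def strong_codec_len(data: bytes) -> int:
    """Smallest compressed size across the assumed codec set, in bytes."""
    return min(
        len(zlib.compress(data, 9)),
        len(bz2.compress(data, 9)),
        len(lzma.compress(data, preset=9)),
    )

def fair_baseline_len(b: bytes, layer: Optional[bytes] = None) -> int:
    """min(len(B), strong_codec(B), strong_codec(layer)) in bytes."""
    candidates = [len(b), strong_codec_len(b)]
    if layer is not None:
        candidates.append(strong_codec_len(layer))
    return min(candidates)

rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(2048))  # incompressible stand-in
flat = bytes(2048)                                      # trivially compressible stand-in
```

Including `len(B)` in the minimum matters for already-compressed layers: the codecs tend to expand them slightly, and a generator should not get credit for beating an inflated baseline.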

6. Forecast ledger

Bold seed, stricter probabilities.

Target | BD forecast | GPT/R1-adjusted stance
T1 ordinary final compressed files | 80% for 5–15%; 5% for 50% | Too high for average arbitrary final bytes. Likely low single digits for broad average wins.
T2 selected final compressed files | Implicitly included | Possible isolated 5–15% wins on weak/structured/procedural cases.
T3 codec-internal token/residual layers | Not separated in original seed | Material chance of useful wins; probably best external-codec target.
T4 8Z-internal residual/token layers | Not separated in original seed | Strongest practical 8Z target.
T5 synthetic/procedural hero files | High explorer prior | High chance raw/weak layers; lower but important after strong final compression.

7. v0.1 scout contract

Build a falsification harness first.

Include
  • CONST_BYTE_STREAM
  • PRNG_SMALL
  • CA_NOISE: rules 30, 90, 110, 184
  • NO_OP_ABLATION permanent baseline
  • xor, submod256, bitplane-xor
  • zlib/lzma/zstd residual codecs
Delay
  • CONST_COMBO until null calibration is stable
  • LFSR/GF solvers
  • formula synthesis
  • neural/texture models
  • production 8Z encoder integration
Step 1: Synthetic CA positive control

Known CA raw stream should be recovered as STRUCTURED_POSITIVE.

Step 2: Known CA → zstd/lzma → recovery attempt

This tests whether final-byte compression destroys recoverable generator signal.

Step 3: 8Z_Mosaic / Appendix-Q-style residual reproduction

Confirm known structured residual wins under the new harder gates.

Step 4: Codec-internal layers

Move to PNG/FLAC/DEFLATE/LZ token and residual layers.
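The first two tests in this sequence can be sketched as follows. This is a toy harness under assumptions: the CA layout (64-cell ring, single seed cell, MSB-first packing), the XOR transform, zlib as the residual codec, and the names `ca_stream` and `recover_rule` are all hypothetical and not taken from the arena.

```python
import lzma
import zlib

def ca_stream(rule: int, n_bytes: int, width: int = 64) -> bytes:
    """Elementary CA on a ring, rows packed MSB first (width must be a multiple of 8)."""
    cells = [0] * width
    cells[width // 2] = 1
    out = bytearray()
    while len(out) < n_bytes:
        for i in range(0, width, 8):
            out.append(int("".join(map(str, cells[i:i + 8])), 2))
        cells = [
            (rule >> (cells[i - 1] * 4 + cells[i] * 2 + cells[(i + 1) % width])) & 1
            for i in range(width)
        ]
    return bytes(out[:n_bytes])

def recover_rule(target: bytes, rules=(30, 90, 110, 184)):
    """Try each rule with an XOR residual; return the best rule and all coded sizes."""
    scores = {}
    for rule in rules:
        gen = ca_stream(rule, len(target))
        residual = bytes(t ^ g for t, g in zip(target, gen))
        scores[rule] = len(zlib.compress(residual, 9))
    return min(scores, key=scores.get), scores

# Test 1: a known raw CA stream should be recovered.
raw = ca_stream(30, 4096)
best_raw, raw_scores = recover_rule(raw)

# Test 2: same search on the lzma-compressed stream; the outcome is what the test measures.
packed = lzma.compress(raw)
best_packed, packed_scores = recover_rule(packed)
```

On the raw stream the matching rule yields an all-zero residual and wins clearly. For the compressed stream the sketch makes no prediction; comparing `packed_scores` against a fair baseline of `packed` is the actual experiment.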

8. Python arena status

v0.3.3 balanced master passed: harness yes, production compression no

This is the website-facing result layer. The arena writes local diagnostic reports inside each run folder, but public claims belong here only after the live extractor / updater has summarized the run and the results have been deduplicated.

Current interpretation: the scout recovered known generated structure, found representation-layer signals, and kept reported random FPR at zero. This validates the harness and the representation-layer branch. It does not prove that ordinary ZIP/7Z/PNG/FLAC/JPG/MP3 final files can already be made 5–15% smaller.
Runs found

6

6 finished · 0 ongoing or partial.

Candidates tested

2,254,006

Total logged candidate trials visible in summaries / JSONL logs.

Candidate positives

2,616

Candidate-level structured positives. These are not yet unique file-level discoveries.

Synthetic positives

8

Positive controls proving the harness can recover known generated structure.

Final-byte positives

0

Ordinary final compressed byte wins remain unproven.

Max random FPR

0

Reported random false-positive rate in the public summaries.

Reporting cleanup: older visible tables repeated many rows from the same target, especially realish_s12345_numeric_payload_zlib.bin. The page now separates candidate positives from unique/promoted discoveries. The next arena/reporting patch should produce unique_positive_summary.json.
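A possible shape for that patch is sketched below. The dedupe key (file, layer, generator, transform, seed) follows the fields this page names for the reporting polish; everything else is an assumption, including the row field names, the `saved_bytes` field, and the output layout. The rows are placeholder data, not arena results.

```python
import json
from collections import defaultdict

DEDUPE_KEYS = ("file", "layer", "generator", "transform", "seed")

def unique_positive_summary(rows):
    """Collapse candidate-level positives into one entry per dedupe key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in DEDUPE_KEYS)].append(row)
    unique = []
    for key, members in groups.items():
        entry = dict(zip(DEDUPE_KEYS, key))
        entry["candidate_hits"] = len(members)
        entry["best_saved_bytes"] = max(m["saved_bytes"] for m in members)
        unique.append(entry)
    return {
        "candidate_positives": len(rows),
        "unique_positives": len(unique),
        "unique": unique,
    }

# Placeholder rows with hypothetical field names and values.
rows = [
    {"file": "example_a.bin", "layer": "codec-internal", "generator": "GEN_X",
     "transform": "xor", "seed": 1, "saved_bytes": 120},
    {"file": "example_a.bin", "layer": "codec-internal", "generator": "GEN_X",
     "transform": "xor", "seed": 1, "saved_bytes": 150},
    {"file": "example_b.bin", "layer": "raw", "generator": "GEN_Y",
     "transform": "submod256", "seed": 2, "saved_bytes": 40},
]
summary = unique_positive_summary(rows)
summary_json = json.dumps(summary, indent=2)  # content for unique_positive_summary.json
```

Keeping both counts in the output makes the gap between candidate-level hits and unique discoveries visible in every report.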

What v0.3.3 proves vs does not prove

Item | Status | Public interpretation
Selftest + smoke | PASS | Basic engine, controls, and synthetic recovery path are working.
Balanced real-layer panel | 64/64 complete | The arena can run a real representation-layer panel with heartbeats and bounded logging.
Candidate positives | 2,616 | Interesting search signal, but must be deduplicated and replicated before becoming a claim.
Control failures visible | 30/176 | Good reason to preserve controls and audit tables; do not hide rejected rows.
Final-byte wins | 0 | The hard final-compressed-file claim remains unproven and should stay separate.
Random FPR | 0 reported | Promising, but keep same-budget, shuffle, block-bootstrap, Markov, and generator-ablation controls.

Run summary

Run | Mode | Verdict | Progress | Candidates | FPR | Candidate + | Best visible hit
standard_balanced_4w | panel | REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED | 64/64 (100.0%) | 2,250,000 | 0.0 | 2,600 | realish numeric zlib payload · PRED_LINEAR_U8 · xor-prev · 93.95%
smoke | panel | CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED | 7/7 (100.0%) | 42 | 0.0 | 2 | synthetic CA rule90 PNG IDAT · 99.76%
selftest | panel | SYNTHETIC_RECOVERY_ONLY | 2/2 (100.0%) | 41 | 0.0 | 1 | synthetic CA rule90 raw stream · 99.88%
quick_panel_4w | panel | REAL_REPRESENTATION_LAYER_SIGNAL_REVIEW_REQUIRED | 32/32 (100.0%) | 3,840 | 0.0 | 10 | synthetic CA seeded PNG IDAT · 99.76%
quick_smoke | panel | CODEC_INTERNAL_STRUCTURED_SIGNAL_REVIEW_REQUIRED | 7/7 (100.0%) | 42 | 0.0 | 2 | synthetic CA rule90 PNG IDAT · 99.76%
selftest duplicate | panel | SYNTHETIC_RECOVERY_ONLY | 2/2 (100.0%) | 41 | 0.0 | 1 | synthetic CA rule90 raw stream · 99.88%

Deduped visible hit families

Family | Layer | Best visible file / case | Best visible savings | Generator | Interpretation
Synthetic CA raw | raw | synthetic_ca_rule90_64k.bin | +32,728 B · 99.88% | CA_RULE_90 + xor | Positive control. The harness can recover known generated structure.
Synthetic CA inside PNG/IDAT | codec-internal | synthetic_ca_rule90_png_idat.png | +16,344 B · 99.76% | CA_RULE_90 + xor | Positive control through a codec-internal representation.
Synthetic CA then zlib | codec-internal | synthetic_ca_rule90_then_zlib.bin | +16,344 B · 99.76% | CA_RULE_90 + xor | Positive control showing recoverable generated structure before/inside compressed representation.
Numeric structured payloads | codec-internal | realish_s12345_numeric_payload_zlib.bin | +3,448 B · 93.95% in balanced; +12,936 B · 96.97% in quick panel | PRED_LINEAR_U8 + xor-prev | Interesting representation-layer signal. Needs dedupe, holdout, stronger baselines, and replication.

Public claim level: harness validation + representation-layer signal. Not production compression, not final-byte victory, not proof of BD’s 5–15% forecast yet.

Generated from v0.3.3 balanced results. Max reported random FPR: 0. Control-failure rows visible in prior public summary: 30/176. For forensic analysis, use the live bundle ZIP, not only this public page.

9. Failure modes

The scout must not become a coincidence miner.

False positives

Huge offset/rule/transform searches can find lucky local wins.
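One way to measure how much of a win is luck is a same-budget empirical null: run the identical search, with the same number of candidates, on shuffled copies of the target and compare. The sketch below is a toy version under assumptions; the seeded-PRNG generator family, the XOR transform, zlib scoring, and all function names are hypothetical, and shuffling is only one of the null constructions the page lists.

```python
import random
import zlib

def prng_stream(seed: int, n: int) -> bytes:
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))

def best_saving(target: bytes, seeds) -> int:
    """Best byte saving over the baseline found by searching the given seeds."""
    baseline = len(zlib.compress(target, 9))
    best = 0
    for seed in seeds:
        gen = prng_stream(seed, len(target))
        residual = bytes(t ^ g for t, g in zip(target, gen))
        best = max(best, baseline - len(zlib.compress(residual, 9)))
    return best

def empirical_null(target: bytes, seeds, n_null: int = 10, shuffle_seed: int = 0):
    """Same search, same budget, on byte-shuffled copies of the target."""
    rng = random.Random(shuffle_seed)
    savings = []
    for _ in range(n_null):
        shuffled = bytearray(target)
        rng.shuffle(shuffled)
        savings.append(best_saving(bytes(shuffled), seeds))
    return savings

def exceeds_null(real_saving: int, null_savings, margin: int = 0) -> bool:
    return real_saving > max(null_savings) + margin

seeds = range(16)
in_family = prng_stream(7, 1024)      # reachable by the search
out_of_family = prng_stream(999, 1024)  # not reachable by the search
real = best_saving(in_family, seeds)
null = empirical_null(in_family, seeds)
control = best_saving(out_of_family, seeds)
```

A real hit stands far above the null distribution. A target the search cannot reach produces savings in the same small range as the shuffled copies, which is what a coincidence looks like under equal budget.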

Weak baseline trap

If a stronger codec alone wins, do not credit math.

Tiny chunk overfit

Small chunks can create fake victories. Report them only as debug.

π-only overfocus

π is a deterministic shared source, but the tested signal may come from CA rules, mappings, geometry, or layer choice. The arena must measure which part actually carries compression value.

10. Next action

Next: run longer, but in controlled waves.

Yes, this arena should eventually run for days or even weeks. But the right move is not blind brute force across every family, chunk size, transform, and target. The balanced pass already showed the branch is alive; now the goal is to turn candidate-level signal into unique, replicated, control-surviving discoveries.

Do longer runs

Use multi-day compute where it matters: real representation layers, codec-internal residual/token streams, 8Z-internal payloads, and families that survived the balanced panel.

Do not just scale noise

Do not spend weeks mining final compressed byte soup or tiny chunks until dedupe, controls, and reporting are strong enough to avoid fake discoveries.

Seed → bridge → test → result

Seed: v0.3.3 found clean synthetic recovery and representation-layer candidate positives.
Bridge: dedupe + controls + targeted deep search turn candidate positives into evidence.
Test: run staged deep panels with live extracts, unique-positive summaries, and holdout replication.
Result: promote only real layer families that beat fair baselines after search cost and controls.

Recommended run ladder

Stage 0: v0.3.4 reporting polish

Add unique_positive_summary.json, dedupe by file/layer/generator/transform/seed, and split candidate positives vs unique positives vs replicated positives. This should happen before week-scale runs.

Stage 1: 12–24h confirmation run

Run real-layer panel at 250k candidates per target. Keep all controls and live-extract every few hours.

RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000

Stage 2: 2–3 day deep representation-layer run

Use only real/codec-internal representation layers. Keep top-k logging, same-budget nulls, shuffle, block-bootstrap, Markov, and generator ablation. Avoid final-byte stress except as a separate negative control.

RUN_50_WEEKEND_DEEP_REAL_LAYER_NO_PAUSE.bat 4 500000

Stage 3: Weekend / week Atlas-candidate run

Spend the large budget only on families that survive Stage 1–2. Require replicated positives across seeds and holdouts before calling anything an Atlas candidate.

RUN_00_MASTER_ALL_TESTS_NO_PAUSE.bat 4 1000000 250000 weekend 0

Stage 4: True-final stress check

Run final compressed bytes separately as a hard negative/stress layer. A zero result is useful because it confirms the paper’s strongest caution: final bytes are the hardest layer.

RUN_40_TRUE_FINAL_STRESS_NO_PAUSE.bat 4 50000 balanced

My recommendation: do one reporting hotfix first, then run RUN_30_REAL_LAYER_PANEL_NO_PAUSE.bat 4 250000. If the same families survive, then launch the weekend/deep run. Longer compute is justified, but only after the arena can tell unique discovery from repeated candidate hits.