mRHP — RHP that improves RHP

AIm³ MentalArena

A human-seed, multi‑LLM reasoning laboratory. BD gives the seed; LLMs amplify; MAL synthesizes; builders build; arenas, data, and evidence decide.

v0.98 benchmark hardening · 6 Mental Modes · 4 Mental Engines · AI-ONLY / HUMAN × AI · public/local split · no API keys on public page · 27-domain world application map · Prize Frontier · verified 19 Jun 2026 · MAL → AI8 / Reasojjng

Purpose

Not a chatbot. A reasoning workflow lab.

MentalArena is an external operational learning layer above LLMs. It does not train or modify models. It improves protocols, context packs, scoring, lineage, builder prompts, evidence reports, and next-run seeds.

Core rule: unusual human seeds are not rejected at the idea stage. Objections become assumptions, variants, controls, tests, metrics, risk labels, or implementation warnings.

Mental Modes

Mode chooses the epistemic style.

Innovation

Seed Amplifier

For new concepts and speculative invention. No seed rejection; convert critique into tests.

Builder Prompt

Code Spec

For producing the strongest task-level builder prompt from sources, requirements, and acceptance gates.

Factual

Source-Grounded

For history/science/fact-heavy articles. Source ledger and uncertainty outrank style.

Creative-Realistic

System Upgrade

For MAL/app/protocol upgrades: imaginative but buildable, with constraints and MVP paths.

Artistic

Story / Scene

For Nara, stories, worldbuilding, emotional truth, beauty, voice, and canon.

Audit

Evidence

For ZIP/log/code/result review. Attack claims and evidence, not the human seed.

Mental Engines

Engine chooses how the multi‑LLM brain thinks.

Engine	Purpose	Default use
E1 Flash Council	One independent multi‑LLM round + synthesis.	Quick breadth, cheap baseline.
E2 MAL Standard	Structured council → conflict/coverage → second round → synthesis.	Builder prompts, system upgrades, evidence audit.
E3 Roundtable	Multi‑round brainstorming where models hear the shared room memory.	Innovation, writers’ room, hard architecture.
E4 Synthesis Cascade	Roundtable → synthesis_1 → second roundtable on synthesis_1.	Critical final decisions only.
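The default uses in the engine table can be read as a small lookup. The sketch below is illustrative only: the `Engine` enum, the use-case keys, and `pick_engine` are hypothetical names, and falling back to E1 for unknown use cases is an assumption rather than a documented MAL rule.

```python
from enum import Enum


class Engine(Enum):
    E1 = "Flash Council"
    E2 = "MAL Standard"
    E3 = "Roundtable"
    E4 = "Synthesis Cascade"


# Use-case keys are hypothetical labels derived from the "Default use" column.
DEFAULT_ENGINE = {
    "quick_breadth": Engine.E1,
    "builder_prompt": Engine.E2,
    "system_upgrade": Engine.E2,
    "evidence_audit": Engine.E2,
    "innovation": Engine.E3,
    "writers_room": Engine.E3,
    "hard_architecture": Engine.E3,
    "critical_final_decision": Engine.E4,
}


def pick_engine(use_case: str) -> Engine:
    """Return the default engine; unknown use cases fall back to E1 (assumption)."""
    return DEFAULT_ENGINE.get(use_case, Engine.E1)
```

For example, `pick_engine("innovation")` returns `Engine.E3`, while an unlisted use case returns the cheap E1 baseline.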

World application map

Where MAL can create unusually high value.

MentalArena is strongest when a problem needs several genuinely different viewpoints, the cost of a missed error is high, human tacit knowledge matters, and the result can be tested against sources, code, data, experiments, or downstream outcomes.

The full MAL loop:
human seed or frozen task → independent model attempts → disagreement and coverage map → a small high-information human question queue → second round → synthesis or build specification → implementation → external test → next improved iteration.
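One step of this loop, turning independent model attempts into a disagreement and coverage map, can be sketched as follows. This is a minimal illustration, assuming each attempt has already been reduced to a set of short claim strings; the function names and the sample claims are hypothetical and not part of MAL.

```python
from collections import defaultdict


def coverage_map(attempts: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each claim to the sorted list of models that asserted it."""
    claim_to_models: dict[str, list[str]] = defaultdict(list)
    for model in sorted(attempts):
        for claim in attempts[model]:
            claim_to_models[claim].append(model)
    return dict(claim_to_models)


def split_claims(attempts: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (consensus, contested) claims, each sorted alphabetically."""
    total = len(attempts)
    cmap = coverage_map(attempts)
    consensus = sorted(c for c, models in cmap.items() if len(models) == total)
    contested = sorted(c for c, models in cmap.items() if len(models) < total)
    return consensus, contested


# Hypothetical independent first attempts from three models.
attempts = {
    "model_a": {"cache results", "add retry"},
    "model_b": {"cache results", "rewrite parser"},
    "model_c": {"cache results", "add retry"},
}
consensus, contested = split_claims(attempts)
```

Contested claims are the natural input for the human question queue and the second round; consensus claims still need external tests, since agreement among models is not proof.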

MAL value rises with:
• viewpoint diversity
• cost of missed error
• external verifiability
• human tacit knowledge
• traceability need
• parallelizable work

It falls with:
• extreme urgency
• privacy exposure
• strictly sequential work
• unverifiable outputs

Evidence status: the 0–100 numbers below are strategic fit estimates, not benchmark results. They state where MAL's architecture appears promising. Domain-specific blind tests must decide whether MAL actually beats a strong single-model baseline, a simple workflow, or a human expert process.

Engineering, AI, and digital systems

Software and AI-system development · 98 / 100

Best role: turn ambiguous requirements into buildable architecture, implementation gates, and evidence.

Legacy-system upgrade: add a major capability or migrate architecture without silently breaking accepted behavior, file formats, APIs, or release workflows.
Failed-build diagnosis: explain why code passes smoke tests but fails on real datasets, production traffic, edge cases, or performance limits.
Agent/product architecture: decide between a single model, RAG, tools, multi-agent orchestration, or deterministic code, then define the smallest decisive build and eval.

Recommended: Builder Prompt / Audit · E2; E3 for novel architecture

LLM evaluation and AI governance · 97 / 100

Best role: run blind, cost-aware comparisons and preserve dissent instead of choosing a model by reputation.

Model and workflow selection: choose the best model, RAG stack, or agent workflow for coding, support, analysis, or research under quality, latency, cost, and privacy constraints.
Orchestration ablation: test whether a multi-agent council actually beats one strong model when the seed, sources, builder, and evaluator are frozen.
AI-system risk audit: locate hallucination, prompt-injection, data-leakage, policy, calibration, and human-oversight failures before production promotion.

Recommended: Audit / Evidence · E2 or E4 for final release decisions

Complex technical R&D · 95 / 100

Best role: keep unconventional ideas alive long enough to convert them into competing mechanisms and kill-tests.

New machine or energy concept: split a speculative mechanism into variants, constraints, failure modes, simulations, and the smallest physical prototype.
Robotics/control architecture: resolve conflicting sensor, actuator, controller, redundancy, power, safety, and cost choices.
Algorithmic invention: explore a large design space, preserve minority hypotheses, and define controls that distinguish a real algorithmic gain from benchmark overfit.

Recommended: Creative-Realistic / Innovation · E3

Technical audit and due diligence · 94 / 100

Best role: act as an evidence court across claims, source files, code, logs, and tests.

Startup technology claim: verify whether a claimed benchmark, compression gain, AI capability, or trading result survives matched controls and fresh data.
Vendor acceptance: compare contract requirements with the delivered repository, test reports, security posture, and missing evidence.
Research-package forensics: find leakage, lookahead, impossible accounting, open-tail dependence, cherry-picked runs, or absent negative results in a ZIP bundle.

Recommended: Audit / Evidence · E2

Safety engineering · 93 / 100

Best role: generate independent hazard views, merge them into a safety case, and demand testable mitigations.

FMEA and hazard analysis: identify coupled failures in a battery system, robot, industrial machine, medical device, or autonomous subsystem.
Safety-case construction: connect every safety claim to evidence, assumptions, owner, test, residual risk, and operating boundary.
Recovery design: compare graceful degradation, fail-safe shutdown, redundancy, alarm, and human-takeover strategies under cascading faults.

Recommended: Audit · E2; E3 for architecture redesign

Cybersecurity · 91 / 100

Best role: coordinate competing attack hypotheses and evidence without autonomously executing defensive actions.

Incident root cause: decide whether conflicting telemetry supports stolen credentials, application exploit, insider action, supply-chain compromise, or false positive.
Threat model: map attack paths through an AI-enabled SaaS product, model supply chain, RAG corpus, tools, identities, and third-party integrations.
Remediation priority: rank fixes by exploitability, blast radius, evidence strength, recovery cost, and dependency rather than by the loudest team opinion.

Recommended: Audit · E2 or E3; humans authorize all actions

Science, health, and evidence-intensive work

Scientific research · 96 / 100

Best role: generate competing hypotheses, explicit falsifiers, and experiments rather than one polished story.

Unexplained anomaly: produce mechanistically different explanations, list what each predicts, and design the cheapest experiment that separates them.
Replication planning: audit an influential result for hidden degrees of freedom, confounders, missing controls, and a preregistered replication design.
Research-program synthesis: turn a fragmented literature into a claim ledger, unresolved conflicts, prioritized hypotheses, and a sequence of empirical tests.

Recommended: Innovation / Factual · E3

Pharma and biotechnology · 90 / 100

Best role: support research design and evidence review under expert oversight.

Mechanism and assay design: compare biological mechanisms, biomarkers, assay choices, off-target risks, and experiments that could reject each route.
Contradictory preclinical evidence: reconcile cell, animal, omics, and literature findings while exposing where the datasets genuinely disagree.
Trial-protocol review: challenge endpoints, eligibility, stratification, safety monitoring, operational feasibility, and failure-to-learn risks before launch.

Recommended: Factual / Innovation · E3; domain experts remain accountable

Clinical multidisciplinary decision support · 84 / 100

Best role: prepare a clinician-led review; never act as an autonomous diagnostician or prescriber.

Complex case conference: organize imaging, pathology, labs, history, guidelines, and specialist disagreements for a tumor board or rare-disease meeting.
Missing-information analysis: identify which additional test, history item, or specialist opinion would most change the current decision.
Guideline conflict: show why recommendations differ across patient subgroups, comorbidities, values, contraindications, and evidence quality.

Recommended: Factual / Audit · E2; licensed clinicians decide

Investigative journalism · 83 / 100

Best role: build a source and contradiction ledger without inventing missing facts.

Conflicting public claims: map who said what, when, on which evidence, and which claims are unsupported or mutually incompatible.
Timeline and entity network: connect documents, companies, payments, decisions, people, and events while preserving provenance.
Next reporting move: identify the three missing documents, interviews, data requests, or verification steps most likely to resolve the story.

Recommended: Factual · E2; publication requires editorial verification

Patents, grants, and public procurement · 82 / 100

Best role: apply explicit criteria, preserve minority objections, and expose evidence gaps.

Patent landscape: compare novelty, claim overlap, obviousness arguments, design-around routes, and uncertain prior art.
Grant review: separate scientific importance, methodological quality, feasibility, team capacity, and high-risk/high-reward value.
Procurement decision: compare suppliers on total cost, lock-in, interoperability, security, evidence quality, implementation risk, and hidden exclusions.

Recommended: Audit · E2; legal and procurement officers own decisions

Institutions, law, markets, and operations

Regulation, standards, and compliance · 92 / 100

Best role: turn broad obligations into a traceable evidence matrix and remediation plan.

Requirement mapping: map an AI or software system against applicable law, standards, internal policy, risk tier, and responsible owner.
Audit-readiness package: connect every control to implementation evidence, test result, exception, residual risk, and review date.
Supplier compliance gap: identify what a vendor proves, merely asserts, omits, or cannot contractually guarantee.

Recommended: Factual / Audit · E2

Business strategy and scenario planning · 90 / 100

Best role: prevent one attractive strategy narrative from crowding out alternatives and kill-tests.

Competitive response: model how incumbents, low-cost entrants, regulators, suppliers, and customers could react to a new product.
Market-entry decision: compare geographies, segments, pricing, partnerships, operational burden, and downside scenarios.
Pre-mortem: identify the smallest early signals that would show a major investment, acquisition, or product thesis is wrong.

Recommended: Creative-Realistic · E3

Public policy and government · 88 / 100

Best role: prepare transparent options and tests; democratic legitimacy remains human and institutional.

Housing-policy design: compare affordability, supply, tenant protection, fiscal cost, land use, legal constraints, and second-order effects.
Transport or health intervention: combine citizen, operator, budget, equity, safety, and implementation perspectives into pilotable options.
Policy evaluation: define who should benefit, measurable outcomes, counterfactuals, data needs, failure criteria, and an adaptation path.

Recommended: Factual · E2 or E3; public accountability stays human

Finance and investment research · 87 / 100

Best role: function as a research and risk committee, not an autonomous trading or investment executor.

Bull/bear investment case: independently analyze earnings quality, cash flow, balance sheet, competition, valuation, catalysts, and failure modes.
M&A due diligence: challenge revenue quality, customer concentration, technology claims, liabilities, integration risk, and downside scenarios.
Model/backtest risk: search for leakage, overfit, unrealistic fills, regime dependence, hidden leverage, open-tail exposure, and weak controls.

Recommended: Audit · E2; no autonomous execution

Law and contracts · 87 / 100

Best role: support issue spotting and adversarial review; qualified counsel remains responsible.

Contract ambiguity: expose conflicting definitions, liability gaps, termination traps, data rights, service levels, and unenforceable assumptions.
Litigation preparation: build both sides' strongest arguments, evidence dependencies, credibility risks, and unresolved factual questions.
Regulatory interpretation: compare plausible readings, jurisdictional differences, precedent, operational consequences, and questions requiring formal advice.

Recommended: Factual / Audit · E2; lawyer review required

Energy, climate, and infrastructure · 86 / 100

Best role: integrate engineering, economics, environment, resilience, and public constraints.

Grid and storage plan: compare generation, storage, transmission, demand response, reliability, cost, permitting, and extreme-weather performance.
Climate-adaptation portfolio: prioritize flood, heat, water, wildfire, coastal, and health measures under uncertain budgets and futures.
Infrastructure siting: compare routes or sites on lifecycle cost, safety, ecology, community impact, construction risk, and long-term resilience.

Recommended: E3 exploration → E2 evidence audit

Supply chains and operations · 84 / 100

Best role: separate symptoms from root causes and test policies across disruption scenarios.

Repeated delivery failure: distinguish forecast error, supplier quality, capacity, scheduling, logistics, incentives, and data problems.
Inventory policy: compare service level, working capital, perishability, volatility, lead time, substitution, and shortage cost.
Supplier resilience: identify single points of failure, hidden tier dependencies, recovery options, dual-source tradeoffs, and trigger thresholds.

Recommended: E2

Product development and UX · 82 / 100

Best role: synthesize conflicting research into testable product choices rather than a feature wish list.

Conflicting user evidence: separate real needs, stated preferences, segment differences, accessibility needs, and research artifacts.
MVP definition: choose the smallest feature set that tests the core value proposition with explicit success and failure criteria.
Edge-case audit: examine onboarding, failure recovery, mobile behavior, accessibility, trust, privacy, and high-stress user journeys.

Recommended: E2; E3 for new product concepts

People, learning, creativity, and society

Education and curriculum · 80 / 100

Best role: combine subject expertise, pedagogy, assessment, accessibility, and teacher judgment.

Curriculum redesign: sequence concepts for mixed prior knowledge while preserving depth, practical transfer, and prerequisite structure.
Assessment design: measure reasoning and application rather than rote recall or easy AI-generated output.
Learning-gap intervention: identify why a learner is stuck and propose targeted exercises, explanations, and teacher checkpoints.

Recommended: E2 or E3; teachers remain accountable

Film, books, games, and writers' rooms · 80 / 100

Best role: preserve a distinctive human vision while using independent creative lenses and continuity checks.

Story repair: fix a weak second act, predictable twist, flat antagonist, or emotional arc without losing the core theme.
Franchise continuity: reconcile canon, character voice, world rules, timeline, visual motifs, and future sequel constraints.
Game narrative and mechanics: align player agency, progression, difficulty, economy, story, production cost, and replay value.

Recommended: Artistic · E3; E4 for flagship final synthesis

Humanitarian and crisis planning · 79 / 100

Best role: support local experts with scenario breadth and resource tradeoffs; never treat affected communities as mere data.

Aid allocation: compare urgency, access, vulnerability, logistics, security, equity, local capacity, and uncertainty.
Evacuation and shelter plan: test routes, transport, medical needs, communication failures, weather, congestion, and special populations.
Response coordination: reconcile government, NGO, community, medical, logistics, and donor constraints into a staged plan.

Recommended: E3 + local human experts and accountable authorities

Agriculture and food systems · 77 / 100

Best role: integrate local practice with agronomy, weather, economics, logistics, and environmental constraints.

Crop strategy: compare crop mix, soil, water, weather risk, input cost, labor, market demand, and rotation.
Pest or disease response: distinguish likely causes, surveillance needs, intervention options, resistance risk, and economic thresholds.
Food-chain loss: reduce spoilage and waste across harvest timing, storage, cold chain, transport, processing, and demand planning.

Recommended: E2 or E3; local agronomists and farmers lead

HR and organizational design · 64 / 100

Best role: improve systems and roles; never automate hiring, firing, promotion, or disciplinary decisions.

Decision bottleneck: identify unclear authority, duplicated approvals, missing ownership, information delay, and incentive conflict.
Role redesign: define responsibilities, interfaces, escalation rules, success measures, and transition risks during growth or reorganization.
Organizational postmortem: separate individual blame from process, workload, tooling, communication, and governance failures.

Recommended: E2 + human review; no automated employment decisions

Major personal decisions · 61 / 100

Best role: clarify values, assumptions, options, and experiments; the person retains responsibility.

Career change: compare meaning, income, risk, skill gap, location, family effects, reversibility, and a low-cost trial.
Relocation or education choice: structure financial, social, legal, health, opportunity, and long-term tradeoffs.
Care decision: organize family preferences, medical information, practical support, cost, uncertainty, and questions for professionals.

Recommended: HUMAN × AI · E2; not a substitute for professional advice

Where MAL should not be the default

Routine support and simple summaries · 30 / 100

Better tool: one model, search, or deterministic software. MAL adds avoidable cost and latency.

Simple transformation: translate a page, rewrite an email, format a document, or summarize a short supplied text.
Deterministic task: calculate a value, validate a schema, rename files, or convert a known data format.
Routine classification: route a standard support ticket, answer a stable FAQ, or extract fields from a familiar template.

Recommended: single model, E1 at most, or ordinary code

Real-time and safety-critical control · 15 / 100

Better tool: deterministic, tested, certified control systems. MAL may assist design and audit, never the control loop.

Immediate physical control: flight control, industrial interlock, robot emergency stop, or power-grid protective relay.
Millisecond execution: high-frequency order routing, collision avoidance, or latency-bound actuator control.
Emergency clinical signal: life-critical alarm, dosage actuation, ventilator control, or autonomous treatment decision.

Recommended: MAL only for architecture, simulation, red-team, and evidence review

Likely product families

MAL Build

Software and system engineering

From source bundle and human seed to architecture, builder prompt, implementation gates, code audit, and release evidence.

MAL Research Council

Hypotheses and experiments

Competing mechanisms, falsifiers, literature conflicts, expert questions, experimental design, and next-run learning.

MAL Evidence Court

Audit and due diligence

Claim ledger, source ledger, dissent, missing evidence, controls, acceptance matrix, and go/hold/reject verdict.

MAL AI Evaluation Lab

Models and workflows

Frozen-seed comparisons, blind candidates, cost ledgers, evaluator matrices, safety gates, and routing decisions.

MAL Compliance Council

Requirements to proof

Legal and standards mapping, control ownership, evidence packages, exceptions, remediation, and audit trail.

MAL Writers' Room

Stories and worlds

Independent creative lenses, dialogue, structure, canon, emotional arc, production realism, and human final vision.

Global rollout lens

European Union

Governance-heavy adoption

Strong fit for compliance, public-sector analysis, healthcare research, finance, industrial safety, and multilingual evidence trails.

North America and UK

Engineering and diligence

Strong fit for software, biotech, cybersecurity, venture/M&A technical review, AI evaluation, and professional services.

East Asia

Industrial systems

Strong fit for manufacturing quality, robotics, semiconductors, supply chains, technical R&D, and reliability engineering.

Emerging markets and international organizations

Affordable multidisciplinary council

Potentially valuable where small teams need broad expertise—provided local people are decision partners, not merely data sources.

Honest positioning: MAL is not the first multi-agent system. Its intended differentiation is the combination of a human seed, multiple providers, independent first attempts, preserved disagreement, selective human questions, builder handoff, external evidence gates, cost tracking, and iterative memory. The goal is not “more agents.” The goal is a better decision-and-evidence loop.

MAL Prize Frontier

Use real prize challenges as external judges.

After MAL-on-MAL, the Trading Arena ablation, and ARC‑AGI‑3, the strongest next move is not another internal demo. It is a standing frontier of externally defined problems with deadlines, judges, rules, and money at stake. The prize is useful; the independent evaluation is more valuable.

Recommended sequence — with a continuous radar

Step 0

MAL improves MAL

Use v0.98 to diagnose and build the smallest credible v1.0, then lock the testable version.

Step 1

Trading A–F benchmark

Compare direct prompting, E1–E4, and the manual gold reference using the same seed, builder, data, and evidence gates.

Step 2

ARC‑AGI‑3

Move from prompt quality to interactive agents that must explore, model, set goals, plan, act, and correct themselves.

Step 3

Prize Frontier

Run fast eligibility and kill-tests across software, forests, water, health, energy, logistics, and hardware.

Step 4

Cross-domain proof

Publish what transferred, what failed, cost per useful result, and which pieces should move into AI8 / Reasojjng.

Important scheduling correction: the roadmap is not fully sequential. The Prize Frontier should run continuously in lightweight triage mode while the core benchmarks continue. A good challenge should not be missed merely because its deadline arrives before ARC‑AGI‑3 is finished.

MAL opportunity score ≈ problem fit × eligibility × data access × prototype speed × evidence value × strategic learning × prize value ÷ (time × capital × partner risk × execution risk)
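The opportunity score above can be computed directly once each factor has a number. The sketch below is a hedged illustration: the factor names mirror the formula, but the scales (for example 0 to 1 for benefits, positive multipliers for costs) and the sample values are assumptions, since the page does not define units.

```python
from math import prod

BENEFITS = (
    "problem_fit", "eligibility", "data_access", "prototype_speed",
    "evidence_value", "strategic_learning", "prize_value",
)
COSTS = ("time", "capital", "partner_risk", "execution_risk")


def opportunity_score(factors: dict[str, float]) -> float:
    """Product of benefit factors divided by product of cost factors."""
    missing = [name for name in BENEFITS + COSTS if name not in factors]
    if missing:
        raise KeyError(f"missing factors: {missing}")
    cost = prod(factors[name] for name in COSTS)
    if cost <= 0:
        raise ValueError("cost factors must multiply to a positive number")
    return prod(factors[name] for name in BENEFITS) / cost


# Hypothetical challenge: good fit, uncertain eligibility, double time cost.
example = {name: 1.0 for name in BENEFITS + COSTS}
example.update({"problem_fit": 0.8, "eligibility": 0.5, "time": 2.0})
score = opportunity_score(example)

# Because the formula is multiplicative, a zero on any benefit acts as a hard gate.
ineligible = dict(example, eligibility=0.0)
gated = opportunity_score(ineligible)
```

The multiplicative form is the useful design property here: an ineligible or data-starved challenge scores zero no matter how large the purse is.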

Priority lane — plausible near-term targets

ARC Prize 2026 — ARC‑AGI‑3

Direct · Milestones 30 Jun / 30 Sep · Final 2 Nov

$850K track
$2M program

Problem: build an offline agent that enters novel interactive environments without instructions and learns what to do through efficient exploration and action.

MAL attack

Generate genuinely different agent architectures, preserve dissent about memory and search, build the smallest local simulator, then use ablations to determine which modules add score rather than narrative complexity.

First decisive test: can one local, reproducible agent beat random and hand-coded baselines on unseen developer environments under the same compute budget?

Official ARC‑AGI‑3 rules ↗

3D Surface Fuels & Vegetation Modeling

Direct in principle · Initial submission 20 Jul

$85,100

Problem: transform sensor or remote-sensing inputs into usable georeferenced surface-fuel and understory maps, with 3D outputs, validation, visualisation, and a Python ingestion tool.

MAL attack

Run a geospatial, forestry, computer-vision, uncertainty, and validation council; compare LiDAR, photogrammetry, multispectral, and public-data paths; then build one end-to-end reproducible pipeline rather than a broad concept paper.

First decisive test: can public or obtainable data yield one scientifically defensible 100 m × 100 m mapped site in an accepted output format? Rules admit adults from the U.S. or a NATO-affiliated ally, subject to verification.

Official challenge ↗

RELX Environmental Challenge

Global · 21+ · 12 Jul · 23:59 GMT

2 × $75,000

Problem: propose a practical, scalable solution either for safe water and sanitation or for ocean health, ecosystem protection, sustainability, and resilience.

MAL attack

Start with a specific community and measurable failure, combine technical design with economics, equity, local adoption, and replication, then force the council to produce a pilot plan and falsifiable impact model.

First decisive test: can we identify a real local partner, baseline metric, and pilot that can start without winning the prize?

Official 2026 rules ↗

Forest & Nature Tech Pitching Competition

Startup / R2B team · 11 Sep

€5,000

Problem: develop a validated, commercially credible solution for forests, biodiversity, sustainable materials, monitoring, soils, or nature data.

MAL attack

Use MAL as a virtual founding team: choose one narrow pain point, map users and buyers, design the minimum prototype, pressure-test scientific validity, and build a clear path from research to field adoption.

First decisive test: do we have or can we create a small validated demonstration and a credible early-stage team before September?

Official competition ↗

Partner lane — high-value challenges that need an eligible organisation or specialist build team

EU Prize for Governance Innovations in Energy Communities

Eligible energy-community partner · 25 Jun

€1,000,000

Problem: demonstrate inclusive, transparent, effective governance and fair benefit sharing in an existing EU renewable or citizen energy community.

MAL role

Rapidly audit governance, participation, financing, benefit sharing, local-policy alignment, and evidence; turn the best existing practice into a judge-aligned submission without inventing results.

Gate before any work: an already eligible energy community with up to 10,000 members must agree to apply immediately. Without that partner, skip this cycle.

European Commission prize ↗

HHS Digital Stockpile & Manufacturing Response Network

Manufacturing / logistics team · 28 Aug

Up to $2.04M

Problem: design a resilient system that can rapidly produce and distribute critical medical supplies during public-health emergencies and supply-chain disruption.

MAL role

Build independent architectures for digital inventory, distributed manufacturing, qualification, cybersecurity, logistics, and emergency command; then stress them against disruption scenarios and convert the winner into an eight-page design, blueprint, and pitch.

Gate before any work: secure manufacturing, emergency-response, and U.S. eligibility expertise; this is not a credible solo paper exercise.

Official challenge summary ↗

NASA TechLeap — Robotically Manipulated Payload

Aerospace hardware team · Register 29 Jul · apply 12 Aug

Up to 3 × $500,000

Problem: create a payload that can be manipulated, reconfigured, or serviced by a robotic arm in low Earth orbit, then mature it to flight readiness.

MAL role

Run systems-engineering, robotics, payload, thermal, power, interfaces, reliability, FMEA, schedule, and test-plan lenses; MAL should narrow the concept before expensive hardware begins.

Gate before any work: find an eligible aerospace partner that can own payload fabrication, environmental testing, and flight readiness.

Official NASA challenge ↗

NIH Nutrition Training in Health-Care Education

U.S. accredited institution only · 15 Sep

Up to $2.1M

Problem: identify or develop scalable, evidence-based curricula that integrate nutrition into medical, residency, or nursing education.

MAL role

Map the competency framework, curriculum gaps, pedagogy, clinical integration, implementation barriers, outcome measures, and national scaling plan while preserving expert and learner disagreement.

Gate before any work: an eligible U.S.-based accredited medical, residency, or nursing institution must own the submission.

Official NIH challenge ↗

Portfolio rule: never choose a prize only because the purse is large. Keep one computational target, one real-world environmental or social target, and at most one partner-heavy moonshot. Every entry needs a no-go gate, a fixed budget, an evidence plan, and a reusable artifact even when it loses.
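The portfolio rule can be checked mechanically before any entry is started. The sketch below is one possible reading, not a MAL component: the `Entry` fields, the lane labels, and the interpretation of "keep one" as "exactly one" are assumptions, and the sample entries and budgets are placeholders.

```python
from dataclasses import dataclass

LANES = ("computational", "real_world", "partner_moonshot")


@dataclass
class Entry:
    name: str
    lane: str
    no_go_gate: str = ""
    fixed_budget: float = 0.0
    evidence_plan: str = ""
    reusable_artifact: str = ""


def portfolio_violations(entries: list[Entry]) -> list[str]:
    """Return human-readable rule violations; an empty list means the portfolio passes."""
    problems: list[str] = []
    counts = {lane: 0 for lane in LANES}
    for entry in entries:
        if entry.lane not in counts:
            problems.append(f"{entry.name}: unknown lane {entry.lane!r}")
            continue
        counts[entry.lane] += 1
        if not entry.no_go_gate:
            problems.append(f"{entry.name}: missing no-go gate")
        if entry.fixed_budget <= 0:
            problems.append(f"{entry.name}: missing fixed budget")
        if not entry.evidence_plan:
            problems.append(f"{entry.name}: missing evidence plan")
        if not entry.reusable_artifact:
            problems.append(f"{entry.name}: missing reusable artifact")
    if counts["computational"] != 1:
        problems.append("portfolio needs exactly one computational target")
    if counts["real_world"] != 1:
        problems.append("portfolio needs exactly one real-world target")
    if counts["partner_moonshot"] > 1:
        problems.append("portfolio allows at most one partner-heavy moonshot")
    return problems


# Placeholder entries; names, gates, and budgets are illustrative only.
portfolio = [
    Entry("computational_entry", "computational", "beats baselines?", 500.0,
          "ablation report", "local simulator"),
    Entry("field_entry", "real_world", "local partner confirmed?", 300.0,
          "baseline metric", "pilot plan"),
]
```

Calling `portfolio_violations(portfolio)` on the two placeholder entries returns an empty list; adding a second partner-heavy moonshot or an entry without a no-go gate produces one message per violation.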

Human collaboration

AI-ONLY by default. HUMAN × AI when BD's intuition can unlock the search.

AI-ONLY

The council uses the initial BD seed and completes its workflow without a mid-run human question gate. This remains the clean control for ablation tests.

HUMAN × AI

Models first make their own best attempt, then may submit precise ASK_BD questions. MAL deduplicates them, pauses for a small high-information queue, records answers or skips, accepts unsolicited BD seeds, and carries that memory into later rounds.
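The deduplicate-then-queue step for ASK_BD questions might look like the sketch below. It is a simplified stand-in under stated assumptions: duplicates are detected by normalized text only, "high-information" is approximated by how many models asked the same question, and `normalize`, `build_queue`, and the sample questions are hypothetical.

```python
import re


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for duplicate detection."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", question.lower()).split())


def build_queue(questions: list[tuple[str, str]], limit: int = 3) -> list[dict]:
    """Merge duplicate (model, question) pairs and keep the most-asked questions."""
    merged: dict[str, dict] = {}
    for model, text in questions:
        key = normalize(text)
        item = merged.setdefault(key, {"question": text.strip(), "asked_by": []})
        if model not in item["asked_by"]:
            item["asked_by"].append(model)
    # sorted() is stable, so ties keep their first-asked order.
    ranked = sorted(merged.values(), key=lambda item: -len(item["asked_by"]))
    return ranked[:limit]


# Hypothetical ASK_BD questions from three models.
questions = [
    ("model_a", "What is the latency budget?"),
    ("model_b", "what is the latency budget"),
    ("model_c", "Which data may leave the device?"),
    ("model_a", "Is Python required?"),
]
queue = build_queue(questions, limit=2)
```

A real queue would also record each answer or skip and carry it into later rounds; that bookkeeping is omitted here to keep the deduplication logic visible.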

Evidence boundary: human input is high-priority project intent, not proof. Contribution ledgers and an optional AI-only delta help attribution; tests, sources, arenas, and downstream results still decide.

Communication transport

TEXT by default. Structured JSON when useful.

TEXT

Natural-language council communication remains the default and is best for nuance, exploration, explanation, and artistic work.

JSON

Optional compact stage schemas improve inter-LLM structure, comparison, and local parsing. Valid sidecars feed later rounds; malformed JSON falls back to the preserved raw answer.
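The fallback behavior described above can be sketched with the standard `json` module. The function name and the shape of the returned record are assumptions for illustration; the only behavior taken from the text is that valid JSON is used and malformed JSON falls back to the preserved raw answer.

```python
import json


def parse_sidecar(raw_answer: str) -> dict:
    """Return a record that always preserves the raw answer.

    Valid JSON objects are exposed under "data"; anything else falls back to text.
    """
    try:
        payload = json.loads(raw_answer)
    except json.JSONDecodeError:
        return {"format": "text", "raw": raw_answer, "data": None}
    if not isinstance(payload, dict):
        # A bare string or number is valid JSON but not a usable stage schema.
        return {"format": "text", "raw": raw_answer, "data": None}
    return {"format": "json", "raw": raw_answer, "data": payload}


good = parse_sidecar('{"claim": "cache results", "confidence": 0.7}')
bad = parse_sidecar('{"claim": "cache results", ')
```

Keeping `raw` in both branches means a later round or a human reader never loses the original model output, whichever transport was requested.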

Human boundary: final synthesis is always normal readable text. JSON is an intermediate transport, not the public deliverable and not a guarantee of lower token cost.

Benchmark doctrine

Trading v1.4 is the proof test.

The benchmark freezes the initial prompt-builder seed, required source files, and final GPT‑5.5 Pro builder, then varies only the prompt-generation workflow.

Arm	Workflow	Meaning
A	Direct IPB baseline	Calibrated null: does orchestration help at all?
B	MAL E1 Flash Council	One multi‑LLM round.
C	MAL E2 Standard	Current structured MAL workflow.
D	MAL E3 Roundtable Light	Single-cycle multi‑LLM brainstorming.
E	MAL E4 Synthesis Cascade Deep	Premium deep engine, cost-gated.
F	Manual RHPr/RHP Gold Reference	Expert upper bound, not a same-cost competitor.

See the v0.98/v0.6 benchmark page.
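The doctrine of freezing everything except the workflow can be enforced with a simple equality check across arms. The sketch below assumes hypothetical `FrozenInputs` and `Run` records and placeholder hash strings; it is not the benchmark's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenInputs:
    seed_hash: str
    source_hashes: tuple[str, ...]
    builder_model: str


@dataclass(frozen=True)
class Run:
    arm: str  # "A" through "F", as in the arms table
    inputs: FrozenInputs


def drifted_arms(runs: list[Run]) -> list[str]:
    """Return arms whose frozen inputs differ from the first run's inputs."""
    if not runs:
        return []
    reference = runs[0].inputs
    return [run.arm for run in runs if run.inputs != reference]


# Placeholder hashes and builder name, for illustration only.
frozen = FrozenInputs("seed-hash", ("src-1", "src-2"), "builder-model")
changed = FrozenInputs("seed-hash", ("src-1", "src-2-edited"), "builder-model")
runs = [Run("A", frozen), Run("B", frozen), Run("C", changed)]
drift = drifted_arms(runs)
```

Any arm reported by `drifted_arms` should be rerun or excluded, because its result no longer isolates the effect of the prompt-generation workflow.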

Evidence hygiene

v0.98 hardens the benchmark.

Manifest and hashes

Benchmark manifest records IPB/source/prompt/candidate hashes, anonymized candidate map, run order, builder model, and timestamps.
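A manifest of this kind can be built with the standard `hashlib` module. The sketch below is a minimal illustration, assuming SHA-256 as the hash and a flat dictionary layout; the field names, artifact names, and sample bytes are hypothetical, and the anonymized candidate map is omitted.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes], builder_model: str,
                   run_order: list[str], timestamp: str) -> dict:
    """Record one hash per artifact plus the run metadata."""
    return {
        "hashes": {name: sha256_hex(artifacts[name]) for name in sorted(artifacts)},
        "builder_model": builder_model,
        "run_order": list(run_order),
        "timestamp": timestamp,
    }


def changed_artifacts(manifest: dict, artifacts: dict[str, bytes]) -> list[str]:
    """Return names whose current hash is missing or differs from the manifest."""
    recorded = manifest["hashes"]
    return sorted(
        name for name in recorded
        if name not in artifacts or sha256_hex(artifacts[name]) != recorded[name]
    )


# Placeholder artifacts; real runs would read file bytes from disk.
artifacts = {"ipb": b"initial prompt-builder seed", "source": b"source bundle"}
manifest = build_manifest(artifacts, "builder-model", ["B", "A", "C"],
                          "2026-06-19T00:00:00Z")
tampered = dict(artifacts, ipb=b"edited seed")
```

Here `changed_artifacts(manifest, artifacts)` is empty, while the tampered copy reports `["ipb"]`, which is the signal that a frozen input moved after the manifest was written.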

Cost/benefit ledger

Tracks LLM calls, token estimates, wall time, BD effort, prompt score, code score, evidence score, and score-per-cost.
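A ledger row and its score-per-cost figure could be represented as below. This is a sketch under assumptions: the page lists the tracked fields but not how they combine, so averaging the three scores and dividing by a single dollar cost are illustrative choices, and all sample numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass
class LedgerRow:
    arm: str
    llm_calls: int
    token_estimate: int
    wall_time_s: float
    bd_effort_min: float
    prompt_score: float
    code_score: float
    evidence_score: float
    cost_usd: float  # assumed single cost figure; not defined in the source

    def total_score(self) -> float:
        """Unweighted mean of the three scores (an assumed aggregation)."""
        return (self.prompt_score + self.code_score + self.evidence_score) / 3

    def score_per_cost(self) -> float:
        """Score points per unit of cost; rejects non-positive cost."""
        if self.cost_usd <= 0:
            raise ValueError("cost must be positive")
        return self.total_score() / self.cost_usd


# Placeholder values for one arm.
row = LedgerRow(arm="C", llm_calls=12, token_estimate=90_000, wall_time_s=840.0,
                bd_effort_min=10.0, prompt_score=60.0, code_score=70.0,
                evidence_score=80.0, cost_usd=3.5)
```

With these placeholder values the mean score is 70 and the score-per-cost is 20; comparing that figure across arms A to F shows whether extra orchestration is earning its cost.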

No-promotion gates

Controls, event ledger, force-close-end PnL, open-tail report, and strategy fidelity are required before promotion.
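The no-promotion rule is an all-or-nothing check over the five required items. In the sketch below the gate identifiers are hypothetical spellings of the items named above, and representing each gate as a boolean is a simplifying assumption.

```python
REQUIRED_GATES = (
    "controls",
    "event_ledger",
    "force_close_end_pnl",
    "open_tail_report",
    "strategy_fidelity",
)


def promotion_decision(evidence: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (promote, missing_gates); promotion requires every gate to be present and true."""
    missing = [gate for gate in REQUIRED_GATES if not evidence.get(gate, False)]
    return (not missing, missing)


# Hypothetical candidate with two gates still open.
candidate = {
    "controls": True,
    "event_ledger": True,
    "force_close_end_pnl": True,
    "open_tail_report": False,
}
promote, missing = promotion_decision(candidate)
```

A gate that is absent from the evidence is treated the same as a failed gate, so a candidate cannot be promoted by simply omitting a report.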

External eval export

Exports promptfoo-like and candidate evaluation matrices for later comparison with standard eval tooling.

Public/private boundary

This public page contains no API-key UI.

Local-only features such as provider API-key entry, provider status, chat logs, and local run controls belong only inside local_lab/AIM3_MentalArena_Lab.html, served from localhost. They are intentionally absent from this public CRP page.

Context, comparison, and destination

MAL is a bridge—not the final architecture.

MAL should be judged against both simpler single-model workflows and the strongest external multi-agent research. It should also be understood correctly inside this project: MAL is an external reasoning laboratory and a stepping stone toward the proposed AI8 / Reasojjng architecture, not the endpoint itself.

MAL's strongest advantages and honest weaknesses

Where MAL can be stronger

Multi-provider diversity rather than one model family.
Independent first attempts before models influence one another.
Preserved dissent and an explicit test that can resolve it.
Human-seed amplification plus selective high-information questions.
Builder handoff, cost ledger, lineage, controls, and external evidence gates.
One workflow that can move from software to science, policy, nature, or art.

Where MAL is currently weaker

More latency, token cost, orchestration work, and failure surfaces than a strong single model.
Correlated models can repeat the same error and create false confidence.
Coordination overhead can hurt strictly sequential tasks, where each step depends on the one before it.
Tool use, retrieval, long-horizon execution, and production reliability are less mature than leading deployed agent systems.
Its memory is external workflow memory, not yet integrated learning inside one persistent architecture.
MAL does not itself guarantee truth, safety, legal compliance, or useful real-world impact.

MAL compared with the wider landscape

Reference	What it has over MAL today	What MAL is trying to add	Lesson for us
One strong model / simple workflow	Lower cost, lower latency, clearer state, and often better coherence on sequential work.	Diversity, dissent, human questions, ablation arms, and an evidence court when one answer is not enough.	E1 or one model must remain the default null. Complexity has to earn promotion.
Google Research — scaling agent systems	Controlled evidence across 180 configurations and task-dependent scaling principles.	A practical multi-provider laboratory that routes modes, includes the human seed, and carries results into build-and-test loops.	Parallelisable problems favour councils; sequential problems may need one lead agent and a later audit.
Google AI Co-Scientist	Science-specific hypothesis generation, ranking, and research collaboration built around a powerful unified model ecosystem.	Broader domain coverage, explicit cross-provider disagreement, selective ASK_BD / ASK_EXPERT questions, and downstream builder/evidence stages.	For science, MAL needs stronger literature provenance, experiment tooling, and domain-expert gates.
Anthropic multi-agent research system	Mature parallel web/tool research, production infrastructure, delegation, and context engineering.	Provider diversity, blind workflow comparison, contribution ledgers, preserved minority views, and human-seeded system invention.	We need better tool reliability, budget routing, recovery, and long-running execution—not simply more model calls.
NIST AI RMF	A cross-sector risk-management framework and common trustworthiness language.	An operational workflow that can turn risk questions into roles, evidence requests, tests, owners, and release gates.	MAL can help implement a risk process; it cannot certify itself as trustworthy.
EU AI Act	Binding legal obligations, governance, transparency, and enforcement in the EU.	Traceable decisions, source and claim ledgers, human gates, and compliance evidence packs.	Workflow traceability supports compliance but does not replace legal classification or counsel.
WHO AI-health guidance	Health-specific ethics, governance, human responsibility, and stakeholder protections.	Multidisciplinary deliberation, explicit missing evidence, and expert-question routing.	Clinical use must remain expert-led, validated, and bounded; MAL is decision support, not an autonomous clinician.
OECD AI in government	Cross-country evidence on public-sector adoption, enabling conditions, guardrails, and engagement.	A repeatable policy council that keeps stakeholder conflict, implementation, auditability, and local human input visible.	Public legitimacy and accountable authority cannot be delegated to an LLM council.

The bridge from MAL to AI8 / Reasojjng

MAL today

External collective cognition

Several LLMs, a human, source files, shared memory, builders, tests, and evidence are coordinated as separate components.

→

ARC‑AGI‑3 bridge

Interactive competence

The agent must actively explore, build a world model, infer or select goals, plan, execute, and course-correct under feedback.

→

AI8 / Reasojjng proposal

Integrated persistent architecture

Internalise specialised reasoning, DCC-like routing, durable memory, active perception, goal management, action, self-critique, skill accumulation, and a human value interface.

The intended direction: MAL is the scaffold on which we can discover which cognitive roles, conflict mechanisms, memory structures, evidence gates, and human interactions are actually useful. ARC‑AGI‑3 is especially important because it pressures the system beyond static answers into exploration, modeling, goal-setting, planning, execution, and correction. The long-term AI8 / Reasojjng claim remains a research proposal until an integrated system is built and independently tested.

Official context sources

These sources do not prove MAL's superiority. They define the external evidence, engineering, and governance landscape against which MAL should be measured.