
OpenMM-Arena

A Comprehensive Compendium for Multimodal Artificial Intelligence Research


The definitive, taxonomically organized knowledge base that unifies seven converging research frontiers of multimodal AI — spanning 2000+ papers, 46 arena-ranked models, 100+ benchmarks, and 90+ datasets — providing researchers with a single, authoritative cross-domain reference.

Explore the Website · Arena Leaderboard · Star on GitHub


If you find this resource useful, please give us a star to help others discover it!


Overview

The field of multimodal artificial intelligence has witnessed an unprecedented convergence of historically independent research traditions. OpenMM-Arena serves as the centralized knowledge base that organizes, maps, bridges, and catalogues this expansive landscape across seven research pillars:

                              ┌──────────────────────────────────────┐
                              │         OpenMM-Arena                 │
                              │   Multimodal AI Research Compendium  │
                              └──────────────┬───────────────────────┘
                                             │
              ┌──────────┬──────────┬────────┼────────┬──────────┬──────────┐
              ▼          ▼          ▼        ▼        ▼          ▼          ▼
         ┌─────────┐┌─────────┐┌────────┐┌──────┐┌──────┐┌──────────┐┌────────┐
         │  T2I    ││  T2V    ││  I2V   ││  3D  ││  4D  ││ Unified  ││ World  │
         │ 250+    ││ 200+    ││ 170+   ││ 500+ ││ 500+ ││ 120+     ││ 200+   │
         │ papers  ││ papers  ││ papers ││papers││papers││ models   ││ papers │
         └────┬────┘└────┬────┘└───┬────┘└──┬───┘└──┬───┘└────┬─────┘└───┬────┘
              │          │         │        │       │         │          │
              ▼          ▼         ▼        ▼       ▼         ▼          ▼
          Models     Foundation  Animation  3DGS   Depth   Diffusion   Theory
          Control    Controllable Editing   NeRF   Tracking  AR        Games
          Editing    Benchmarks  Portraits  SLAM   Recon    Hybrid     Driving
          Safety     Datasets    Transfer   LLM-3D Dynamic  Any2Any   Embodied
          Arena                             Robot  Human    Eval       Sim
          Benchmarks                        Nav    Physics  Datasets   MBRL

2000+ Papers · 7 Research Pillars · 46 Arena-Ranked Models · 100+ Benchmarks · 90+ Datasets · 11 Source Repositories


Seven Research Pillars

Text-to-Image (T2I)

250+ papers · 46 arena-ranked models · 6 sub-domains

The progressive evolution of text-conditioned visual synthesis — from GANs through diffusion models to autoregressive transformers. Covers foundational models (FLUX.2, Seedream, Imagen 3), controllable generation (ControlNet, GLIGEN), text-guided editing (InstructPix2Pix, DreamBooth), safety & bias, and comprehensive arena leaderboards with 3.9M+ human votes.

Sub-domains: Models & Face Synthesis · Control & Composition · Editing & Personalization · Safety & Applications · Cross-Modal Extensions · Arena & Benchmarks


Text-to-Video (T2V)

200+ papers · 25+ datasets · 3 sub-domains

Foundation text-to-video models from GANs to diffusion transformers — covering Sora, Wan 2.1, Veo 2, CogVideoX, and the latest controllable and efficient video synthesis approaches. Includes long video generation, temporal coherence, and large-scale video-text datasets.

Sub-domains: Foundation T2V Models · Controllable Synthesis · Benchmarks & Datasets


Image-to-Video (I2V)

170+ papers · 20+ applications · 2 sub-domains

Image animation, character-driven video synthesis, talking head generation, video editing, motion transfer, audio synthesis for video, and enhancement techniques. Bridges static imagery with temporal dynamics through learned motion priors. (Covered within the T2V deep-dive document.)

Sub-domains: Animation & Portraits · Video Editing


3D Vision

500+ papers · 6 sub-domains

3D Gaussian Splatting, Neural Radiance Fields (NeRF), text/image-to-3D generation, LLM-3D understanding, NeRF-SLAM, GS-SLAM, visual/LiDAR SLAM, robotics manipulation, navigation, and localization. Extends generative AI into the spatial domain.

Sub-domains: 3D Gaussian Splatting · NeRF · NeRF-SLAM · GS-SLAM · LLM-3D · Robotics


4D Spatial Intelligence

500+ papers · 6 sub-domains

The temporal dimension of 3D understanding — depth & camera pose estimation, dense 3D/4D point tracking, scene reconstruction, 4D dynamic scenes via deformable NeRFs and 4DGS, human-centric motion capture, and physics-based simulation.

Sub-domains: Geometry & Depth · 3D/4D Tracking · Reconstruction · Dynamic Scenes · Human-Centric · Physics-Based


Unified Multimodal Models

120+ models · 30+ benchmarks · 3 sub-domains

Architectures that jointly understand and generate across modalities — diffusion-based unified models, autoregressive MLLMs with pixel & semantic encoding, hybrid AR-diffusion architectures, and any-to-any systems. Plus evaluation benchmarks and training corpora.

Sub-domains: Models & Architectures · Datasets & Training Corpora · Evaluation & Benchmarks


World Models

200+ papers · 6 theory themes · 6 sub-domains

Structured environment dynamics for game simulation, autonomous driving, embodied manipulation, model-based RL, video generation as world simulation, and theoretical foundations. From Ha & Schmidhuber to Sora, Cosmos, and Genie.

Sub-domains: Theory & Surveys · Game Simulation · Video Generation · LiDAR Generation · Occupancy Generation · Embodied AI


Arena Leaderboard Spotlight

Source: LM Arena · 3.9M+ human preference votes · 46 models · Updated February 2026

| Rank | Model | Organization | Elo Score | License |
|------|-------|--------------|-----------|---------|
| 1 | GPT-Image-1.5 | OpenAI | 1248 | Proprietary |
| 2 | Gemini-3-Pro Image 2K | Google | 1237 | Proprietary |
| 3 | Gemini-3-Pro Image | Google | 1233 | Proprietary |
| 4 | Grok Imagine Image | xAI | 1174 | Proprietary |
| 5 | FLUX.2 Max | Black Forest Labs | 1169 | Proprietary |
| 6 | Grok Imagine Image Pro | xAI | 1166 | Proprietary |
| 7 | FLUX.2 Flex | Black Forest Labs | 1158 | Proprietary |
| 8 | Gemini 2.5 Flash Image | Google | 1157 | Proprietary |
| 9 | FLUX.2 Pro | Black Forest Labs | 1156 | Proprietary |
| 10 | HunyuanImage 3.0 | Tencent | 1151 | Community |

View Full Leaderboard (46 Models) →

Key Insights

| Observation | Detail |
|-------------|--------|
| Proprietary Dominance | Ranks 1-4 held by OpenAI, Google, and xAI; 9 of the top 10 are proprietary |
| Rapid Progress | GPT-Image-1.5 (1248) vs. GPT-Image-1 (1115): a 133-point Elo gain within one model family |
| Open-Weight Closing Gap | Qwen-Image (Apache 2.0, Elo 1139) at rank 15; FLUX.2 Klein 4B (Apache 2.0, Elo 1021) |
| Chinese Model Rise | Alibaba, ByteDance, and Tencent occupy multiple top-20 positions |
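
How do millions of pairwise votes become the Elo scores above? LM Arena fits a statistical model (Bradley-Terry style) over the full vote set; the sketch below is a simplified sequential Elo update run on made-up votes and hypothetical model names, included purely for intuition rather than as the arena's actual scoring code.

```python
# Illustrative only: online Elo updates from pairwise preference votes.
# LM Arena's published scores come from a Bradley-Terry-style fit over
# all votes at once; this shows the simpler sequential variant.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one A-vs-B human preference vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical vote stream: (model_a, model_b, winner).
votes = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", "model_x"),
    ("model_y", "model_z", "model_z"),
]

ratings = {"model_x": 1000.0, "model_y": 1000.0, "model_z": 1000.0}
for a, b, winner in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], winner == a)

# Rank models by rating, highest first.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```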

Documentation Structure

OpenMM-Arena/
├── README.md                    ← You are here (main overview)
├── docs/
│   ├── T2I.md                   ← Text-to-Image deep dive
│   ├── T2V.md                   ← Text-to-Video deep dive
│   ├── 3D_VISION.md             ← 3D Vision deep dive
│   ├── 4D_VISION.md             ← 4D Spatial Intelligence deep dive
│   ├── UNIFIED.md               ← Unified Multimodal Models deep dive
│   ├── WORLD_MODELS.md          ← World Models deep dive
│   ├── ARENA.md                 ← Arena Leaderboard & Rankings
│   │
│   ├── index.html               ← Website landing page
│   ├── style.css                ← Design system
│   ├── script.js                ← Interactive features
│   ├── t2i.html / t2v.html ... ← Pillar hub pages
│   ├── t2i/ t2v/ i2v/ ...      ← Sub-domain pages
│   └── logo.png                 ← Project logo
└── logo.png

| Document | Content | Papers |
|----------|---------|--------|
| T2I.md | Foundational models, face synthesis, controllable generation, editing, safety, cross-modal extensions | 250+ |
| T2V.md | Foundation T2V, controllable synthesis, I2V animation, video editing, benchmarks | 370+ |
| 3D_VISION.md | 3DGS, NeRF, SLAM, LLM-3D understanding, robotics | 500+ |
| 4D_VISION.md | Depth estimation, tracking, reconstruction, dynamic scenes, human motion, physics | 500+ |
| UNIFIED.md | Diffusion, autoregressive, hybrid, any-to-any unified models, evaluation | 120+ |
| WORLD_MODELS.md | Theory, game sim, driving, video gen, embodied AI, MBRL | 200+ |
| ARENA.md | LM Arena & Artificial Analysis leaderboards, 46 ranked models, metrics | 46 models |

Benchmarks & Metrics

Evaluation Benchmarks

| Category | Key Benchmarks |
|----------|----------------|
| Multimodal Understanding | General-Bench, MMMU, MM-Vet v2, MMBench, SEED-Bench-2, GQA |
| Image Generation | GenExam, KRIS-Bench, WISE, DreamBench++, T2I-CompBench++, GenAI-Bench, TIFA, HEIM |
| World Models | WorldScore, WorldSimBench, PhyWorld, Newton, WorldGym, EWMBench |
| Interleaved Generation | UniBench, OpenING, ISG, MMIE |
| Arena Rankings | LM Arena T2I (3.9M votes), Artificial Analysis T2I (15 categories) |

Quantitative Metrics

| Metric | Full Name | Primary Domain |
|--------|-----------|----------------|
| FID | Fréchet Inception Distance | Image Generation |
| CLIP Score | CLIP-based Alignment Score | T2I Alignment |
| VQAScore | VQA-based Score | T2I Faithfulness |
| HPSv2 | Human Preference Score v2 | Human Preference |
| ImageReward | Image Reward | Human Preference |
| LPIPS | Learned Perceptual Image Patch Similarity | Image Quality |
| TIFA | Text-to-Image Faithfulness Assessment | Faithfulness |
| DSG | Davidsonian Scene Graph | Compositionality |
| SSIM | Structural Similarity Index | Image Quality |
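
Several of these metrics are available off the shelf. Below is a minimal sketch of computing FID and CLIP Score, assuming the torchmetrics library (with its image and multimodal extras) and random tensors standing in for real generations; it illustrates the metric APIs only and is not the evaluation pipeline of any specific benchmark above.

```python
# Sketch assuming: pip install "torchmetrics[image,multimodal]" torch
# Random uint8 tensors stand in for real and generated image batches.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID: distance between Inception feature statistics of the real set
# and the generated set (lower is better).
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# CLIP Score: scaled cosine similarity between image and prompt
# embeddings, measuring text-image alignment (higher is better).
# Note: this downloads CLIP weights on first use.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP Score:", clip(fake, ["a photo of a cat"] * 8).item())
```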

Datasets

90+ datasets across all modalities

| Category | Key Datasets | Scale |
|----------|--------------|-------|
| Multimodal Understanding | LAION-5B, DataComp, Infinity-MM, Cambrian-10M | 5.9B → 10M |
| Text-to-Image | LAION-Aesthetics, PixelProse, PD12M, CC-12M, SAM | 120M → 11M |
| Image Editing | ByteMorph-6M, UltraEdit, AnyEdit, ImgEdit | 6M → 1.2M |
| Interleaved | OmniCorpus (8B), OBELICS (141M), Multimodal C4 (101M) | 8B → 101M |
| Video-Text | WebVid-10M, InternVid, HD-VILA-100M, Panda-70M | 100M → 10M |

Source Repositories

OpenMM-Arena systematically consolidates knowledge from 11 foundational open-source repositories:

| Repository | Domain | Maintainer |
|------------|--------|------------|
| Awesome-Text-to-Image | Text-to-Image | Yutong Zhou et al. |
| Awesome-T2V-Generation | Text-to-Video | soraw-ai |
| Awesome-Video-Diffusion | Video Diffusion | ShowLab |
| Awesome-3DGS | 3D Gaussian Splatting | MrNeRF |
| Awesome-3DReconstruction | 3D Reconstruction | OpenMVG |
| Awesome-LLM-3D | LLM-3D | ActiveVisionLab |
| Awesome-3D-Vision | 3D Vision | Hardy-Uint |
| Awesome-NeRF-3DGS-SLAM | NeRF & 3DGS SLAM | 3D-Vision-World |
| Awesome-4D-SI | 4D Spatial Intelligence | Yukang Cao |
| Awesome-World-Models | World Models | Siqiao Huang et al. |
| Awesome-Unified-MM | Unified Multimodal | AIDC-AI / Zhang et al. |

Features

| Feature | Description |
|---------|-------------|
| 2000+ Papers | The most comprehensive multimodal AI paper collection, meticulously organized with citations and venues |
| Taxonomic Organization | Navigate from high-level pillars through sub-domains to individual papers |
| Arena Leaderboards | Real-time rankings from 3.9M+ human preference votes with Elo scores |
| Smart Search & Filter | Real-time search, sortable tables, year-based filtering, keyboard shortcuts |
| Dark Mode | Full dark mode support with carefully designed color scheme |
| Responsive Design | Optimized for desktop, tablet, and mobile browsing |
| Continuously Updated | New papers from CVPR, ICLR, NeurIPS, ECCV, and arXiv added regularly |
| Open Source | MIT licensed, community-driven, contributions welcome |

Quick Start

Browse online: Visit openenvision-lab.github.io/OpenMM-Arena

Run locally:

git clone https://github.com/OpenEnvision-Lab/OpenMM-Arena.git
cd OpenMM-Arena/docs
# Open index.html in your browser, or:
python -m http.server 8000
# Then visit http://localhost:8000

Keyboard shortcuts (on the website):

| Key | Action |
|-----|--------|
| Cmd/Ctrl + K | Search papers |
| D | Toggle dark mode |
| [ | Toggle sidebar |
| ? | Show all shortcuts |

Contributing

We welcome contributions from the research community! You can help by:

  • Adding papers: Submit a PR with new papers in any of the 7 pillars
  • Updating leaderboards: Help keep arena rankings current
  • Fixing errors: Report or fix citation errors, broken links, or categorization issues
  • Suggesting improvements: Open an issue with ideas for better organization
  • Sharing: Star the repo, share on Twitter/X, Reddit, or WeChat to help others discover this resource

How to Spread the Word

If you find OpenMM-Arena useful, consider:

  1. Star this repository to boost its visibility on GitHub Trending
  2. Share on social media — mention @OpenEnvision and use #OpenMM_Arena #MultimodalAI
  3. Cite in your papers — see the Citation section below
  4. Post on research forums — Reddit r/MachineLearning, Hacker News, Papers With Code
  5. Add to your awesome list — if you maintain a related list, link to us

Recommended GitHub Topics

If you maintain this repo, please add these topics in the GitHub repository settings for maximum discoverability:

multimodal-ai text-to-image text-to-video image-to-video 3d-vision 4d-vision world-models diffusion-models generative-ai computer-vision deep-learning awesome-list neural-radiance-fields gaussian-splatting research-papers benchmark leaderboard arena survey paper-collection


Citation

If you find OpenMM-Arena useful for your research, please consider citing:

@misc{openmmarena2025,
  title        = {OpenMM-Arena: A Comprehensive Compendium for Multimodal AI Research},
  author       = {{OpenMM-Arena Contributors}},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/OpenEnvision-Lab/OpenMM-Arena}
}

To cite the foundational works this compendium builds on:

@inproceedings{zhou2023vision,
  title     = {Vision + Language Applications: A Survey},
  author    = {Zhou, Yutong and Shimada, Nobutaka},
  booktitle = {CVPRW},
  year      = {2023}
}

@misc{huang2025awesomeworldmodels,
  title  = {Awesome-World-Models},
  author = {Siqiao Huang},
  year   = {2025},
  url    = {https://github.com/knightnemo/Awesome-World-Models}
}

@article{zhang2025unified,
  title   = {Unified Multimodal Understanding and Generation Models},
  author  = {Zhang, Xinjie and others},
  journal = {arXiv preprint arXiv:2505.02567},
  year    = {2025}
}

OpenMM-Arena is a project of OpenEnvision Lab

Curated with scholarly rigor for the multimodal AI research community

Last updated: February 2026

Website · GitHub · Issues