Detecting AI-Generated Video

A Vision-Language Dual-View Survey for factual fidelity verification

Dylan Xinming Hou1, Juntian Zhang2, Xu Gu2, Yichen Wu3, Nils Lukas1, Gus Xia1, Xiuying Chen1, Yuhan Liu1

1 MBZUAI
2 Renmin University of China
3 Harvard University

† Corresponding author: Yuhan Liu · yuhan.liu@mbzuai.ac.ae

As video generators move from localized edits to fully synthetic, cinematic scenes, artifact-centric detection is no longer enough. This survey reframes AIGC-V detection as factual fidelity verification: whether the events, entities, and physical processes depicted in a video remain consistent with the real world.

Core Framing

From artifact matching to evidence tracing.

Modern AI-generated video cannot be understood through frame artifacts alone. Detection must combine low-level intrinsic cues, spatiotemporal consistency, cross-modal consistency, and language-guided world-level reasoning.

Problem

Factual Fidelity Verification

The task is not only to label a clip as fake, but to identify which claim, event, entity, or temporal segment breaks consistency with reality.

Taxonomy

Vision-Language Dual-View

The visual side covers intrinsic cues and spatiotemporal consistency inside the clip, while the language side covers cross-modal consistency and language-guided world-level reasoning.

Deployment

Evidence-First Detection

Trustworthy systems should identify, localize, and explain with structured evidence instead of collapsing everything into a single score.

AIGC-V detection overview figure

The survey links generation settings, dual-view evidence pathways, and final verification outputs in a single end-to-end detection framing.

Generation Side

Three paradigms reshape where the evidence lives.

The survey organizes AI-generated video into three broad paradigms. Each one changes which cues are most useful, and therefore which detection strategies remain reliable.

Overview of three AIGC-V paradigms

The evidence profile shifts as generation moves from real-carrier manipulation to fully synthetic video synthesis.

LMV

Local Manipulation

Real footage is preserved while localized regions or attributes are re-rendered. The strongest evidence is often spatially concentrated and residual.

  • Best matched with intrinsic visual cues and transfer-oriented forensic tests.
  • Representative threat: face swaps, partial attribute editing, localized tampering.

AVE

Audio-Visual Editing

Speech drives the visual stream, making synchrony, speaker consistency, and speech-conditioned motion more decisive than generic image artifacts.

  • Cross-modal alignment and localization become central.
  • Representative threat: dubbing, talking-head generation, lip-sync editing.
GVS

Generative Video Synthesis

No authentic carrier remains. Detection shifts toward long-range coherence, world plausibility, provenance, and language-guided verification.

  • Motivates stronger Layer 2 to Layer 4 reasoning.
  • Representative threat: text-to-video, prompt-conditioned open-domain synthesis.

Detection Side

The dual-view, four-layer taxonomy.

The taxonomy is organized by the dominant evidence pathway used for decision-making, not by a rigid task label. It moves from low-level perception to high-level cognition while keeping the boundary between layers operational.

Dual-view four-layer taxonomy figure

Visual view: Layers 1-2. Language view: Layers 3-4. The boundary depends on what evidence the detector actually needs.

Visual View

Perceptual evidence inside the video

Focuses on the visual modality and emphasizes statistical differences between AIGC-V and real videos. The pathway extends from frame-level intrinsic cues to spatiotemporal consistency across frames, forming the perceptual perspective for factual-fidelity verification.

L1

Intrinsic Cue Analysis

Tests whether low-level visual signals still follow the statistical regularities of real videos, including frequency fingerprints, local geometry, physiological traces, and residual synthesis artifacts.

Boundary L1-2: Layer 1 focuses on frame-level intrinsic distributional cues.
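As a concrete illustration of the frequency-fingerprint cue, here is a minimal NumPy sketch (the bin count and high-frequency cutoff are arbitrary illustrative choices, not values from any cited detector) that scores how much of a frame's spectral energy sits in the high-frequency tail, where the periodic upsampling fingerprints of many generators tend to appear:

```python
import numpy as np

def radial_power_spectrum(frame: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Azimuthally averaged power spectrum of a grayscale frame.

    A classic Layer 1 intrinsic cue: many generators leave periodic
    upsampling fingerprints that show up as excess energy or bumps
    in the high-frequency tail of this curve.
    """
    f = np.fft.fftshift(np.fft.fft2(frame))
    power = np.abs(f) ** 2
    h, w = frame.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.linspace(0, r.max(), n_bins + 1)
    idx = (np.digitize(r.ravel(), bins) - 1).clip(0, n_bins - 1)
    energy = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return energy / np.maximum(counts, 1)

def high_freq_ratio(frame: np.ndarray, cutoff: float = 0.75) -> float:
    """Fraction of spectral energy above `cutoff` of the radial range."""
    s = radial_power_spectrum(frame)
    return float(s[int(cutoff * len(s)):].sum() / s.sum())

# Toy usage: pure noise has a flat spectrum; a smooth gradient is low-frequency.
rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
assert high_freq_ratio(noise) > high_freq_ratio(smooth)
```

A real detector would of course learn from such spectra rather than threshold a single ratio; the sketch only shows where the signal lives.
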

L2

Spatiotemporal Consistency

Models frame-sequence relations to test whether motion, behavior, and physical transitions unfold plausibly over time, as they would in a real recorded video.

Boundary L1-2 / L2-3: Layer 2 moves beyond frames to explicitly model spatiotemporal connections across the frame sequence, but it still stays within the visual modality and evaluates within-visual coherence.
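A minimal sketch of the Layer 2 intuition, assuming grayscale frame stacks of shape (T, H, W); the second-order temporal difference used here is one illustrative consistency statistic, not the survey's prescribed detector:

```python
import numpy as np

def temporal_jerk_score(frames: np.ndarray) -> float:
    """Mean magnitude of second-order temporal differences.

    Real recordings tend to change smoothly frame to frame, while some
    generated clips flicker, which inflates this second-difference
    statistic. Input: a (T, H, W) grayscale stack.
    """
    accel = np.diff(frames.astype(float), n=2, axis=0)  # f[t+2] - 2 f[t+1] + f[t]
    return float(np.abs(accel).mean())

# Toy usage: a constant-velocity brightness ramp vs. the same ramp with flicker.
t = np.linspace(0, 1, 16)[:, None, None]
base = np.tile(np.linspace(0, 1, 32), (32, 1))[None]
smooth_clip = base + t                                        # linear in time
flicker_clip = base + t + 0.3 * (np.arange(16)[:, None, None] % 2)
assert temporal_jerk_score(flicker_clip) > temporal_jerk_score(smooth_clip)
```

Methods in this layer replace the hand-rolled statistic with learned spatiotemporal features, but the test they run is the same in spirit.
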

Language View

Grounded verification across semantics and world knowledge

Defines a grounded verification pathway in which audio, speech, text, and visual evidence are first checked for within-video consistency, then extended to factual and knowledge-level verification when external evidence is required.

L3

Cross-Modal Consistency

Performs within-video multimodal verification across speech, text, and visuals, from lip-speech synchrony to voice-face identity coherence and speech-text agreement.

Boundary L2-3 / L3-4: Layer 3 goes beyond vision and relies on within-video cross-modal agreement, judging whether different modalities are mutually consistent in describing the same semantics.
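The synchrony side of Layer 3 can be sketched as per-timestep cosine agreement between temporally aligned audio and visual embedding streams; the encoders that would produce these embeddings are an assumption here, not a specific cited model:

```python
import numpy as np

def av_sync_score(audio_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Mean per-timestep cosine similarity between aligned (T, D)
    audio and visual embedding sequences, assuming both encoders
    project into a shared space.

    Genuine talking heads score high; dubbed or re-synced fakes drift,
    so a low score flags a Layer 3 cross-modal inconsistency.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return float((a * v).sum(axis=1).mean())

# Toy usage: identical streams are perfectly in sync; a circular shift of the
# visual stream simulates desynchronized lip motion.
rng = np.random.default_rng(1)
audio = rng.standard_normal((20, 8))
assert av_sync_score(audio, audio) > av_sync_score(audio, np.roll(audio, 5, axis=0))
```

Localization variants of this idea score a sliding window instead of the whole clip, which is what enables timestamped forgery segments.
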

L4

Language-Guided World-Level Reasoning

Moves beyond within-video agreement to verify implied claims against external facts, commonsense, physical rules, and explicit evidence chains.

Boundary L3-4: Layer 4 further requires external facts, commonsense, or physics for reasoning-based verification, even when the modalities appear mutually consistent.
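A hypothetical sketch of the Layer 4 decision rule: decompose the video into checkable claims, attach evidence, and derive the clip-level verdict from per-claim verdicts, abstaining rather than guessing when nothing is verifiable. All class names and labels are illustrative, not an existing system's API:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One checkable proposition implied by the video (illustrative schema)."""
    text: str
    verdict: str                    # "supported" | "refuted" | "unverifiable"
    evidence: list = field(default_factory=list)  # retrieved snippets, tool outputs

def clip_verdict(claims: list) -> str:
    """Layer 4 sketch: the clip-level call follows from per-claim verdicts."""
    if any(c.verdict == "refuted" for c in claims):
        return "fake"
    if all(c.verdict == "unverifiable" for c in claims):
        return "abstain"
    return "consistent-with-evidence"

claims = [
    Claim("The depicted eruption happened at this location on this date.",
          "refuted", ["news-archive snippet contradicting the date"]),
    Claim("Lava flows downhill under gravity.", "supported", ["physics prior"]),
]
assert clip_verdict(claims) == "fake"
```

The value of this shape is that each claim and its evidence stay individually inspectable, which is exactly the evidence-first output format the survey argues for.
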

Operational boundary: evidence arrives through two complementary pathways. The visual view tests intrinsic and temporal plausibility, while the language view moves from within-video cross-modal verification to explicit claim checking and world-level reasoning; the split is decided by the detector's dominant evidence pathway.

Methods

A field shifting toward the language view.

The landscape figure highlights the transition from artifact matching in traditional deepfake detection to evidence-based semantic verification enabled by vision-language models and agentic pipelines.

Landscape of representative AIGC-V detection methods
Representative AIGC-V detection methods aligned with the four-layer taxonomy.
  • Layer 1: remains foundational, but increasingly supplies only the first signal rather than the whole detector. Pixel and geometric artifacts, physiology, robustness.
  • Layer 2: becomes the visual center of gravity as single-frame artifacts grow less reliable. Motion inconsistency, physics, human behavior.
  • Layer 3: matures into a mainstream verification pathway. Audio-visual agreement, text-video reasoning, localization.
  • Layer 4: rises quickly as high-realism videos demand explicit, evidence-first, explainable checking. Prompts, tools, retrieval, preference and reward modeling.

These cards summarize the broader migration pattern across layers, while the table below anchors that shift with the survey's complete-year counts through 2025.

| Year | L1 | L2 | L3 | L4 | Visual (L1+L2) | Language (L3+L4) |
|------|----|----|----|----|----------------|------------------|
| 2020 | 5  | 7  | 1  | 0  | 12             | 1                |
| 2021 | 5  | 6  | 4  | 0  | 11             | 4                |
| 2022 | 4  | 3  | 2  | 0  | 7              | 2                |
| 2023 | 3  | 3  | 4  | 0  | 6              | 4                |
| 2024 | 9  | 5  | 4  | 3  | 14             | 7                |
| 2025 | 0  | 18 | 15 | 15 | 18             | 30               |
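The language-view share behind the trend can be recomputed directly from the yearly counts; the counts below are copied from the table, and the share is simply the language count over the yearly total:

```python
# Yearly language-view share from the table's visual/language counts.
counts = {  # year: (visual = L1 + L2, language = L3 + L4)
    2020: (12, 1), 2021: (11, 4), 2022: (7, 2),
    2023: (6, 4), 2024: (14, 7), 2025: (18, 30),
}
share = {year: lang / (vis + lang) for year, (vis, lang) in counts.items()}
for year in sorted(share):
    print(f"{year}: {share[year]:.1%}")
assert share[2020] < 0.10 < 0.60 < share[2025]  # the language view now dominates
```

The share climbs from under a tenth of the literature in 2020 to well over half by 2025, which is the migration the cards above describe.
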

In-progress 2026 entries are intentionally kept outside this trend view, so the table reflects complete years only.

Benchmarks

Evaluation needs diagnostic depth, not only scores.

Benchmark design should reveal why factual fidelity fails, where the failure occurs, and whether the system can ground its explanation with checkable evidence.

LMV · 14 benchmarks

Local Manipulation Video

The deepest protocol tradition, emphasizing localized forensic residue, compression robustness, and cross-dataset transfer.

AVE · 12 benchmarks

Audio-Visual Editing

Smaller but tightly aligned with synchrony, speaker-content consistency, and timestamped localization.

GVS · 20 benchmarks

Generative Video Synthesis

Rapidly changing benchmarks targeting cross-generator transfer, semantic fabrication, plausibility, and world-level reasoning.

Adjacency · 18 diagnostics

Physics, World Dynamics, Explanation

Adjacent diagnostics matter because trustworthy detection increasingly depends on physical plausibility and grounded explanations.

What the next benchmark regime should look like

Claim-level supervision, timestamped evidence, shortcut-resistant stress tests, and continuously refreshed generator pools are all necessary if AIGC-V detection is to stay relevant under fast-moving model turnover.

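One hypothetical way to make that regime concrete is a per-item record that pairs claim-level labels with timestamped evidence spans and a stress variant; every field name here is an illustrative assumption, not an existing benchmark format:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """Hypothetical claim-level benchmark record; all fields are illustrative."""
    video_id: str
    generator: str                                      # drawn from a refreshed pool
    claims: list = field(default_factory=list)          # (claim text, label) pairs
    evidence_spans: list = field(default_factory=list)  # (start_s, end_s, claim_idx)
    stress_variant: str = "none"                        # e.g. "recompressed", "cropped"

item = BenchmarkItem(
    video_id="clip-0001",
    generator="held-out",
    claims=[("the speaker made this statement at the depicted event", "refuted")],
    evidence_spans=[(3.2, 5.8, 0)],
    stress_variant="recompressed",
)
assert item.evidence_spans[0][2] == 0  # span points back at the first claim
```

Records of this shape make it possible to score localization and explanation, not just the clip label.
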
Paper List

The full categorized paper library.

Browse the paper-first index by paradigm, detection layer, benchmark family, and adjacent diagnostic category.

Complete categorized index

Move cleanly across paradigms, detection layers, and benchmark families, then open the corresponding reference tables below.

Navigation Tip

Use Methods Overview for the taxonomy first, then switch to paradigms, layer pages, or benchmark families when you want a more specific slice of the literature.

Overview

Repository-level entry points and the complete flattened list.

Generation Paradigms

Navigate by manipulation type and synthesis regime.

Detection Layers

Follow the survey taxonomy from intrinsic cues to world-level reasoning.

Benchmarks and Diagnostics

Inspect evaluation families and adjacent diagnostic testbeds.



📚 Paper List

A flat, paper-first index in the same style as the reference list. Detailed notes and extra metadata stay on the linked section pages.

Paradigms

Local Manipulation

  • [2025] FakeParts: a New Family of AI-Generated DeepFakes. [paper]
  • [2025] FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection. [paper]
  • [2025] DynamicFace: High-quality and consistent face swapping for image and video using composable 3D facial priors. [paper]
  • [2024] FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images. [paper]

Audio-Visual Editing

  • [2025] SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion. [paper]
  • [2025] Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation. [paper]
  • [2024] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis. [paper]
  • [2023] GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. [paper]
  • [2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing in the Wild. [paper]

Generative Video Synthesis

  • [2026] Official Launch of Seedance 2.0. [paper]
  • [2025] Show-1: Marrying pixel and latent diffusion models for text-to-video generation. [paper]
  • [2025] Kling O1: Unified Multimodal Video Model. [paper]
  • [2025] Veo 3. [paper]
  • [2024] Grid diffusion models for text-to-video generation. [paper]
  • [2024] Sora 2. [paper]
  • [2024] Introducing Gen-3 Alpha: A New Frontier for Video Generation. [paper]
  • [2024] Dream Machine. [paper]
  • [2023] Scalable Diffusion Models with Transformers. [paper]
  • [2022] Video Diffusion Models. [paper]
  • [2022] Imagen Video: High Definition Video Generation with Diffusion Models. [paper]
  • [2022] Make-A-Video: Text-to-Video Generation without Text-Video Data. [paper]

Layer 1: Intrinsic Visual Cues

A. Pixel and geometric artifacts

  • [12/2024] FreqBlender: Enhancing deepfake detection by blending frequency knowledge. [paper]
  • [10/2024] Real appearance modeling for more general deepfake detection. [paper]
  • [06/2024] Beyond deepfake images: Detecting AI-generated videos. [paper]
  • [06/2023] Noise based deepfake detection via multi-head relative-interaction. [paper]
  • [10/2022] Hierarchical contrastive inconsistency learning for deepfake video detection. [paper]
  • [06/2021] MagDR: Mask-Guided Detection and Reconstruction for Defending Deepfakes. [paper]
  • [06/2021] Improving the efficiency and robustness of deepfakes detection through precise geometric features. [paper]
  • [05/2019] Exposing deep fakes using inconsistent head poses. [paper]

B. Physiological features

  • [02/2024] Local attention and long-distance interaction of rPPG for deepfake detection. [paper]
  • [07/2022] Visual Representations of Physiological Signals for Fake Video Detection. [paper]
  • [10/2021] Exposing deepfake with pixel-wise AR and PPG correlation from faint signals. [paper]
  • [10/2021] A study on effective use of BPM information in deepfake detection. [paper]
  • [10/2020] DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms. [paper]
  • [10/2020] DeepFakesON-Phys: Deepfakes detection based on heart rate estimation. [paper]
  • [09/2020] How do the hearts of deep fakes beat? Deep fake source detection via interpreting residuals with biological signals. [paper]
  • [07/2020] FakeCatcher: Detection of synthetic portrait videos using biological signals. [paper]
  • [10/2019] Predicting Heart Rate Variations of Deepfake Videos using Neural ODE. [paper]
  • [12/2018] In ictu oculi: Exposing ai created fake videos by detecting eye blinking. [paper]

C. Distribution discrepancy and robustness

  • [03/2026] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection. [paper]
  • [12/2024] Can we leave deepfake data behind in training deepfake detector? [paper]
  • [10/2024] Fake It till You Make It: Curricular Dynamic Forgery Augmentations Towards General Deepfake Detection. [paper]
  • [06/2024] Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection. [paper]
  • [06/2024] Exploiting style latent flows for generalizing deepfake video detection. [paper]
  • [06/2024] Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos. [paper]
  • [10/2023] SeeABLE: Soft discrepancies and bounded contrastive learning for exposing deepfakes. [paper]
  • [10/2023] Quality-agnostic deepfake detection with intra-model collaborative learning. [paper]
  • [12/2022] OST: Improving generalization of deepfake detection via one-shot test-time training. [paper]
  • [10/2020] Towards generalizable deepfake detection with locality-aware autoencoder. [paper]

Layer 2: Spatiotemporal Consistency

A. Temporal and motion inconsistencies

  • [06/2025] GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection. [paper]
  • [06/2025] Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning. [paper]
  • [01/2025] Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection. [paper]
  • [11/2024] Learning natural consistency representation for face forgery video detection. [paper]
  • [06/2024] Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection. [paper]
  • [02/2024] DeCoF: Generated video detection via frame consistency. [paper]
  • [10/2023] Dynamic difference learning with spatio-temporal correlation for deepfake video detection. [paper]
  • [10/2023] TALL: Thumbnail layout for deepfake video detection. [paper]
  • [07/2022] Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection. [paper]
  • [05/2022] Combining efficientnet and vision transformers for video deepfake detection. [paper]
  • [02/2022] Delving into the local: Dynamic inconsistency learning for deepfake video detection. [paper]
  • [10/2021] Spatiotemporal inconsistency learning for deepfake video detection. [paper]
  • [08/2021] Detecting Deepfake Videos with Temporal Dropout 3DCNN. [paper]
  • [08/2021] Dynamic Inconsistency-aware DeepFake Video Detection. [paper]
  • [01/2021] Interpretable and trustworthy deepfake detection via dynamic prototypes. [paper]
  • [10/2020] Two-branch recurrent network for isolating deepfakes in videos. [paper]
  • [10/2020] Sharp multiple instance learning for deepfake video detection. [paper]
  • [07/2020] FSSpotter: Spotting face-swapped video by spatial and temporal clues. [paper]

B. Physical and frequency artifacts

  • [01/2026] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations. [paper]
  • [10/2025] D3: Training-Free AI-Generated Video Detection Using Second-Order Features. [paper]
  • [10/2025] Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection. [paper]
  • [07/2025] AI-Generated Video Detection via Perceptual Straightening. [paper]
  • [07/2025] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection. [paper]
  • [07/2025] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection. [paper]
  • [07/2025] De-Fake: Style based Anomaly Deepfake Detection. [paper]
  • [06/2025] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation. [paper]
  • [03/2025] VoD: Learning Volume of Differences for Video-Based Deepfake Detection. [paper]
  • [01/2025] DiffFake: Exposing Deepfakes using Differential Anomaly Detection. [paper]
  • [12/2024] DIP: diffusion learning of inconsistency pattern for general deepfake detection. [paper]
  • [11/2024] A quality-centric framework for generic deepfake detection. [paper]
  • [06/2020] Towards untrusted social video verification to combat deepfakes via face geometry consistency. [paper]

C. Human behavioral and interaction dynamics

  • [09/2025] DeepFake Detection in Dyadic Video Calls using Point of Gaze Tracking. [paper]
  • [08/2025] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework. [paper]
  • [06/2025] Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations. [paper]
  • [10/2023] Exploiting complementary dynamic incoherence for deepfake video detection. [paper]
  • [06/2021] Lips Don't Lie: A Generalisable and Robust Approach To Face Forgery Detection. [paper]
  • [12/2020] Detecting deep-fake videos from appearance and behavior. [paper]
  • [12/2020] Identity-driven deepfake detection. [paper]
  • [03/2020] Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues. [paper]

Layer 3: Cross-Modal Consistency

A. Audio-visual consistency detection

  • [03/2026] X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection. [paper]
  • [01/2026] Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes. [paper]
  • [10/2025] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis. [paper]
  • [10/2025] KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features. [paper]
  • [05/2025] CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation. [paper]
  • [04/2025] Multi-modal deepfake detection via multi-task audio-visual prompt learning. [paper]
  • [06/2024] Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch. [paper]
  • [06/2024] AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection. [paper]
  • [06/2024] Zero-Shot Fake Video Detection by Audio-Visual Consistency. [paper]
  • [11/2023] Voice-Face Homogeneity Tells Deepfake. [paper]
  • [10/2023] Integrating Audio-Visual Features for Multimodal Deepfake Detection. [paper]
  • [11/2022] Lip Sync Matters: A Novel Multimodal Forgery Detector. [paper]
  • [04/2022] Audio-Visual Person-of-Interest DeepFake Detection. [paper]
  • [03/2022] An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection. [paper]
  • [10/2021] Joint Audio-Visual Deepfake Detection. [paper]
  • [07/2021] DeepFake Videos Detection Using Self-Supervised Decoupling Network. [paper]
  • [06/2021] Detecting Deep-Fake Videos From Aural and Oral Dynamics. [paper]
  • [12/2020] Preventing DeepFake Attacks on Speaker Authentication by Dynamic Lip Movement Analysis. [paper]
  • [10/2020] Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization. [paper]

B. Text-video semantic consistency reasoning

  • [07/2025] T^3SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms. [paper]
  • [06/2025] Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation. [paper]
  • [04/2025] Consistency-aware Fake Videos Detection on Short Video Platforms. [paper]

C. Robust learning and temporal localization

  • [02/2026] Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization. [paper]
  • [01/2026] A-V Representation Learning via Audio Shift Prediction for Multimodal Deepfake Detection and Temporal Localization. [paper]
  • [10/2025] HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training. [paper]
  • [10/2025] A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization. [paper]
  • [08/2025] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection. [paper]
  • [08/2025] Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning. [paper]
  • [06/2025] Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning. [paper]
  • [04/2025] Audio-Visual Deepfake Detection With Local Temporal Inconsistencies. [paper]
  • [11/2024] DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization. [paper]
  • [04/2024] Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection. [paper]
  • [06/2023] Self-supervised video forensics by audio-visual anomaly detection. [paper]

Layer 4: World-Level Reasoning

A. Prompts and adapters for representation calibration

  • [07/2025] Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection. [paper]
  • [06/2025] AuthGuard: Generalizable Deepfake Detection via Language Guidance. [paper]
  • [04/2025] Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection. [paper]
  • [01/2025] DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection. [paper]
  • [11/2024] Prompt-guided Multi-modal contrastive learning for Cross-compression-rate Deepfake Detection. [paper]
  • [11/2024] On Using rPPG Signals for DeepFake Detection: A Cautionary Note. [paper]
  • [06/2024] Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics. [paper]
  • [06/2024] How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception. [paper]

B. Tool-augmented agents for evidence gathering

  • [12/2025] DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection. [paper]
  • [08/2025] Memory-Anchored Multimodal Reasoning for Explainable Video Forensics. [paper]
  • [06/2025] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning. [paper]
  • [02/2025] LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection. [paper]

C. Post-training, preferences and rewards

  • [02/2026] VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning. [paper]
  • [12/2025] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning. [paper]
  • [10/2025] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL. [paper]
  • [10/2025] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning. [paper]
  • [09/2025] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs. [paper]
  • [08/2025] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning. [paper]
  • [07/2025] BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM. [paper]
  • [05/2025] BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation. [paper]
  • [10/2024] X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection. [paper]

Benchmarks: Local Manipulation Video

Local Manipulation Video (LMV)

  • [02/2026] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models. [paper]
  • [11/2025] ExDDV: A new dataset for explainable deepfake detection in video. [paper]
  • [09/2024] Common sense reasoning for deepfake detection. [paper]
  • [06/2024] AI-Face: A million-scale demographically annotated AI-generated face dataset and fairness benchmark. [paper]
  • [07/2023] DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection. [paper]
  • [06/2023] DF-Platter: Multi-Face Heterogeneous Deepfake Dataset. [paper]
  • [01/2023] A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials. [paper]
  • [10/2021] KoDF: A large-scale Korean deepfake detection dataset. [paper]
  • [06/2021] ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis. [paper]
  • [10/2020] WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection. [paper]
  • [06/2020] The DeepFake Detection Challenge (DFDC) Dataset. [paper]
  • [05/2020] DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection. [paper]
  • [09/2019] Celeb-DF: A large-scale challenging dataset for deepfake forensics. [paper]
  • [01/2019] FaceForensics++: Learning to detect manipulated facial images. [paper]

Benchmarks: Audio-Visual Editing

Audio-Visual Editing (AVE)

  • [03/2026] X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection. [paper]
  • [10/2025] AV-Deepfake1M++: A large-scale audio-visual deepfake benchmark with real-world perturbations. [paper]
  • [08/2025] VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. [paper]
  • [08/2025] Memory-Anchored Multimodal Reasoning for Explainable Video Forensics. [paper]
  • [07/2025] SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms. [paper]
  • [05/2025] Tell me Habibi, is it Real or Fake? [paper]
  • [05/2025] MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark. [paper]
  • [05/2025] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection. [paper]
  • [10/2024] AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset. [paper]
  • [08/2024] WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection. [paper]
  • [11/2022] Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization. [paper]
  • [08/2021] FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. [paper]

Benchmarks: Generative Video Synthesis

Generative Video Synthesis (GVS)

  • [02/2026] SynthForensics: A Multi-Generator Benchmark for Detecting Synthetic Video Deepfakes. [paper]
  • [02/2026] VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning. [paper]
  • [01/2026] Your One-Stop Solution for AI-Generated Video Detection. [paper]
  • [12/2025] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning. [paper]
  • [12/2025] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans? [paper]
  • [10/2025] EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning. [paper]
  • [10/2025] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences. [paper]
  • [09/2025] Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs. [paper]
  • [07/2025] BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM. [paper]
  • [06/2025] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos. [paper]
  • [06/2025] IVY-FAKE: A unified explainable framework and benchmark for image and video AIGC detection. [paper]
  • [06/2025] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning. [paper]
  • [05/2025] BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation. [paper]
  • [04/2025] LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models. [paper]
  • [03/2025] Deepfake-Eval-2024: A multi-modal in-the-wild benchmark of deepfakes circulated in 2024. [paper]
  • [01/2025] GenVidBench: A challenging benchmark for detecting AI-generated video. [paper]
  • [12/2024] On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection. [paper]
  • [05/2024] Distinguish any fake videos: Unleashing the power of large-scale data and motion features. [paper]
  • [05/2024] DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark. [paper]
  • [02/2024] Detecting AI-Generated Video via Frame Consistency. [paper]

Benchmarks: Adjacent Diagnostics

A. Physical Rule Violations

  • [03/2026] Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning. [paper]
  • [01/2026] VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. [paper]
  • [07/2025] PhyWorldBench: A Physical Realism Benchmark for Text-to-Video Generation. [paper]
  • [05/2025] T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation. [paper]
  • [04/2025] Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments. [paper]
  • [03/2025] Impossible Videos. [paper]
  • [01/2025] Do generative video models understand physical principles? [paper]
  • [06/2024] VideoPhy: Evaluating Physical Commonsense for Video Generation. [paper]

B. World Dynamics and Causality

  • [12/2025] SVBench: Evaluation of Video Generation Models on Social Reasoning. [paper]
  • [10/2025] VideoVerse: How Far is Your T2V Generator from a World Model? [paper]
  • [07/2025] T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation. [paper]
  • [12/2024] Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation. [paper]
  • [10/2024] WorldSimBench: Towards Video Generation Models as World Simulators. [paper]
  • [10/2024] Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation. [paper]

C. Explanation-Oriented Diagnosis

  • [12/2025] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding. [paper]
  • [12/2025] PhyDetEx: A Benchmark Dataset and Method for Detecting and Explaining Physical Plausibility in Text-to-Video Models. [paper]
  • [11/2025] SPOTLIGHT: Identifying and Localizing Video Generation Errors Using VLMs. [paper]
  • [10/2025] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility. [paper]

Challenges and Future

Toward evidence-first and trustworthy AIGC-V detection.

The next phase is not just stronger clip-level discrimination. It requires diagnostic evaluation, claim-level supervision, unified explainable detection, and deployment protocols that preserve calibration when evidence is weak, conflicting, or incomplete.

01

Robust diagnostic evaluation

Move beyond clip-level AUC or EER. Evaluation should reveal which factual proposition fails, where the violation occurs, and how brittle the detector becomes under shift, transfer, codec changes, and modern synthesis pipelines.
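For reference, the clip-level metrics that this point argues are insufficient on their own take only a few lines; this is a standard Mann-Whitney AUC and a threshold-sweep EER, assuming the convention that a higher score means more likely fake:

```python
import numpy as np

def roc_auc(scores_real: np.ndarray, scores_fake: np.ndarray) -> float:
    """Clip-level AUC via the Mann-Whitney statistic: the probability
    that a fake clip outscores a real one (ties count half)."""
    wins = sum(float(f > r) + 0.5 * float(f == r)
               for f in scores_fake for r in scores_real)
    return wins / (len(scores_fake) * len(scores_real))

def eer(scores_real: np.ndarray, scores_fake: np.ndarray) -> float:
    """Equal error rate: sweep thresholds and take the point where the
    larger of false-accept and false-reject rates is smallest."""
    best = 1.0
    for t in np.concatenate([scores_real, scores_fake]):
        far = float(np.mean(scores_real >= t))  # real clips flagged as fake
        frr = float(np.mean(scores_fake < t))   # fake clips passed as real
        best = min(best, max(far, frr))
    return best

# Toy usage: a perfectly separating detector.
real = np.array([0.1, 0.2, 0.3, 0.4])
fake = np.array([0.6, 0.7, 0.8, 0.9])
assert roc_auc(real, fake) == 1.0 and eer(real, fake) == 0.0
```

The critique stands precisely because both numbers collapse all failure modes into one scalar; diagnostic evaluation asks which proposition failed and where.
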

02

Claim-level and dynamic evaluation

Benchmarks should move from clip labels to checkable claims with timestamped evidence, targeted stress tests, and continuously refreshed generator pools rather than static one-off test sets.

03

Unified explainable detection

Trustworthy AIGC-V detection is a two-pathway problem: perceptual evidence plus fact-level verification. Cross-layer fusion should preserve low-level traces, localized mismatches, and tested claims as individually inspectable evidence objects.

04

Evidence-first trustworthy detection

Systems should identify, localize, and explain with calibrated uncertainty, provenance-aware cross-checks, and abstention when evidence remains incomplete or internally contradictory.

Broader research direction

Progress on trustworthy AIGC-V detection will likely require tighter collaboration between CV and NLP: CV contributes grounded perceptual and spatiotemporal evidence, while NLP contributes claim decomposition, retrieval, reasoning, grounding, and evidence-aware explanation in a unified evidence graph that also supports provenance checks, calibration, and abstention.

Citation


If you use this survey or its taxonomy in your work, please cite it as follows.

@misc{hou2026detecting,
  title        = {Detecting AI-Generated Video: A Vision-Language Dual-View Survey},
  author       = {Dylan Xinming Hou and Juntian Zhang and Xu Gu and Yichen Wu and
                  Nils Lukas and Gus Xia and Xiuying Chen and Yuhan Liu},
  year         = {2026},
  note         = {Survey manuscript},
  howpublished = {\url{assets/paper.pdf}}
}