# Audio-Visual Editing (AVE)

[Back to benchmarks](README.md) · [Back to home](../README.md)

AVE benchmarks center on synchronized multimodal manipulation: lip-audio alignment, speaker identity consistency, dubbed or spliced segments, and temporal localization.

| Date | Benchmark / Dataset | Paper | Focus | Venue |
| --- | --- | --- | --- | --- |
| 03/2026 | MMDF | [X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection](https://doi.org/10.48550/ARXIV.2603.08483)<br>Kim et al. | Multimodal deepfake dataset spanning GAN, diffusion, and flow-matching manipulations. | CVPR |
| 10/2025 | AV-Deepfake1M++ | [AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations](https://doi.org/10.1145/3746027.3761979)<br>Cai et al. | 2M clips extending AV-Deepfake1M with diverse audio-visual manipulations. | ACM MM |
| 08/2025 | VCapAV | [VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset](https://doi.org/10.21437/interspeech.2025-1713)<br>Wang et al. | Video-caption-based audio-visual AIGC-V detection dataset. | Interspeech |
| 08/2025 | X-AVFake | [Memory-Anchored Multimodal Reasoning for Explainable Video Forensics](https://arxiv.org/abs/2508.14581)<br>Chen et al. | Dual-modality manipulations with grounded language reasoning. | arXiv |
| 07/2025 | SocialDF | [SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms](https://doi.org/10.1145/3733567.3735573)<br>Batra et al. | 2,126 social-media videos covering real and deepfake content produced with state-of-the-art manipulations. | ACM MMAD Workshop |
| 05/2025 | ArEnAV | [Tell me Habibi, is it Real or Fake?](https://arxiv.org/abs/2505.22581)<br>Kuckreja et al. | Audio-visual deepfake dataset for Arabic-English code-switching (CSW). | arXiv |
| 05/2025 | MAVOS-DD | [MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark](https://arxiv.org/abs/2505.11109)<br>Croitoru et al. | Multilingual open-set benchmark for audio-visual AIGC-V detection. | arXiv |
| 05/2025 | DigiFakeAV | [Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection](https://arxiv.org/abs/2505.16512)<br>Liu et al. | Benchmark for diffusion-based digital-human audio-visual forgeries. | arXiv |
| 10/2024 | AV-Deepfake1M | [AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset](https://doi.org/10.1145/3664647.3680795)<br>Cai et al. | Large-scale LLM-driven audio-visual deepfake dataset. | ACM MM |
| 08/2024 | FakeMix | [WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection](https://arxiv.org/abs/2408.02954)<br>Jung et al. | Clip-level benchmark for manipulated audio and video segments. | arXiv |
| 11/2022 | LAV-DF | [Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization](https://doi.org/10.1109/dicta56598.2022.10034605)<br>Cai et al. | Content-driven localized audio-visual deepfake dataset. | DICTA |
| 08/2021 | FakeAVCeleb | [FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset](https://arxiv.org/abs/2108.05080)<br>Khalid et al. | Audio-visual multimodal deepfake dataset. | arXiv |
