Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (06): 155-167. doi: 10.12052/gdutxb.230130

• Artificial Intelligence •

A Survey of Deepfake Detection Techniques Based on Transformer

Lai Zhi-mao1,2, Zhang Yun1, Li Dong1

  1. School of Automation, Guangdong University of Technology, Guangzhou 510006, China;
  2. School of Immigration Administration (Guangzhou), China People's Police University, Guangzhou 510663, China
  • Received: 2023-10-08  Online: 2023-11-25  Published: 2023-11-08
  • Corresponding author: Zhang Yun (b. 1963), male, professor, Ph.D., doctoral supervisor. His research interests include complex system modeling and control, and image processing and pattern recognition. E-mail: yun@gdut.edu.cn
  • About the author: Lai Zhi-mao (b. 1987), male, lecturer, Ph.D. candidate. His research interests include artificial intelligence and multimedia information security.
  • Funding: National Natural Science Foundation of China (62271349); Guangdong Basic and Applied Basic Research Foundation (2021A1515011867); National Fund Cultivation Project of China People's Police University (JJPY202402)


Abstract: Deepfake detection aims to authenticate facial images and videos, providing theoretical and technical support for portrait-rights protection, fake-news identification, and online-fraud prevention. Early detection techniques were mainly built on convolutional neural networks (CNNs) and achieved promising performance, but they generally suffer from limited generalization. To improve generalization, recent work has introduced the Transformer, a deep neural network based on the self-attention mechanism. With its long-range dependency modeling and global receptive field, the Transformer can capture contextual associations within images and temporal relationships across video frames, effectively improving the representation ability of detectors. This survey first reviews the research background of the field and the typical techniques for generating Deepfakes, then summarizes and categorizes existing Transformer-based detection methods, and finally discusses the challenges and future research directions of Deepfake detection.
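To make the mechanism concrete: the core of the Transformer [16] is scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, in which every token attends to every other token; this is what yields the global receptive field noted above. The sketch below is a minimal illustration in the ViT style [17], applied to patch tokens of a face image. All shapes, variable names, and values are illustrative assumptions, not details of any surveyed detector.

    # Minimal sketch of scaled dot-product self-attention (assumed setup, PyTorch).
    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (batch, tokens, dim) patch embeddings; w_*: (dim, dim) projection matrices.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Each row of the attention weights relates one patch to every other patch,
        # so a forgery cue in one facial region can inform the representation of all others.
        return F.softmax(scores, dim=-1) @ v

    # Example: a 224x224 face split into 16x16 patches gives 14*14 = 196 tokens [17].
    dim = 64
    x = torch.randn(1, 196, dim)
    w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)  # -> (1, 196, 64)

The video-oriented detectors surveyed here build on this same primitive, attending across frame tokens as well as patch tokens to capture the temporal relationships mentioned above.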

Key words: Deepfake, detection techniques, Transformer, generation techniques, self-attention mechanisms

CLC number: TP183
[1] ROETTGERS J. Porn producers offer to help Hollywood take down deepfake videos[EB/OL]. (2018-02-21) [2023-09-30]. https://variety.com/2018/digital/news/deepfakes-porn-adult-industry-12027057-49/.
[2] ROSSLER A, COZZOLINO D, VERDOLIVA L, et al. Faceforensics++: learning to detect manipulated facial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 1-11.
[3] QIAN Y, YIN G, SHENG L, et al. Thinking in frequency: face forgery detection by mining frequency-aware clues[C]//European Conference on Computer Vision. Online: Springer, 2020: 86-103.
[4] MASI I, KILLEKAR A, MASCARENHAS R M, et al. Two-branch recurrent network for isolating deepfakes in videos[C] //European Conference on Computer Vision. Online: Springer, 2020: 667-684.
[5] LI L, BAO J, ZHANG T, et al. Face x-ray for more general face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 5001-5010.
[6] ZHAO H, ZHOU W, CHEN D, et al. Multi-attentional deepfake detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 2185-2194.
[7] LIU H, LI X, ZHOU W, et al. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 772-781.
[8] ZHAO T, XU X, XU M, et al. Learning self-consistency for deepfake detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 15023-15033.
[9] HALIASSOS A, VOUGIOUKAS K, PETRIDIS S, et al. Lips don't lie: a generalisable and robust approach to face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 5039-5049.
[10] CHUGH K, GUPTA P, DHALL A, et al. Not made for each other-audio-visual dissonance-based deepfake detection and localization[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 439-447.
[11] MITTAL T, BHATTACHARYA U, CHANDRA R, et al. Emotions don't lie: an audio-visual deepfake detection method using affective cues[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2823-2832.
[12] ZHOU Y, LIM S N. Joint audio-visual deepfake detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 14800-14809.
[13] ZHOU D, KANG B, JIN X, et al. DeepViT: towards deeper vision transformer[EB/OL]. arXiv: 2103.11886(2021-03-21) [2023-09-20]. https://arxiv.org/abs/2103.11886.
[14] PENG Z, GUO Z, HUANG W, et al. Conformer: local features coupling global representations for recognition and detection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(8): 9454-9468.
[15] GONG C, WANG D, LI M, et al. Vision transformers with patch diversification[EB/OL]. arXiv: 2104.12753(2021-06-11) [2023-09-20]. https://arxiv.org/abs/2104.12753.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Conference on Neural Information Processing Systems. Long Beach, USA: MIT Press, 2017: 5998-6008.
[17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C]//Proceedings of the 9th International Conference on Learning Representations. Online: ACM, 2021: 1-6.
[18] LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 10012-10022.
[19] CHEN C F, FAN Q, PANDA R. CrossViT: cross-attention multi-scale vision transformer for image classification[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 357-366.
[20] Deepfakes[EB/OL]. (2019-10-01) [2023-09-20]. https://github.com/deepfakes/faceswap.
[21] KORSHUNOVA I, SHI W, DAMBRE J, et al. Fast face-swap using convolutional neural networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 3677-3685.
[22] LI L, BAO J, YANG H, et al. Advancing high fidelity identity swapping for forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 5074-5083.
[23] CHEN R, CHEN X, NI B, et al. Simswap: an efficient framework for high fidelity face swapping[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2003-2011.
[24] THIES J, ZOLLHOFER M, STAMMINGER M, et al. Face2face: real-time face capture and reenactment of rgb videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 2387-2395.
[25] NIRKIN Y, KELLER Y, HASSNER T. FSGAN: subject agnostic face swapping and reenactment[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE, 2019: 7184-7193.
[26] ZHU K, LI L, ZHANG T, et al. A survey of vision transformer in low-level computer vision[J/OL]. Computer Engineering and Applications. https://link.cnki.net/urlid/11.2127.TP.20230817.1249.004.
[27] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//International Conference on Machine Learning. Online: PMLR, 2021: 10347-10357.
[28] WU H, XIAO B, CODELLA N, et al. CvT: Introducing convolutions to vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 22-31.
[29] D'ASCOLI S, TOUVRON H, LEAVITT M L, et al. Convit: Improving vision transformers with soft convolutional inductive biases[C]//Proceedings of the 38th International Conference on Machine Learning. Online: ACM, 2021: 2286-2296.
[30] CHU X, TIAN Z, WANG Y, et al. Twins: revisiting the design of spatial attention in vision transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 9355-9366.
[31] CHEN R, PANDA R, FAN Q. RegionViT: regional-to-local attention for vision transformers[C]//International Conference on Learning Representations. Online: ACM, 2022.
[32] WANG W, XIE E, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 568-578.
[33] YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: Training vision transformers from scratch on imagenet[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 558-567.
[34] ZHOU D, KANG B, JIN X, et al. DeepViT: towards deeper vision transformer[EB/OL]. arXiv: 2103.11886(2021-04-19) [2023-09-20]. https://doi.org/10.48550/arXiv.2103.11886.
[35] TOUVRON H, CORD M, SABLAYROLLES A, et al. Going deeper with image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 32-42.
[36] DONG X, BAO J, CHEN D, et al. Protecting celebrities from deepfake with identity consistency transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE, 2022: 9468-9478.
[37] CHEN H, LIN Y, LI B, et al. Learning features of intra-consistency and inter-diversity: keys toward generalizable deepfake detection [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1468-1480.
[38] WANG J, WU Z, OUYANG W, et al. M2tr: multi-modal multi-scale transformers for deepfake detection[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval. Newark, USA: ACM, 2022: 615-623.
[39] TAN Z, YANG Z, MIAO C, et al. Transformer-based feature compensation and aggregation for deepfake detection [J]. IEEE Signal Processing Letters, 2022, 29: 2183-2187.
[40] MIAO C, TAN Z, CHU Q, et al. Hierarchical frequency-assisted interactive networks for face manipulation detection [J]. IEEE Transactions on Information Forensics and Security, 2022, 17: 3008-3021.
[41] MIAO C, TAN Z, CHU Q, et al. F2Trans: high-frequency fine-grained transformer for face forgery detection [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 1039-1051.
[42] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
[43] ZHENG Y, BAO J, CHEN D, et al. Exploring temporal coherence for more general video face forgery detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 15044-15054.
[44] GUAN J, ZHOU H, HONG Z, et al. Delving into sequential patches for deepfake detection [J]. Advances in Neural Information Processing Systems, 2022, 35: 4517-4530.
[45] ZHAO C, WANG C, HU G, et al. ISTVT: interpretable spatial-temporal video transformer for deepfake detection [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 1335-1348.
[46] YU Y, NI R, ZHAO Y, et al. MSVT: multiple spatiotemporal views transformer for deepfake video detection [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 4462-4471.
[47] CHENG H, GUO Y, WANG T, et al. Voice-face homogeneity tells deepfake[EB/OL]. arXiv: 2203.02195(2022-06-13) [2023-09-20]. https://doi.org/10.48550/arXiv.2203.02195.
[48] ILYAS H, JAVED A, MALIK K M. AVFakeNet: A unified end-to-end dense swin transformer deep learning model for audio-visual deepfakes detection [J]. Applied Soft Computing, 2023, 136: 110124.
[49] YANG W, ZHOU X, CHEN Z, et al. AVoiD-DF: audio-visual joint learning for detecting deepfake [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 2015-2029.
[50] FENG C, CHEN Z, OWENS A. Self-supervised video forensics by audio-visual anomaly detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE, 2023: 10491-10503.
[51] YU Y, LIU X, NI R, et al. PVASS-MDD: predictive visual-audio alignment self-supervision for multimodal deepfake detection[J/OL]. IEEE Transactions on Circuits and Systems for Video Technology. https://ieeexplore.ieee.org/document/10233898.
[52] DeepfakeDetection[EB/OL]. (2019-10-01) [2023-09-30]. https://github.com/ondyari/FaceForensics.
[53] DOLHANSKY B, HOWES R, PFLAUM B, et al. The deepfake detection challenge (DFDC) preview dataset[EB/OL]. arXiv: 1910.08854(2019-10-23) [2023-09-20]. https://doi.org/10.48550/arXiv.1910.08854.
[54] LI Y, YANG X, SUN P, et al. Celeb-DF: a large-scale challenging dataset for deepfake forensics[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 3207-3216.
[55] LI Y, YANG X, SUN P, et al. Celeb-DF (v2): a new dataset for deepfake forensics[EB/OL]. arXiv: 1909.12962v3(2019-11-22) [2023-09-20]. https://doi.org/10.48550/arXiv.1909.12962.
[56] ZI B, CHANG M, CHEN J, et al. WildDeepfake: a challenging real-world dataset for deepfake detection[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2382-2390.
[57] DOLHANSKY B, BITTON J, PFLAUM B, et al. The deepfake detection challenge (DFDC) dataset[EB/OL]. arXiv: 2006.07397(2020-10-28) [2023-09-20]. https://doi.org/10.48550/arXiv.2006.07397.
[58] JIANG L, LI R, WU W, et al. Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 2889-2898.
[59] DONG X, BAO J, CHEN D, et al. Identity-driven deepfake detection[EB/OL]. arXiv: 2012.03930(2022-09-07) [2023-09-20]. https://doi.org/10.48550/arXiv.2012.03930.
[60] ZHOU T, WANG W, LIANG Z, et al. Face forensics in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 5778-5788.
[61] HE Y, GAN B, CHEN S, et al. ForgeryNet: a versatile benchmark for comprehensive forgery analysis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 4360-4369.
[62] KWON P, YOU J, NAM G, et al. KoDF: a large-scale Korean deepfake detection dataset[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 10744-10753.
[63] CAI Z, STEFANOV K, DHALL A, et al. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization[C]//2022 International Conference on Digital Image Computing: Techniques and Applications. Online: IEEE, 2022: 1-10.
[64] NAGRANI A, CHUNG J, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset[EB/OL]. arXiv: 1706.08612(2017-06-26) [2023-09-20]. https://doi.org/10.48550/arXiv.1706.08612.
[65] CHUNG J, NAGRANI A, ZISSERMAN A. VoxCeleb2: deep speaker recognition[EB/OL]. arXiv: 1806.05622(2018-06-14) [2023-09-20]. https://doi.org/10.48550/arXiv.1806.05622.