Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (06): 155-167. doi: 10.12052/gdutxb.230130

• Artificial Intelligence •

A Survey of Deepfake Detection Techniques Based on Transformer

Lai Zhi-mao1,2, Zhang Yun1, Li Dong1   

  1. School of Automation, Guangdong University of Technology, Guangzhou 510006, China;
    2. School of Immigration Administration (Guangzhou), China People's Police University, Guangzhou 510663, China
  Received: 2023-10-08    Online: 2023-11-25    Published: 2023-11-08

Abstract: Deepfake detection aims to authenticate facial images and videos, providing operational and technical support for safeguarding personal portrait rights, preventing fake news, and curbing online fraud. Early detection techniques are usually based on convolutional neural networks (CNNs) and have achieved promising detection performance, but they commonly suffer from poor generalization. To improve the generalization of Deepfake detection, recent research has turned to the Transformer, a deep neural network built on self-attention mechanisms. Thanks to its long-range dependency modeling and global receptive field, the Transformer can capture contextual associations within images and temporal relationships across video frames, thereby improving the representation ability of detectors. This survey first reviews the research background of the field and then explains the common techniques used to generate Deepfakes. Existing Transformer-based detection methods are then summarized and comparatively evaluated. Finally, the challenges and future research directions of Deepfake detection are discussed.
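Note: to make the mechanism described above concrete, the sketch below implements the scaled dot-product attention of Vaswani et al. [16] in NumPy. It is an illustration only, not code from any surveyed detector; the function name, tensor sizes, and toy data are invented for this example. Because each output token is a softmax-weighted mixture of all input tokens, even a single attention layer has the global receptive field that lets Transformer-based detectors relate distant image patches and video frames.

    import numpy as np

    def scaled_dot_product_attention(q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per Vaswani et al. [16]
        d_k = q.shape[-1]
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, tokens, tokens) affinities
        scores -= scores.max(axis=-1, keepdims=True)      # stabilize the softmax numerically
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # normalize over all key positions
        return weights @ v                                # each output mixes every input token

    # Toy self-attention over 8 patch tokens with 16-dimensional embeddings.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 8, 16))                   # (batch, tokens, embedding dim)
    out = scaled_dot_product_attention(x, x, x)           # self-attention: Q = K = V = x
    print(out.shape)                                      # -> (1, 8, 16)

In a Vision Transformer [17], x would hold linear embeddings of 16×16 image patches; the detectors surveyed here stack such layers, typically with multiple heads, over patch or frame tokens.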

Key words: Deepfake, detection techniques, Transformer, generation techniques, self-attention mechanisms

CLC Number: TP183
[1] ROETTGERS J. Porn producers offer to help Hollywood take down deepfake videos[EB/OL]. (2018-02-21) [2023-09-30]. https://variety.com/2018/digital/news/deepfakes-porn-adult-industry-12027057-49/.
[2] ROSSLER A, COZZOLINO D, VERDOLIVA L, et al. FaceForensics++: learning to detect manipulated facial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 1-11.
[3] QIAN Y, YIN G, SHENG L, et al. Thinking in frequency: face forgery detection by mining frequency-aware clues[C]//European Conference on Computer Vision. Online: Springer, 2020: 86-103.
[4] MASI I, KILLEKAR A, MASCARENHAS R M, et al. Two-branch recurrent network for isolating deepfakes in videos[C] //European Conference on Computer Vision. Online: Springer, 2020: 667-684.
[5] LI L, BAO J, ZHANG T, et al. Face x-ray for more general face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 5001-5010.
[6] ZHAO H, ZHOU W, CHEN D, et al. Multi-attentional deepfake detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 2185-2194.
[7] LIU H, LI X, ZHOU W, et al. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 772-781.
[8] ZHAO T, XU X, XU M, et al. Learning self-consistency for deepfake detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 15023-15033.
[9] HALIASSOS A, VOUGIOUKAS K, PETRIDIS S, et al. Lips don't lie: a generalisable and robust approach to face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 5039-5049.
[10] CHUGH K, GUPTA P, DHALL A, et al. Not made for each other: audio-visual dissonance-based deepfake detection and localization[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 439-447.
[11] MITTAL T, BHATTACHARYA U, CHANDRA R, et al. Emotions don't lie: an audio-visual deepfake detection method using affective cues[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2823-2832.
[12] ZHOU Y, LIM S N. Joint audio-visual deepfake detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 14800-14809.
[13] ZHOU D, KANG B, JIN X, et al. DeepViT: towards deeper vision transformer[EB/OL]. arXiv: 2103.11886 (2021-03-21) [2023-09-20]. https://arxiv.org/abs/2103.11886.
[14] PENG Z, GUO Z, HUANG W, et al. Conformer: local features coupling global representations for recognition and detection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(8): 9454-9468.
[15] GONG C, WANG D, LI M, et al. Vision transformers with patch diversification[EB/OL]. arXiv: 2104.12753 (2021-06-11) [2023-09-20]. https://arxiv.org/abs/2104.12753.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates, 2017: 5998-6008.
[17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C]//Proceedings of the 9th International Conference on Learning Representations. Online: ACM, 2021: 1-6.
[18] LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 10012-10022.
[19] CHEN C F, FAN Q, PANDA R. CrossViT: cross-attention multi-scale vision transformer for image classification[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 357-366.
[20] Deepfakes[EB/OL]. (2019-10-01) [2023-09-20]. https://github.com/deepfakes/faceswap.
[21] KORSHUNOVA I, SHI W, DAMBRE J, et al. Fast face-swap using convolutional neural networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 3677-3685.
[22] LI L, BAO J, YANG H, et al. Advancing high fidelity identity swapping for forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 5074-5083.
[23] CHEN R, CHEN X, NI B, et al. Simswap: an efficient framework for high fidelity face swapping[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2003-2011.
[24] THIES J, ZOLLHOFER M, STAMMINGER M, et al. Face2Face: real-time face capture and reenactment of RGB videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 2387-2395.
[25] NIRKIN Y, KELLER Y, HASSNER T. FSGAN: subject agnostic face swapping and reenactment[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 7184-7193.
[26] ZHU K, LI L, ZHANG T, et al. A survey of vision transformer in low-level computer vision[J/OL]. Computer Engineering and Applications. https://link.cnki.net/urlid/11.2127.TP.20230817.1249.004.
[27] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//International Conference on Machine Learning. Online: PMLR, 2021: 10347-10357.
[28] WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 22-31.
[29] D'ASCOLI S, TOUVRON H, LEAVITT M L, et al. ConViT: improving vision transformers with soft convolutional inductive biases[C]//Proceedings of the 38th International Conference on Machine Learning. Online: PMLR, 2021: 2286-2296.
[30] CHU X, TIAN Z, WANG Y, et al. Twins: revisiting the design of spatial attention in vision transformers [J]. Advances in Neural Information Processing Systems, 2021, 34: 9355-9366.
[31] CHEN R, PANDA R, FAN Q. RegionViT: regional-to-local attention for vision transformers[C]//International Conference on Learning Representations. Online: ACM, 2022.
[32] WANG W, XIE E, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 568-578.
[33] YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: training vision transformers from scratch on ImageNet[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 558-567.
[34] ZHOU D, KANG B, JIN X, et al. DeepViT: towards deeper vision transformer[EB/OL]. arXiv: 2103.11886 (2021-04-19) [2023-09-20]. https://doi.org/10.48550/arXiv.2103.11886.
[35] TOUVRON H, CORD M, SABLAYROLLES A, et al. Going deeper with image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 32-42.
[36] DONG X, BAO J, CHEN D, et al. Protecting celebrities from deepfake with identity consistency transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE, 2022: 9468-9478.
[37] CHEN H, LIN Y, LI B, et al. Learning features of intra-consistency and inter-diversity: keys toward generalizable deepfake detection [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1468-1480.
[38] WANG J, WU Z, OUYANG W, et al. M2TR: multi-modal multi-scale transformers for deepfake detection[C]//Proceedings of the 2022 International Conference on Multimedia Retrieval. Newark, USA: ACM, 2022: 615-623.
[39] TAN Z, YANG Z, MIAO C, et al. Transformer-based feature compensation and aggregation for deepfake detection [J]. IEEE Signal Processing Letters, 2022, 29: 2183-2187.
[40] MIAO C, TAN Z, CHU Q, et al. Hierarchical frequency-assisted interactive networks for face manipulation detection [J]. IEEE Transactions on Information Forensics and Security, 2022, 17: 3008-3021.
[41] MIAO C, TAN Z, CHU Q, et al. F2Trans: high-frequency fine-grained transformer for face forgery detection [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 1039-1051.
[42] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
[43] ZHENG Y, BAO J, CHEN D, et al. Exploring temporal coherence for more general video face forgery detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 15044-15054.
[44] GUAN J, ZHOU H, HONG Z, et al. Delving into sequential patches for deepfake detection [J]. Advances in Neural Information Processing Systems, 2022, 35: 4517-4530.
[45] ZHAO C, WANG C, HU G, et al. ISTVT: interpretable spatial-temporal video transformer for deepfake detection [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 1335-1348.
[46] YU Y, NI R, ZHAO Y, et al. MSVT: multiple spatiotemporal views transformer for deepfake video detection [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 4462-4471.
[47] CHENG H, GUO Y, WANG T, et al. Voice-face homogeneity tells deepfake[EB/OL]. arXiv: 2203.02195 (2022-06-13) [2023-09-20]. https://doi.org/10.48550/arXiv.2203.02195.
[48] ILYAS H, JAVED A, MALIK K M. AVFakeNet: a unified end-to-end dense Swin Transformer deep learning model for audio-visual deepfakes detection [J]. Applied Soft Computing, 2023, 136: 110124.
[49] YANG W, ZHOU X, CHEN Z, et al. AVoiD-DF: audio-visual joint learning for detecting deepfake [J]. IEEE Transactions on Information Forensics and Security, 2023, 18: 2015-2029.
[50] FENG C, CHEN Z, OWENS A. Self-supervised video forensics by audio-visual anomaly detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE, 2023: 10491-10503.
[51] YU Y, LIU X, NI R, et al. PVASS-MDD: predictive visual-audio alignment self-supervision for multimodal deepfake detection[J/OL]. IEEE Transactions on Circuits and Systems for Video Technology. https://ieeexplore.ieee.org/document/10233898.
[52] DeepfakeDetection[EB/OL]. (2019-10-01) [2023-09-30]. https://github.com/ondyari/FaceForensics.
[53] DOLHANSKY B, HOWES R, PFLAUM B, et al. The deepfake detection challenge (DFDC) preview dataset[EB/OL]. arXiv: 1910.08854 (2019-10-23) [2023-09-20]. https://doi.org/10.48550/arXiv.1910.08854.
[54] LI Y, YANG X, SUN P, et al. Celeb-DF: a large-scale challenging dataset for deepfake forensics[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 3207-3216.
[55] LI Y, YANG X, SUN P, et al. Celeb-DF (v2): a new dataset for deepfake forensics[EB/OL]. arXiv: 1909.12962v3 (2019-11-22) [2023-09-20]. https://doi.org/10.48550/arXiv.1909.12962.
[56] ZI B, CHANG M, CHEN J, et al. WildDeepfake: a challenging real-world dataset for deepfake detection[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 2382-2390.
[57] DOLHANSKY B, BITTON J, PFLAUM B, et al. The deepfake detection challenge (DFDC) dataset[EB/OL]. arXiv: 2006.07397 (2020-10-28) [2023-09-20]. https://doi.org/10.48550/arXiv.2006.07397.
[58] JIANG L, LI R, WU W, et al. Deeperforensics-1.0: a large-scale dataset for real-world face forgery detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 2889-2898.
[59] DONG X, BAO J, CHEN D, et al. Identity-driven deepfake detection[EB/OL]. arXiv: 2012.03930 (2022-09-07) [2023-09-20]. https://doi.org/10.48550/arXiv.2012.03930.
[60] ZHOU T, WANG W, LIANG Z, et al. Face forensics in the wild[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 5778-5788.
[61] HE Y, GAN B, CHEN S, et al. ForgeryNet: a versatile benchmark for comprehensive forgery analysis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Online: IEEE, 2021: 4360-4369.
[62] KWON P, YOU J, NAM G, et al. KoDF: a large-scale Korean deepfake detection dataset[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE, 2021: 10744-10753.
[63] CAI Z, STEFANOV K, DHALL A, et al. Do you really mean that? content driven audio-visual deepfake dataset and multimodal method for temporal forgery localization[C]//2022 International Conference on Digital Image Computing: Techniques and Applications. Online: IEEE, 2022: 1-10.
[64] NAGRANI A, CHUNG J, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset[EB/OL]. arXiv: 1706.08612 (2017-06-26) [2023-09-20]. https://doi.org/10.48550/arXiv.1706.08612.
[65] CHUNG J, NAGRANI A, ZISSERMAN A. VoxCeleb2: deep speaker recognition[EB/OL]. arXiv: 1806.05622 (2018-06-14) [2023-09-20]. https://doi.org/10.48550/arXiv.1806.05622.