A qualitative systematic review of intra-speaker variation in the human voice

Keywords

Deepfake
Intra-speaker variation
Acoustics
Speech
Artificial intelligence

How to Cite

San Segundo E, Delgado J. A qualitative systematic review of intra-speaker variation in the human voice. J. of Speech Sci. [Internet]. 2025 Oct. 13 [cited 2025 Oct. 18];14(00):e025019. Available from: https://econtents.sbu.unicamp.br/inpec/index.php/joss/article/view/20738

Abstract

Audio deepfake detection is essential for addressing societal challenges such as distinguishing real news from fake content and authenticating voice recordings in legal contexts. However, determining whether a voice is human or AI-generated requires knowing which characteristics to examine, and the choice of voice features for this task is relatively unguided. This gap motivates the systematic review presented in this paper. Working from the hypothesis that human voices exhibit more intra-speaker variation than deepfakes, this review summarizes and analyzes the published studies on intra-speaker variation in the human voice. A systematic search was conducted in Web of Science, the Cochrane Library, and the electronic database of the International Journal of Speech, Language and the Law, initially identifying 305 studies. After removing duplicates and applying inclusion/exclusion criteria, 36 articles were selected for analysis. The findings highlight speaking style as a major factor in intra-speaker variation, affecting various acoustic parameters. The review suggests that experts may prioritize features that show higher within-speaker variation, while noting that their utility for deepfake detection must be verified on deepfake datasets.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2025 Eugenia San Segundo, Jonathan Delgado