Vol. 14 (2025), Dossier

Textual patterns and machine learning classification in academic writing: a linguistic analysis of theses and dissertations from a Brazilian graduate program

Published 2025-10-13

Crysttian Arantes Paixão

Federal University of Bahia

https://orcid.org/0000-0002-3809-4490

Keywords

Computational Linguistics
Brazilian Portuguese
Corpus Linguistics

How to Cite

Paixão CA. Textual patterns and machine learning classification in academic writing: a linguistic analysis of theses and dissertations from a Brazilian graduate program. J. of Speech Sci. [Internet]. 2025 Oct. 13 [cited 2026 Mar. 15];14(00):e025018. Available from: https://econtents.sbu.unicamp.br/inpec/index.php/joss/article/view/20586

Abstract

This study investigates linguistic patterns in academic texts produced within the Graduate Program in Linguistic Studies (PosLin) at the Federal University of Minas Gerais. A corpus of 1,270 documents (730 master's dissertations and 540 doctoral theses) was compiled and analyzed using computational linguistic techniques. Exploratory analyses included the extraction of unigrams, bigrams, and trigrams, as well as the classification of the most frequent tokens into morphological categories (nouns, verbs, adjectives, and adverbs). Despite the shared institutional context and research tracks, subtle differences in lexical and structural features were observed between the two academic levels. To evaluate whether these differences could support automated classification, machine learning models were trained on bag-of-words representations of the texts. Gradient Boosting emerged as the most effective algorithm, achieving an AUC of 0.989 with only the 1,000 most frequent tokens, which indicates that high classification performance can be reached without extensive computational overhead. The results show that textual analysis combined with supervised learning can effectively distinguish academic genres within a single graduate program. Furthermore, the approach holds potential for broader applications in genre classification, fake news detection, and discourse analysis. This study also reinforces the importance of continued research in computational linguistics for underrepresented languages such as Brazilian Portuguese, especially in the context of formal and academic writing.
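
To make the feature-extraction steps named in the abstract concrete, the following is a minimal Python sketch of n-gram extraction, a bag-of-words representation restricted to the k most frequent tokens, and the AUC metric. It is an illustration under stated assumptions, not the author's pipeline: the whitespace tokenizer, all function names, and the toy sentences are hypothetical, and the Gradient Boosting classifier itself is not reproduced here.

```python
from collections import Counter
from typing import Iterable


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    tokens = (tok.strip(".,;:!?()[]\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Contiguous n-grams: n=1 unigrams, n=2 bigrams, n=3 trigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_vocabulary(docs: Iterable[list[str]], k: int) -> list[str]:
    """The k most frequent tokens across the whole corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [tok for tok, _ in counts.most_common(k)]


def bag_of_words(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count vector of one document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]


def auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. Quadratic in the number of documents, which is fine
    for a sketch but not for large corpora.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy sentences invented for illustration; not taken from the corpus.
    raw = ["A tese analisa o corpus.", "A dissertação descreve o corpus."]
    docs = [tokenize(text) for text in raw]
    vocabulary = top_k_vocabulary(docs, k=3)
    print(ngrams(docs[0], 2))
    print([bag_of_words(doc, vocabulary) for doc in docs])
```

In a setup like the one the abstract describes, `k` would be 1,000 and the resulting count vectors would be passed to a supervised classifier, whose predicted scores would then be evaluated with a function such as `auc`.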


This work is licensed under a Creative Commons Attribution 4.0 International License.
