GA-LDA Approach for Topic Modeling in Turkish Accounting and Finance Articles: Performance Optimization in Text Classification
DOI:
https://doi.org/10.31181/sor21202521
Keywords:
Topic modeling, Text mining, LDA, Genetic Algorithm
Abstract
The volume of research in the social sciences is expanding rapidly, creating significant challenges in extracting meaningful insights from unstructured text, particularly from articles lacking a classification system. Analyzing these high-volume texts offers numerous advantages, including the ability to automatically identify topic relevance and track thematic trends over time. Such insights are valuable for journal management and enable researchers to access detailed information about evolving areas of study. Latent Dirichlet Allocation (LDA) is a widely used method for topic modeling, effectively extracting topics from textual data. However, its performance can be further enhanced through optimization techniques such as Genetic Algorithms (GA). This study introduces an intelligent GA-LDA framework designed to optimize word subsets for LDA, thereby improving its predictive capabilities. The proposed system is applied to a dataset of 928 abstracts from a Turkish-language academic journal specializing in accounting and finance, covering publications from 2005 to 2020. The genetic algorithm selects optimal word subsets for LDA analysis, with perplexity scores serving as the fitness function to guide the optimization process. Experimental results demonstrate that the GA-enhanced LDA significantly improves classification accuracy and topic modeling performance. This study not only underscores the potential of GA-LDA in handling unstructured text but also provides a robust tool for advancing automated content analysis in Turkish academic literature.
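The optimization loop described in the abstract, a genetic algorithm that searches over word subsets and scores each subset with a fitness value to be minimized, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the binary-mask encoding, tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values are choices made for this sketch and are not stated in the abstract. The `fitness` argument is a hypothetical callable; in the study's setting it would fit LDA on the vocabulary kept by the mask and return the perplexity, whereas `toy_score` below is a stand-in that does not run LDA at all.

```python
import random


def evolve_word_mask(vocab_size, fitness, pop_size=20, generations=30,
                     mutation_rate=0.05, seed=0):
    """Search for a binary word mask that minimises `fitness` (lower is better)."""
    rng = random.Random(seed)
    # Each individual is a list of 0/1 flags: 1 keeps the word, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(vocab_size)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)

    for _ in range(generations):
        def pick():
            # Tournament selection: the better of two random individuals.
            a, b = rng.sample(population, 2)
            return a if fitness(a) <= fitness(b) else b

        # Elitism: carry the current best mask forward unchanged.
        children = [best[:]]
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, vocab_size)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each flag flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)

        population = children
        best = min(population, key=fitness)
    return best


# Toy stand-in for perplexity (an assumption for illustration, NOT LDA):
# words 0, 2 and 5 are treated as informative, every other word as noise.
INFORMATIVE = {0, 2, 5}


def toy_score(mask):
    kept_noise = sum(g for i, g in enumerate(mask) if i not in INFORMATIVE)
    dropped_signal = sum(1 - mask[i] for i in INFORMATIVE)
    return kept_noise + 2 * dropped_signal


best_mask = evolve_word_mask(vocab_size=8, fitness=toy_score)
```

To move from the toy scorer toward the setting in the abstract, `fitness` would be replaced by a function that filters the document-term matrix by the mask, fits an LDA model, and returns its perplexity. Because each fitness call would then train a model, caching scores for masks that have already been evaluated is worth considering.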
License
Copyright (c) 2025 Mehmet Ozcalci, Meltem Kilic (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.