GA-LDA Approach for Topic Modeling in Turkish Accounting and Finance Articles: Performance Optimization in Text Classification

Authors

  • Mehmet Ozcalci, Department of International Trade and Logistics, Faculty of Economics and Administrative Sciences, Kilis 7 Aralik University, Kilis, Turkey. ORCID: https://orcid.org/0000-0003-0384-6872
  • Meltem Kilic, Department of International Trade and Logistics, Faculty of Economics and Administrative Sciences, Kahramanmaras Sutcu Imam University, Kahramanmaras, Turkey. ORCID: https://orcid.org/0000-0001-8978-9076

DOI:

https://doi.org/10.31181/sor21202521

Keywords:

Topic modeling, Text mining, LDA, Genetic Algorithm

Abstract

The volume of research in the social sciences is expanding rapidly, creating significant challenges in extracting meaningful insights from unstructured text, particularly from articles lacking a classification system. Analyzing these high-volume texts offers numerous advantages, including the ability to automatically identify topic relevance and track thematic trends over time. Such insights are valuable for journal management and enable researchers to access detailed information about evolving areas of study. Latent Dirichlet Allocation (LDA) is a widely used method for topic modeling, effectively extracting topics from textual data. However, its performance can be further enhanced through optimization techniques such as Genetic Algorithms (GA). This study introduces an intelligent GA-LDA framework designed to optimize word subsets for LDA, thereby improving its predictive capabilities. The proposed system is applied to a dataset of 928 abstracts from a Turkish-language academic journal specializing in accounting and finance, covering publications from 2005 to 2020. The genetic algorithm selects optimal word subsets for LDA analysis, with perplexity scores serving as the fitness function to guide the optimization process. Experimental results demonstrate that the GA-enhanced LDA significantly improves classification accuracy and topic modeling performance. This study not only underscores the potential of GA-LDA in handling unstructured text but also provides a robust tool for advancing automated content analysis in Turkish academic literature.
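
The word-subset search described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. Each chromosome is a bitmask over the vocabulary, and a lower fitness value is better. To stay dependency-free, the fitness used here is a stand-in (held-out perplexity of an add-one-smoothed unigram model); an actual GA-LDA run would instead return the perplexity of an LDA model fitted on the masked vocabulary. The function names, the GA settings (tournament selection, one-point crossover, bit-flip mutation, elitism), the toy corpus, and the minimum-subset-size guard are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def unigram_perplexity(train_docs, test_docs, vocab):
    """Stand-in fitness: held-out perplexity of an add-one-smoothed unigram
    model restricted to `vocab`. Swap in LDA perplexity for a real run."""
    counts = Counter(w for d in train_docs for w in d if w in vocab)
    total, v = sum(counts.values()), len(vocab)
    log_prob, n = 0.0, 0
    for doc in test_docs:
        for w in doc:
            if w in vocab:
                log_prob += math.log((counts[w] + 1) / (total + v))
                n += 1
    return math.exp(-log_prob / n) if n else float("inf")


def ga_select_words(words, fitness, pop_size=20, generations=30,
                    crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a word subset that minimizes `fitness` (assumed settings)."""
    rng = random.Random(seed)
    n = len(words)

    def decode(mask):
        return {w for w, bit in zip(words, mask) if bit}

    def score(mask):
        return fitness(decode(mask))

    # Random initial population of bitmasks over the vocabulary.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = min(population, key=score)
    history = [score(best)]
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry the best individual forward
        while len(next_gen) < pop_size:
            # Binary tournament selection for each parent.
            p1 = min(rng.sample(population, 2), key=score)
            p2 = min(rng.sample(population, 2), key=score)
            child = p1[:]
            if n > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
        best = min(population, key=score)
        history.append(score(best))
    return decode(best), history[-1], history


# Hypothetical toy corpus (tokenized); not data from the study.
train_docs = [
    ["audit", "tax", "report", "audit", "risk"],
    ["bank", "credit", "risk", "bank", "loan"],
    ["tax", "audit", "report", "tax"],
    ["loan", "credit", "bank", "risk"],
]
test_docs = [
    ["audit", "tax", "report", "risk"],
    ["bank", "credit", "loan", "risk"],
]
words = sorted({w for d in train_docs for w in d})


def fitness(vocab, min_words=3):
    # Guard (an assumption): reject subsets that are too small to be useful.
    if len(vocab) < min_words:
        return float("inf")
    return unigram_perplexity(train_docs, test_docs, vocab)


subset, score, history = ga_select_words(words, fitness)
print(sorted(subset), round(score, 3))
```

Perplexity computed over different vocabularies tends to favor smaller ones, which is why the sketch rejects subsets below a minimum size; how the study itself handles this is not stated in the abstract.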

Published

2025-02-21

How to Cite

Ozcalci, M., & Kilic, M. (2025). GA-LDA Approach for Topic Modeling in Turkish Accounting and Finance Articles: Performance Optimization in Text Classification. Spectrum of Operational Research, 2(1), 305-322. https://doi.org/10.31181/sor21202521