Authors: Barakhnin V.B., Kozhemyakina O.Yu., Mukhamediev R.I., Borzilova Yu.S., Yakunin K.O.
Title: The design of the structure of the software system for processing text document corpus
Bibliographic reference: Barakhnin V.B., Kozhemyakina O.Yu., Mukhamediev R.I., Borzilova Yu.S., Yakunin K.O. The design of the structure of the software system for processing text document corpus // Business Informatics. - 2019. - Vol. 13. - Iss. 4. - P. 60-72. - ISSN 2587-814X. - EISSN 2587-8158.
External systems: DOI: 10.17323/1998-0663.2019.4.60.72; RSCI: 42353712; Scopus: 2-s2.0-85096699051; WoS: 000514221200006
Abstract: One of the most difficult tasks in data mining is the development of universal tools for analyzing texts written in the literary and business styles. A popular approach to developing algorithms for processing text document corpora is to use machine learning methods that solve NLP (natural language processing) tasks. The research is motivated by two factors: the specific structure of literary and business style texts, which requires the formation of separate datasets and, in the case of machine learning methods, additional feature selection; and the lack of complete systems for mass processing of Russian-language text documents available to the scientific community (in the commercial environment there are some smaller-scale systems that solve highly specialized tasks, for example, determining the sentiment of a text). The aim of the current study is to design and further develop the structure of a text document corpus processing system. The design took into account the requirements for large-scale systems: modularity, scalability of components, and conditional independence of components. The designed system is a set of components, each of which is built and run as a Docker container. The system has three levels: data processing, data storage, and visualization and management of the processing results. At the data processing level, text documents (for example, news items) are collected (scraped) and then processed by an ensemble of machine learning methods, each of which is implemented in the system as a separate Airflow task. The results are stored in a relational database; ElasticSearch is used to speed up search over the data (more than 1 million units). Statistics produced by the algorithms are visualized using the Plotly plugin. Administration and viewing of the processed texts are available through a web interface built with the Django framework. The overall interaction of components follows the ETL (extract, transform, load) principle. Currently the system is used to analyze a corpus of news texts in order to identify information of a destructive nature. In the future, we expect to improve the system and to publish its components in an open GitHub repository for access by the scientific community.
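The ETL flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, an in-memory SQLite database stands in for the relational store, and two toy text features stand in for the machine learning methods that the system runs as separate Airflow tasks. All function names, the table layout, and the sample documents are hypothetical.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

# Extract: hypothetical in-memory documents standing in for scraped news items.
def extract() -> List[Dict[str, str]]:
    return [
        {"id": "1", "text": "Markets rallied after the announcement."},
        {"id": "2", "text": "Storm damage closed several roads."},
    ]

# Each processing step is an independent function, mirroring the idea of one
# task per method. These toy features are placeholders, not the authors' models.
def word_count(text: str) -> float:
    return float(len(text.split()))

def avg_word_length(text: str) -> float:
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

TASKS: Dict[str, Callable[[str], float]] = {
    "word_count": word_count,
    "avg_word_length": avg_word_length,
}

# Transform: apply every task to every document and collect result rows.
def transform(docs: List[Dict[str, str]]) -> List[Tuple[str, str, float]]:
    rows = []
    for doc in docs:
        for name, task in TASKS.items():
            rows.append((doc["id"], name, task(doc["text"])))
    return rows

# Load: store the rows in a relational table (assumed schema).
def load(conn: sqlite3.Connection, rows: List[Tuple[str, str, float]]) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (doc_id TEXT, task TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(conn: sqlite3.Connection) -> int:
    """Run extract -> transform -> load and return the number of stored rows."""
    load(conn, transform(extract()))
    return conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection))
```

Keeping each task as a separate function with the same signature is what makes it straightforward to move the steps into independently scheduled tasks later; the sketch runs them in a simple loop only to stay self-contained.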
Keywords: natural language processing; streaming word processing; text analysis information system; development of a text corpus processing system
Published: 2019
Physical description: pp. 60-72