Authors: Ryabko B.Y., Gus’kov A.E., Selivanova I.V.
Title: Information-Theoretic Method for Classification of Texts
Bibliographic reference: Ryabko B.Y., Gus’kov A.E., Selivanova I.V. Information-Theoretic Method for Classification of Texts // Problems of Information Transmission. - 2017. - Vol. 53. - Iss. 3. - P. 294-304. - ISSN 0032-9460. - EISSN 1608-3253. - https://link.springer.com/article/10.1134/S0032946017030115
External systems: DOI: 10.1134/S0032946017030115; RSCI (РИНЦ): 31143076; SCOPUS: 2-s2.0-85031754667; WoS: 000412936700011
Abstract: We consider a method for automatic (i.e., unmanned) text classification based on methods of universal source coding (or “data compression”). We show that under certain restrictions the proposed method is consistent, i.e., the classification error tends to zero with increasing text lengths. As an example of practical use of the method we consider the classification problem for scientific texts (research papers, books, etc.). The proposed method is experimentally shown to be highly efficient. © 2017, Pleiades Publishing, Inc.
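Illustration (not part of the record): the general idea of classifying text by data compression can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: zlib stands in for a universal code, and the names `compressed_size`, `classify`, and `SAMPLE_CORPORA`, as well as the sample texts, are hypothetical. A text is assigned to the class whose training corpus lets it be compressed most cheaply, i.e. the class for which appending the text adds the fewest compressed bytes.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: dict) -> str:
    """Return the label whose corpus makes `text` cheapest to compress.

    The cost for a class is the extra compressed size incurred by
    appending `text` to that class's corpus.
    """
    encoded_text = text.encode("utf-8")
    best_label, best_cost = None, None
    for label, corpus in corpora.items():
        encoded_corpus = corpus.encode("utf-8")
        cost = (compressed_size(encoded_corpus + encoded_text)
                - compressed_size(encoded_corpus))
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Hypothetical toy corpora, invented for this illustration only.
SAMPLE_CORPORA = {
    "physics": ("quantum field theory describes particles and their "
                "interactions through gauge symmetry. ") * 5,
    "biology": ("the cell membrane regulates transport of proteins and "
                "lipids in living organisms. ") * 5,
}

if __name__ == "__main__":
    query = "quantum field theory describes particles and their interactions"
    print(classify(query, SAMPLE_CORPORA))
```

zlib only exploits repetitions inside a 32 KB window, so this sketch is suitable for short toy inputs; the consistency result stated in the abstract concerns universal codes and does not carry over to this simplification.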
Keywords: Text processing; Universal source coding; Text length; Text classification; Scientific texts; Research papers; Practical use; Information-theoretic methods; Classification errors; Information theory; Classification (of information)
Published: 2017
Physical description: pp. 294-304
Link: https://link.springer.com/article/10.1134/S0032946017030115
Cited references:
1. Thapar, N., Using Compression for Source-Based Classification of Text, Master Thesis, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, USA, 2001.
2. Kukushkina, O.V., Polikarpov, A.A., and Khmelev, D.V., Using Literal and Grammatical Statistics for Authorship Attribution, Probl. Peredachi Inf., 2001, vol. 37, no. 2, pp. 96–109 [Probl. Inf. Trans. (Engl. Transl.), 2001, vol. 37, no. 2, pp. 172–184].
3. Khmelev, D.V., Complexity Approach to Disputed Authorship Attribution, in Russian Language: Contemporaneity and Fates in History (Proc. Int. Congress of Russian Language Researchers, Moscow, Mar. 13–16, 2001), pp. 426–427.
4. Cilibrasi, R. and Vitányi, P.M.B., Clustering by Compression, IEEE Trans. Inform. Theory, 2005, vol. 51, no. 4, pp. 1523–1545.
5. Cilibrasi, R., Vitányi, P., and de Wolf, R., Algorithmic Clustering of Music Based on String Compression, Computer Music J., 2004, vol. 28, no. 4, pp. 49–67.
6. Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P.M.B., The Similarity Metric, IEEE Trans. Inform. Theory, 2004, vol. 50, no. 12, pp. 3250–3264.
7. Ryabko, B., Astola, J., and Malyutov, M., Compression-Based Methods of Statistical Analysis and Prediction of Time Series, New York: Springer, 2016.
8. Teahan, W.J. and Harper, D.J., Using Compression-Based Language Models for Text Categorization, Language Modeling for Information Retrieval, Croft, W.B. and Lafferty, J., Eds., Dordrecht: Kluwer, 2003, pp. 141–165.
9. Cover, T.M. and Thomas, J.A., Elements of Information Theory, New York: Wiley, 1991.
10. Győrfi, L., Morvai, G., and Yakowitz, S.J., Limits to Consistent On-line Forecasting for Ergodic Time Series, IEEE Trans. Inform. Theory, 1998, vol. 44, no. 2, pp. 886–892.