Language Detection System for Documents Using N-Grams (Sistem Deteksi Bahasa pada Dokumen menggunakan N-Gram)


Badrus Zaman
Eva Hariyanti
Endah Purwanti


Language detection over a very large collection of documents can improve the performance of an information retrieval system. One popular method for language detection is the N-gram approach, which is based on sequences of n characters taken from a string. This research developed an N-gram-based language detection system that distinguishes Indonesian from English. The work was carried out in three phases: creating a profile for each language, system testing, and system evaluation. Fifty documents were used to create the language profiles: 25 Indonesian and 25 English. Sixty documents were used for system testing. System performance was evaluated using the F-measure. The tests yielded F-measures of 0.933, 0.917, and 0.933 for unigrams, bigrams, and trigrams, respectively.
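
The three phases described above (profile creation, testing, and evaluation) can be illustrated with a minimal Python sketch. The abstract does not state how a document is compared against a language profile, so the rank-based "out-of-place" distance used here is an assumption, as are the function names, the `top_k` cutoff of 300, and the toy training sentences. The F-measure is computed as the balanced F1 score for one class, which may differ from the averaging used in the study.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(documents, n, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent).

    top_k=300 is an illustrative assumption, not a value from the paper.
    """
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile.

    N-grams missing from the language profile receive a fixed maximum
    penalty (assumed scoring rule, not taken from the paper).
    """
    max_penalty = len(lang_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def detect_language(text, profiles, n, top_k=300):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = build_profile([text], n, top_k)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


def f_measure(true_labels, predicted_labels, positive):
    """Balanced F-measure (F1) for the class named by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy training data invented for illustration; the study used
    # 25 Indonesian and 25 English documents.
    training = {
        "id": ["saya sedang membaca buku di perpustakaan"],
        "en": ["i am reading a book in the library"],
    }
    n = 2  # bigrams; the study also evaluated other n-gram sizes
    profiles = {lang: build_profile(docs, n) for lang, docs in training.items()}
    print(detect_language("dia membaca buku di rumah", profiles, n))
```

With realistic training sets, the same `build_profile` call would be repeated for each n-gram size, and `f_measure` would be applied to the predictions for the test documents.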


How to Cite
Zaman, B., Hariyanti, E., & Purwanti, E. (2015). Sistem Deteksi Bahasa pada Dokumen menggunakan N-Gram. MULTINETICS, 1(2), 21–26.

