Designing a PDF Malware Detection System Using Machine Learning

Authors

  • Salman Abdul Jabbaar Wiharja, Universitas Pendidikan Indonesia
  • Deden Pradeka, Universitas Pendidikan Indonesia
  • Wirmanto Suteddy, Universitas Pendidikan Indonesia

DOI:

https://doi.org/10.32722/pt.v23i1.6540

Abstract

This research proposes an approach to building a malicious PDF detection system using the Random Forest algorithm, focusing on the Evasive-PDFMal2022 dataset, which is updated and extended with new data. The extended dataset includes malicious PDF files from CVE and Exploit-DB, non-malicious PDF files, and files from private collections and the Technically-oriented PDF Collection. Features were extracted using the PDFiD tool, yielding 29 structural features that served as input to the Random Forest classifier. Experiments showed that the model trained on the new dataset achieved accuracy equivalent to that of the Evasive-PDFMal2022 model, at 98%, albeit with a small decrease in recall for the benign class. In addition, this research involved the creation of a website for metadata extraction and malicious PDF detection. Recognition goes to the dataset contributors, tool developers, and dataset providers from NIST and Exploit-DB. Overall, this research increased the representation and diversity of the dataset, produced good model training results, extended detection from 3 malicious PDF variants to 13, and created a practical tool for PDF metadata extraction and malicious PDF detection. Nonetheless, further development may be required to improve detection performance in more complex scenarios.
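
The structural-feature step described above can be illustrated with a minimal sketch. It counts a handful of PDF keywords in the raw bytes of a file, which is the general idea behind keyword-based tools such as PDFiD. The keyword list here is a hypothetical subset chosen for illustration; the abstract does not list the 29 features actually used, and this is not the authors' or PDFiD's implementation. The sketch also does not handle obfuscated names or compressed object streams.

```python
import re

# Hypothetical subset of structural keywords (assumption: the abstract
# does not enumerate the 29 features used in the paper).
KEYWORDS = [
    b"obj", b"endobj", b"stream", b"endstream", b"/Page",
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
    b"/Launch", b"/EmbeddedFile",
]


def _pattern(keyword: bytes) -> "re.Pattern":
    # Bare keywords (e.g. "obj") must not be preceded by a letter,
    # so that "obj" is not counted inside "endobj".
    prefix = b"" if keyword.startswith(b"/") else rb"(?<![A-Za-z])"
    # No keyword may be followed by a letter or digit,
    # so that "/Page" is not counted inside "/Pages".
    return re.compile(prefix + re.escape(keyword) + rb"(?![A-Za-z0-9])")


def count_keywords(data: bytes) -> dict:
    """Return a mapping of keyword -> number of occurrences in data."""
    return {
        kw.decode("ascii"): len(_pattern(kw).findall(data))
        for kw in KEYWORDS
    }


# A tiny hand-written PDF fragment used only to exercise the counter.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)

features = count_keywords(sample)
print(features)
```

A count vector of this kind, one per file, is the sort of input a tree-based classifier such as Random Forest could be trained on. In this fragment the presence of `/OpenAction` together with `/JS` and `/JavaScript` is the kind of signal such features are meant to expose.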

Published

2024-02-22

How to Cite

Wiharja, S. A. J., Pradeka, D., & Suteddy, W. (2024). Designing a PDF Malware Detection System Using Machine Learning. Jurnal Poli-Teknologi, 23(1), 40–54. https://doi.org/10.32722/pt.v23i1.6540

Section

Articles