TY - GEN
T1 - Cyber-guided Deep Neural Network for Malicious Repository Detection in GitHub
AU - Zhang, Yiming
AU - Fan, Yujie
AU - Hou, Shifu
AU - Ye, Yanfang
AU - Xiao, Xusheng
AU - Li, Pan
AU - Shi, Chuan
AU - Zhao, Liang
AU - Xu, Shouhuai
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/8
Y1 - 2020/8
N2 - As the largest source code repository, GitHub plays a vital role in the modern social coding ecosystem for producing software. Despite the apparent benefits of this social coding paradigm, its potential security risks have been largely overlooked (e.g., malicious code or repositories can be easily embedded and distributed). To address this pressing issue, we propose a novel framework, named GitCyber, as a first attempt to automate malicious repository detection in GitHub. In GitCyber, we first extract code contents from the repositories hosted in GitHub as the inputs to a deep neural network (DNN). We then incorporate cybersecurity domain knowledge, modeled by a heterogeneous information network (HIN), to design a cyber-guided loss function in the learning objective of the DNN, so as to ensure classification performance while preserving consistency with the observational domain knowledge. Comprehensive experiments on large-scale data collected from GitHub demonstrate that GitCyber outperforms state-of-the-art methods in malicious repository detection.
AB - As the largest source code repository, GitHub plays a vital role in the modern social coding ecosystem for producing software. Despite the apparent benefits of this social coding paradigm, its potential security risks have been largely overlooked (e.g., malicious code or repositories can be easily embedded and distributed). To address this pressing issue, we propose a novel framework, named GitCyber, as a first attempt to automate malicious repository detection in GitHub. In GitCyber, we first extract code contents from the repositories hosted in GitHub as the inputs to a deep neural network (DNN). We then incorporate cybersecurity domain knowledge, modeled by a heterogeneous information network (HIN), to design a cyber-guided loss function in the learning objective of the DNN, so as to ensure classification performance while preserving consistency with the observational domain knowledge. Comprehensive experiments on large-scale data collected from GitHub demonstrate that GitCyber outperforms state-of-the-art methods in malicious repository detection.
KW - Cyber-guided DNN
KW - Heterogeneous information network
KW - Malicious repository detection
UR - http://www.scopus.com/inward/record.url?scp=85092522582&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092522582&partnerID=8YFLogxK
U2 - 10.1109/ICBK50248.2020.00071
DO - 10.1109/ICBK50248.2020.00071
M3 - Conference contribution
AN - SCOPUS:85092522582
T3 - Proceedings - 11th IEEE International Conference on Knowledge Graph, ICKG 2020
SP - 458
EP - 465
BT - Proceedings - 11th IEEE International Conference on Knowledge Graph, ICKG 2020
A2 - Chen, Enhong
A2 - Antoniou, Grigoris
A2 - Wu, Xindong
A2 - Kumar, Vipin
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 11th IEEE International Conference on Knowledge Graph, ICKG 2020
Y2 - 9 August 2020 through 11 August 2020
ER -