TY - GEN
T1 - Multi-label Classification of Commit Messages using Transfer Learning
AU - Sarwar, Muhammad Usman
AU - Zafar, Sarim
AU - Mkaouer, Mohamed Wiem
AU - Walia, Gursimran Singh
AU - Malik, Muhammad Zubair
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/10
Y1 - 2020/10
AB - Commit messages are used in the industry by developers to annotate changes made to the code. Accurate classification of these messages can help monitor the software evolution process and enable better tracking for various industrial stakeholders. In this paper, we present a state-of-the-art method for classifying commit messages into categories as per Swanson's maintenance activities, i.e., 'Corrective', 'Perfective', and 'Adaptive'. This is a challenging task because not all commit messages are well written and informative. Existing approaches rely on keyword-based techniques to solve this problem. However, these approaches are oblivious to the full language model and do not recognize the contextual relationship between words. The state-of-the-art methodology in Natural Language Processing (NLP) is to train a context-aware neural network (Transformer) on a very large data set that encompasses the entire language and then fine-tune it for a specific task. In this way, the model can learn the language, pay attention to the context, and then transfer that knowledge for better performance at the specific task. We use an off-the-shelf neural network called DistilBERT and fine-tune it for the commit message classification task. This step is non-trivial because programming languages and commit messages have unique keywords, jargon, and idioms. This paper presents our effort in training this model and constructing the data set for this task. We describe the rules used to construct the data set. We validate our approach on industrial projects from GitHub, such as Kubernetes, Linux, TensorFlow, Spark, TypeScript, and PyTorch. We were able to achieve an 87% F1-score for the commit message classification task, which is an order of magnitude more accurate than previous studies.
KW - commit message classification
KW - software maintenance
KW - software quality
UR - http://www.scopus.com/inward/record.url?scp=85099867731&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099867731&partnerID=8YFLogxK
U2 - 10.1109/ISSREW51248.2020.00034
DO - 10.1109/ISSREW51248.2020.00034
M3 - Conference contribution
AN - SCOPUS:85099867731
T3 - Proceedings - 2020 IEEE 31st International Symposium on Software Reliability Engineering Workshops, ISSREW 2020
SP - 37
EP - 42
BT - Proceedings - 2020 IEEE 31st International Symposium on Software Reliability Engineering Workshops, ISSREW 2020
A2 - Vieira, Marco
A2 - Madeira, Henrique
A2 - Antunes, Nuno
A2 - Zheng, Zheng
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2020
Y2 - 12 October 2020 through 15 October 2020
ER -