Multi-label Classification of Commit Messages using Transfer Learning

Muhammad Usman Sarwar, Sarim Zafar, Mohamed Wiem Mkaouer, Gursimran Singh Walia, Muhammad Zubair Malik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Commit messages are used in industry by developers to annotate changes made to the code. Accurate classification of these messages can help monitor the software evolution process and enable better tracking for various industrial stakeholders. In this paper, we present a state-of-the-art method for classifying commit messages into categories according to Swanson's maintenance activities, i.e., 'Corrective', 'Perfective', and 'Adaptive'. This is a challenging task because not all commit messages are well written and informative. Existing approaches rely on keyword-based techniques to solve this problem. However, these approaches are oblivious to the full language model and do not recognize the contextual relationships between words. The state-of-the-art methodology in Natural Language Processing (NLP) is to train a context-aware neural network (Transformer) on a very large data set that encompasses the entire language and then fine-tune it for a specific task. In this way, the model can learn the language, pay attention to the context, and then transfer that knowledge for better performance on the specific task. We use an off-the-shelf neural network called DistilBERT and fine-tune it for the commit message classification task. This step is non-trivial because programming languages and commit messages have unique keywords, jargon, and idioms. This paper presents our effort in training this model and constructing the data set for this task. We describe the rules used to construct the data set. We validate our approach on industrial projects from GitHub, such as Kubernetes, Linux, TensorFlow, Spark, TypeScript, and PyTorch. We were able to achieve an 87% F1-score for the commit message classification task, which is an order of magnitude more accurate than previous studies.

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE 31st International Symposium on Software Reliability Engineering Workshops, ISSREW 2020
Editors: Marco Vieira, Henrique Madeira, Nuno Antunes, Zheng Zheng
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 37-42
Number of pages: 6
ISBN (Electronic): 9781728198705
State: Published - Oct 2020
Externally published: Yes
Event: 31st IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2020 - Virtual, Coimbra, Portugal
Duration: Oct 12 2020 → Oct 15 2020

Publication series

Name: Proceedings - 2020 IEEE 31st International Symposium on Software Reliability Engineering Workshops, ISSREW 2020

Conference

Conference: 31st IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2020
Country/Territory: Portugal
City: Virtual, Coimbra
Period: 10/12/20 → 10/15/20

Keywords

  • commit message classification
  • software maintenance
  • software quality

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality
