TY - GEN
T1 - A Machine Learning Approach for Moderating Toxic Hinglish Comments of YouTube Videos
AU - Singh, Akash
AU - Vaibhav, Kumar
AU - Arora, Mamta
N1 - Publisher Copyright: © 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2024
Y1 - 2024
AB - With the increasing digitization of our world comes a new challenge: toxic comments on online platforms. Comment sections were originally intended for meaningful discussion, but they are now plagued by spam, trolls, and offensive messages. To address this problem, researchers have explored automatic detection methods that apply deep learning models and feature extraction techniques to datasets ranging from English Wikipedia to Roman Urdu social media comments. While these approaches have achieved impressive results, they still struggle with misspelled and obfuscated offensive words, and they often struggle in multilingual societies. The objective of this study is to develop a moderation system that filters YouTube comments written in Hinglish, a hybrid language that combines Hindi and English. The proposed system uses a text vectorization technique to screen out toxic Hinglish comments and relies on a self-curated dataset tailored to this language. The developed system can classify toxic comments and automatically delete them from a YouTube video. This study also outlines several challenges and open problems in this area, providing insights and a useful roadmap for future work. Although the developed system may misclassify a few comments because of the limited size of the dataset, it has the potential to enhance the user experience for Hinglish-speaking users on YouTube.
KW - Hinglish language
KW - Natural language processing
KW - Sequential model
KW - YouTube
UR - http://www.scopus.com/inward/record.url?scp=85184284323&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184284323&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-7817-5_14
DO - 10.1007/978-981-99-7817-5_14
M3 - Conference contribution
AN - SCOPUS:85184284323
SN - 9789819978168
T3 - Lecture Notes in Networks and Systems
SP - 173
EP - 187
BT - Data Science and Applications - Proceedings of ICDSA 2023
A2 - Nanda, Satyasai Jagannath
A2 - Yadav, Rajendra Prasad
A2 - Gandomi, Amir H.
A2 - Saraswat, Mukesh
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th International Conference on Data Science and Applications, ICDSA 2023
Y2 - 14 July 2023 through 15 July 2023
ER -