A Machine Learning Approach for Moderating Toxic Hinglish Comments of YouTube Videos

Akash Singh; Kumar Vaibhav; Mamta Arora

doi:10.1007/978-981-99-7817-5_14

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With the increasing digitization of our world comes a new challenge: toxic comments on online platforms. While these comment sections were initially intended for meaningful discussion, they are now plagued by spam, trolls, and offensive messages. Researchers have explored various automatic detection methods using deep learning models and feature extraction techniques on various datasets, ranging from English Wikipedia to Roman Urdu social media comments, to address this problem. While these approaches have achieved impressive results, they still face limitations, such as misspelled offensive words and obfuscation. These systems often struggle in regions with multilingual societies. The objective of this study is to develop a moderation system to filter comments on YouTube in Hinglish, a hybrid language that combines Hindi and English. The proposed system employs the Text Vectorization technique to screen out toxic comments written in Hinglish, utilizing a self-curated dataset specifically tailored for this language. The developed system is capable of effectively classifying and automatically deleting toxic comments from a YouTube video. This study outlines several challenges and open problems in this area, providing insights and a useful roadmap for future work. Although the developed system may misclassify a few comments due to the limited size of the dataset, it has the potential to enhance the user experience for Hinglish-speaking users on YouTube.
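The abstract names text vectorization as the step that prepares Hinglish comments for classification but does not describe it further. As a rough illustration only, and not the authors' implementation, the sketch below shows what integer text vectorization generally involves: normalizing a comment, mapping each token to a vocabulary id, and padding to a fixed length. The function names, the reserved ids 0 (padding) and 1 (out-of-vocabulary), the sequence length, and the sample comments are all assumptions made for this example.

```python
import re
from collections import Counter

# Reserved ids: an assumed convention for this sketch, not taken from the paper.
PAD, OOV = 0, 1


def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace.

    Assumes Romanized (Latin-script) Hinglish text.
    """
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(comments, max_tokens=1000):
    """Map the most frequent tokens to integer ids starting at 2."""
    counts = Counter(tok for c in comments for tok in tokenize(c))
    # Sort by descending count, then alphabetically, so ids are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 2 for i, (tok, _) in enumerate(ranked[: max_tokens - 2])}


def vectorize(text, vocab, seq_len=6):
    """Turn one comment into a fixed-length list of token ids."""
    ids = [vocab.get(tok, OOV) for tok in tokenize(text)][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))


# Hypothetical, harmless sample comments used only to build a tiny vocabulary.
comments = ["bahut accha video hai", "yeh video bekar hai!"]
vocab = build_vocab(comments)

print(vectorize("Yeh video accha hai", vocab))  # [7, 3, 4, 2, 0, 0]
print(vectorize("naya video", vocab))           # [1, 3, 0, 0, 0, 0]
```

In a full system, sequences like these would feed a trained classifier; unseen or deliberately misspelled words all collapse to the out-of-vocabulary id, which is one way to see why obfuscated offensive words are hard for such systems.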

Original language	English (US)
Title of host publication	Data Science and Applications - Proceedings of ICDSA 2023
Editors	Satyasai Jagannath Nanda, Rajendra Prasad Yadav, Amir H. Gandomi, Mukesh Saraswat
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	173-187
Number of pages	15
ISBN (Print)	9789819978168
DOIs	https://doi.org/10.1007/978-981-99-7817-5_14
State	Published - 2024
Externally published	Yes
Event	4th International Conference on Data Science and Applications, ICDSA 2023 - Jaipur, India
Duration	Jul 14 2023 → Jul 15 2023

Publication series

Name	Lecture Notes in Networks and Systems
Volume	820
ISSN (Print)	2367-3370
ISSN (Electronic)	2367-3389

Conference

Conference	4th International Conference on Data Science and Applications, ICDSA 2023
Country/Territory	India
City	Jaipur
Period	7/14/23 → 7/15/23

Keywords

Hinglish language
Natural language processing
Sequential model
YouTube

ASJC Scopus subject areas

Control and Systems Engineering
Signal Processing
Computer Networks and Communications

Cite this

Singh, A., Vaibhav, K., & Arora, M. (2024). A Machine Learning Approach for Moderating Toxic Hinglish Comments of YouTube Videos. In S. J. Nanda, R. P. Yadav, A. H. Gandomi, & M. Saraswat (Eds.), Data Science and Applications - Proceedings of ICDSA 2023 (pp. 173-187). (Lecture Notes in Networks and Systems; Vol. 820). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-99-7817-5_14

@inproceedings{995b1bf8e324480da8aab36a313e520d,
title = "A Machine Learning Approach for Moderating Toxic Hinglish Comments of YouTube Videos",
keywords = "Hinglish language, Natural language processing, Sequential model, YouTube",
author = "Akash Singh and Kumar Vaibhav and Mamta Arora",
note = "Publisher Copyright: {\textcopyright} 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.; 4th International Conference on Data Science and Applications, ICDSA 2023 ; Conference date: 14-07-2023 Through 15-07-2023",
year = "2024",
doi = "10.1007/978-981-99-7817-5_14",
language = "English (US)",
isbn = "9789819978168",
series = "Lecture Notes in Networks and Systems",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "173--187",
editor = "Nanda, {Satyasai Jagannath} and Yadav, {Rajendra Prasad} and Gandomi, {Amir H.} and Mukesh Saraswat",
booktitle = "Data Science and Applications - Proceedings of ICDSA 2023",
address = "Germany",
}

TY - GEN
T1 - A Machine Learning Approach for Moderating Toxic Hinglish Comments of YouTube Videos
AU - Singh, Akash
AU - Vaibhav, Kumar
AU - Arora, Mamta
PY - 2024
Y1 - 2024
KW - Hinglish language
KW - Natural language processing
KW - Sequential model
KW - YouTube
UR - http://www.scopus.com/inward/record.url?scp=85184284323&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184284323&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-7817-5_14
DO - 10.1007/978-981-99-7817-5_14
M3 - Conference contribution
AN - SCOPUS:85184284323
SN - 9789819978168
T3 - Lecture Notes in Networks and Systems
SP - 173
EP - 187
BT - Data Science and Applications - Proceedings of ICDSA 2023
A2 - Nanda, Satyasai Jagannath
A2 - Yadav, Rajendra Prasad
A2 - Gandomi, Amir H.
A2 - Saraswat, Mukesh
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th International Conference on Data Science and Applications, ICDSA 2023
Y2 - 14 July 2023 through 15 July 2023
ER -
