Publication - Design and development of a System for Detecting Cyberbullying in Twitter based on Machine Learning Techniques

Jaime Palos. (2018). Design and development of a System for Detecting Cyberbullying in Twitter based on Machine Learning Techniques. Final Career Project (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación, Madrid.

Abstract:

The Internet has completely changed our society: the way we work, the way we learn and, most radically, the way we communicate. Today we live in a world in which face-to-face conversations have been replaced by 140-character statements. Despite the great benefits that these changes imply, this project focuses on one of their main disadvantages: the isolation caused by the impunity and anonymity that come with this new way of communicating.

According to the Global Youth Online Behavior Survey conducted by Microsoft in 2012, one in every three young Spaniards was already suffering cyberbullying [37]. In 2016, Spain became one of the top countries where children were suffering cyberbullying, especially 13-year-olds, according to a report from the World Health Organization (WHO) that warns of the great risk of depression and suicide as a consequence of cyberbullying. Cyberbullying is defined as the use of electronic communication to bully a person or group of people, typically by sending messages of an intimidating or threatening nature or by disclosing confidential or false information.

For all these reasons, cyberbullying is nowadays one of the main problems among the new generations, to the point that, in June 2017, the Instituto Nacional de Ciberseguridad (INCIBE) launched a direct helpline to solve problems regarding the use of the Internet. This free and confidential service is aimed at children and teenagers worried about any aspect of the Internet. The goal of this project is to put an end to this social problem in a quick, automatic and efficient way, without having to wait for a child to find the courage to report their stalker or bully.

This project will study this kind of behavior by focusing on offensive, aggressive or hostile language that could be characterized as cyberbullying, with Twitter as the chosen source of information. The study will focus mainly on the detailed analysis of sexually predatory behavior since, according to the Internet Safety 101 organization, one out of every seven children receives a sexual invitation over the Internet [1]. The analysis is based on the words, insults and grammatical structures used by cyberbullies on social networks. The main idea is to develop a classification system able to predict whether a tweet is offensive or hostile, contains sexual connotations or fits a sexual predator’s profile, and can therefore be classified as cyberbullying. To accomplish this, the chosen programming language is Python, using machine learning tools such as Scikit-learn with supervised machine learning techniques and Natural Language Processing (NLP) tools. To test and evaluate the system, different models will be used and their results analyzed in order to choose the most accurate one. A dataset of tweets in English has been chosen as the source of data, together with data from Chatcoder and the dataset provided in the PAN12 competition. Finally, the system will be deployed as a service by creating a plugin for the Senpy platform, which supports this kind of implementation.
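
The kind of supervised text classifier described in the abstract can be illustrated with a minimal Scikit-learn sketch. The abstract does not say which features or models the thesis actually uses, so the bag-of-words features, the multinomial Naive Bayes model, the `build_classifier` helper and the six toy training sentences below are all assumptions made for illustration; none of them comes from the thesis or its datasets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data (NOT taken from the thesis datasets).
# Label 1 = hostile/offensive, label 0 = neutral.
train_texts = [
    "you are stupid and ugly",
    "nobody likes you loser",
    "shut up you idiot",
    "see you at the game tonight",
    "thanks for the great lesson",
    "happy birthday to you",
]
train_labels = [1, 1, 1, 0, 0, 0]


def build_classifier():
    """Hypothetical helper: word counts fed into a multinomial Naive Bayes model."""
    return make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())


# Supervised learning: fit the pipeline on labelled examples.
classifier = build_classifier()
classifier.fit(train_texts, train_labels)

# Classify previously unseen short messages.
new_tweets = ["stupid loser idiot", "great game tonight"]
predictions = classifier.predict(new_tweets)

for tweet, label in zip(new_tweets, predictions):
    print(f"{label}  {tweet}")
```

A real system would train on a large labelled corpus and compare several models on held-out data, as the abstract describes. The pipeline structure makes it easy to swap the vectorizer or the classifier when doing that comparison.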

BibTeX:

@mastersthesis{design-gsi-mastersthesis-20181,
author = "Palos, Jaime",
abstract = "The Internet has completely changed our society: the way we work, the way we learn and, most radically, the way we communicate. Today we live in a world in which face-to-face conversations have been replaced by 140-character statements. Despite the great benefits that these changes imply, this project focuses on one of their main disadvantages: the isolation caused by the impunity and anonymity that come with this new way of communicating.

According to the Global Youth Online Behavior Survey conducted by Microsoft in 2012, one in every three young Spaniards was already suffering cyberbullying [37]. In 2016, Spain became one of the top countries where children were suffering cyberbullying, especially 13-year-olds, according to a report from the World Health Organization (WHO) that warns of the great risk of depression and suicide as a consequence of cyberbullying. Cyberbullying is defined as the use of electronic communication to bully a person or group of people, typically by sending messages of an intimidating or threatening nature or by disclosing confidential or false information.

For all these reasons, cyberbullying is nowadays one of the main problems among the new generations, to the point that, in June 2017, the Instituto Nacional de Ciberseguridad (INCIBE) launched a direct helpline to solve problems regarding the use of the Internet. This free and confidential service is aimed at children and teenagers worried about any aspect of the Internet. The goal of this project is to put an end to this social problem in a quick, automatic and efficient way, without having to wait for a child to find the courage to report their stalker or bully.

This project will study this kind of behavior by focusing on offensive, aggressive or hostile language that could be characterized as cyberbullying, with Twitter as the chosen source of information. The study will focus mainly on the detailed analysis of sexually predatory behavior since, according to the Internet Safety 101 organization, one out of every seven children receives a sexual invitation over the Internet [1]. The analysis is based on the words, insults and grammatical structures used by cyberbullies on social networks. The main idea is to develop a classification system able to predict whether a tweet is offensive or hostile, contains sexual connotations or fits a sexual predator’s profile, and can therefore be classified as cyberbullying. To accomplish this, the chosen programming language is Python, using machine learning tools such as Scikit-learn with supervised machine learning techniques and Natural Language Processing (NLP) tools. To test and evaluate the system, different models will be used and their results analyzed in order to choose the most accurate one. A dataset of tweets in English has been chosen as the source of data, together with data from Chatcoder and the dataset provided in the PAN12 competition. Finally, the system will be deployed as a service by creating a plugin for the Senpy platform, which supports this kind of implementation.",
address = "ETSI Telecomunicaci{\'o}n, Madrid",
institution = "Universidad Polit{\'e}cnica de Madrid",
keywords = "machine learning;python;twitter;Senpy;sexual predator;cyberbullying",
month = "January",
title = "{D}esign and development of a {S}ystem for {D}etecting {C}yberbullying in {T}witter based on {M}achine {L}earning {T}echniques",
type = "TFG",
year = "2018",
}