Development of a Real Time Classification System of Twitter Trends based on Machine Learning Techniques

Daniel Mata-Nieves. (2018). Development of a Real Time Classification System of Twitter Trends based on Machine Learning Techniques. Final Degree Project (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación, Madrid.

Abstract:

In recent years social networks have experienced exponential growth. We are now in the information age, and this information must be analyzed carefully to extract and exploit its full potential. Among the most widely used social networks is Twitter, a micro-blogging service on which more than 500 million messages, called tweets, are sent every day to share common interests about a topic. The topics that are most relevant at a specific moment and become a trend are called trending topics.

The objective is to investigate the background of these social trends in order to analyze and understand why they occur. To this end, an automatic classifier is developed that identifies the category of a trend and informs the user in real time, so that the trend can be exploited to the fullest.

First, the Twitter API was monitored through Tweepy for one week, from December 1 to 7, 2017, retrieving the top 10 trending topics in Spain every 30 seconds. For each trending topic obtained, the most recent associated tweets were downloaded. The tweets were then categorized according to two typologies. The first has four categories: news, live events, commemorative days and memes. The second has six categories: sports, business, entertainment, health, politics and technology. The trending topics were then annotated manually, obtaining Cohen's kappa coefficients of 0.78 and 0.89 respectively; this coefficient measures the degree of agreement between annotators.

Afterwards, the extracted tweets were organized for preprocessing and feature extraction to feed the classifier. Next, classifiers were developed that look for structures or patterns in the data and build predictive models that can be optimized automatically. Finally, the different classifiers were compared in terms of performance. Several algorithms were used, with the best results obtained by Support Vector Machines (SVC), ExtraTreesClassifier (ETC) and Multinomial Naive Bayes (MNB).

For the first categorization, the best result, 0.91, was obtained through cross-validation with ETC. Compared with the best performance reported by other authors, 0.81 with SVC, this is a remarkable relative improvement of 12.35%. For the second categorization, the best result was 0.92 with SVC, MNB and ETC, while the best result achieved by other authors was 0.78, which represents a relative improvement of 17.95%.
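
The two figures the abstract leans on, Cohen's kappa and the relative improvement over earlier work, can be reproduced with a few lines of arithmetic. The following is a minimal sketch in plain Python, not the code used in the project: the function names `cohen_kappa` and `relative_improvement` and the toy annotations are hypothetical, chosen only for illustration. The improvement is assumed to be measured relative to the baseline score, which matches the percentages quoted above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # estimated from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined
        # here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def relative_improvement(new_score, baseline_score):
    """Improvement of new_score over baseline_score, as a percentage of the baseline."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical toy annotations using two labels from the first typology.
annotator_1 = ["news", "news", "memes", "memes"]
annotator_2 = ["news", "news", "memes", "news"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Scores quoted in the abstract: 0.91 vs 0.81 and 0.92 vs 0.78.
first_typology = round(relative_improvement(0.91, 0.81), 2)   # 12.35
second_typology = round(relative_improvement(0.92, 0.78), 2)  # 17.95
```

On the toy annotations the annotators agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. The 0.78 and 0.89 reported in the abstract therefore indicate agreement well above chance.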

BibTeX:

@mastersthesis{development-gsi-mastersthesis-20182,
author = "Mata-Nieves, Daniel",
abstract = "In recent years social networks have experienced exponential growth. We are now in the information age, and this information must be analyzed carefully to extract and exploit its full potential. Among the most widely used social networks is Twitter, a micro-blogging service on which more than 500 million messages, called tweets, are sent every day to share common interests about a topic. The topics that are most relevant at a specific moment and become a trend are called trending topics.

The objective is to investigate the background of these social trends in order to analyze and understand why they occur. To this end, an automatic classifier is developed that identifies the category of a trend and informs the user in real time, so that the trend can be exploited to the fullest.

First, the Twitter API was monitored through Tweepy for one week, from December 1 to 7, 2017, retrieving the top 10 trending topics in Spain every 30 seconds. For each trending topic obtained, the most recent associated tweets were downloaded. The tweets were then categorized according to two typologies. The first has four categories: news, live events, commemorative days and memes. The second has six categories: sports, business, entertainment, health, politics and technology. The trending topics were then annotated manually, obtaining Cohen's kappa coefficients of 0.78 and 0.89 respectively; this coefficient measures the degree of agreement between annotators.

Afterwards, the extracted tweets were organized for preprocessing and feature extraction to feed the classifier. Next, classifiers were developed that look for structures or patterns in the data and build predictive models that can be optimized automatically. Finally, the different classifiers were compared in terms of performance. Several algorithms were used, with the best results obtained by Support Vector Machines (SVC), ExtraTreesClassifier (ETC) and Multinomial Naive Bayes (MNB).

For the first categorization, the best result, 0.91, was obtained through cross-validation with ETC. Compared with the best performance reported by other authors, 0.81 with SVC, this is a remarkable relative improvement of 12.35%. For the second categorization, the best result was 0.92 with SVC, MNB and ETC, while the best result achieved by other authors was 0.78, which represents a relative improvement of 17.95%.",
address = "ETSI Telecomunicaci{\'o}n, Madrid",
institution = "Universidad Polit{\'e}cnica de Madrid",
keywords = "Twitter;trending topics;python;scikit-learn;sefarad;Senpy",
month = "January",
title = "{D}evelopment of a {R}eal {T}ime {C}lassification {S}ystem of {T}witter {T}rends based on {M}achine {L}earning {T}echniques",
type = "TFG",
year = "2018",
}