Design and Development of a Causal Inference Machine Learning System based on Textual Data

Daniel Vera Nieto. (2021). Design and Development of a Causal Inference Machine Learning System based on Textual Data. Master's Thesis (TFM). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:
The machine learning community is increasingly aware of the limitations of the current paradigm of artificial intelligence, which relies heavily on pattern recognition and in which deep learning approaches have come to dominate most machine learning tasks over the last decade. Researchers are therefore increasingly interested in learning about cause and effect and in incorporating this causal knowledge into the models that support decision-making today. Natural Language Processing (NLP) is one of the machine learning fields that is becoming increasingly aware of the importance of causality for model interpretability and for building scalable and robust models. However, because work at the intersection of causality and NLP is still in its infancy, there is a lack of tools for carrying out causal inference studies that incorporate innovative technologies for processing and analyzing large amounts of textual data. The main objective of this project is to integrate technology as an enabling tool for finding causal relations in the presence of text. This will help speed up causal knowledge extraction in domains where text is predominant. To this end, we implemented state-of-the-art models to infer causal effects in the presence of textual data. In addition, we developed several linguistic feature extraction modules to facilitate the study of the effects of linguistic properties. These modules have proven useful for sentiment and emotion analysis. In fact, the work on the emotion analysis module culminated in the design and development of a transformer-based model that we submitted to the EmoEvalEs competition, held as part of the IberLEF 2021 conference. We achieved first place in the EmoEvalEs competition, which led us to publish the paper "GSI-UPM at IberLEF2021: Emotion Analysis of Spanish Tweets by Fine-tuning the XLM-RoBERTa Language Model" in the conference proceedings.
Finally, we evaluated our system in different use cases to study the effect of sentiment, emotion, and other linguistic properties in social media and on platforms where collaboratively generated text is predominant.