Application of Natural Language Processing techniques for biomedical document classification

Antonio Ochotorena Laynez. (2020). Application of Natural Language Processing techniques for biomedical document classification. Trabajo Fin de Titulación (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:
In this project, we study the capacity of computers to classify documents dense with medical jargon as an expert in the field would. The technology era has brought several challenges to many different areas, and the biomedical field is one of the most affected. Finding ways to organise the data optimally can ease the physician's work and, as a final consequence, make it possible to attend to more patients. Even for a doctor, medical concepts can be complex enough to require some prior research before giving a proper diagnosis. To limit this complexity, we decided to focus on one particular topic: diseases. We chose the Ohsumed collection as the dataset because its structure is well suited to a biomedical classification task. We built the different models using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Firstly, we created a dataframe in which all the documents were sorted by class and tokenised. Secondly, to transform tokens into word vectors, we used three different word embedding techniques: TF-IDF, Word2vec and Simon. Finally, classifiers were trained on the resulting embeddings, and the quality of their predictions was evaluated. On average, the results across the classifiers suggested that some of the models were not assimilating medical relations effectively. Another drawback was that the documents were multi-labelled, meaning that a given text could carry three or even more different tags. We addressed this issue by transforming the dataset's identifiers into a binary array of length 23, one position per class. Multi-label classifiers, designed specifically for such cases, then performed the estimation. With this approach, we obtained the highest score attained in this project, an F-score of 68%, using a linear C-support vector classifier.
This result shows that the complexity of the biomedical field requires sophisticated feature extractors to map relations between concepts effectively. However, compared with other articles, our results seem comparable, and even slightly better if we consider that we use the whole dataset rather than a fraction of it.
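
The multi-label step described in the abstract, in which each document's tags become a binary array of length 23 and predictions are scored with an F-score, can be illustrated with a minimal sketch. This is not the project's code: the category names `C01` to `C23`, the helper names `to_multi_hot` and `micro_f1`, the example documents, and the choice of micro-averaging are all assumptions made for illustration (the abstract does not state which F-score averaging was used).

```python
# Assumption: the 23 disease categories are named C01 ... C23.
CATEGORIES = [f"C{i:02d}" for i in range(1, 24)]
INDEX = {name: pos for pos, name in enumerate(CATEGORIES)}


def to_multi_hot(labels):
    """Encode a collection of category tags as a 23-element list of 0/1."""
    vector = [0] * len(CATEGORIES)
    for label in labels:
        vector[INDEX[label]] = 1
    return vector


def micro_f1(y_true, y_pred):
    """Micro-averaged F-score over all (document, category) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical gold labels and predictions for two documents.
y_true = [to_multi_hot(["C04", "C12"]), to_multi_hot(["C23"])]
y_pred = [to_multi_hot(["C04"]), to_multi_hot(["C23", "C14"])]
score = micro_f1(y_true, y_pred)
```

In this toy case two tags are predicted correctly, one is missed and one is spurious. A multi-label classifier would be trained to output one such 23-element array per document, and the same scoring would then be applied to its predictions.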