Trivalent

  • GSI Crawler
  • Scrapers
  • Senpy Plugins
  • Pipelines
  • Playground
  • Annotation
  • Publications
  • Datasets
  • Downloads

GSI Crawler

GSI Crawler is a system developed by the GSI group that extracts, analyzes, enriches and stores information from online sources, and displays visualizations of the gathered information. These flows of information are called pipelines: each one starts with a scraper module, followed by analysis and enrichment modules that belong to the Senpy plugin community.

This dashboard shows some of the capabilities of GSI Crawler. 

 

Scrapers

Scrapers extract information from several web sources in the news or social media categories. The newspapers currently available are CNN News, The New York Times and AlJazeera. Information can also be extracted from PDF sources such as Dabiq magazine, which was the official Daesh propaganda magazine for years.

  • CNN News
  • The New York Times
  • AlJazeera
  • Dabiq magazine
  • Rumiyah magazine

CNN News


{
  "@type": "schema:NewsArticle",
  "@id": "https://www.cnn.com/2018/04/09/politics/syria-donald-trump/index.html",
  "schema:dateModified": "2018-04-09T23:59:23Z",
  "schema:articleBody": "A decision by President Donald Trump to use force in retaliation [...]",
  "schema:about": ["http://dbpedia.org/resource/Donald_Trump", "Biological and chemical weapons", "Syria conflict",
                   "Russia meddling investigation"],
  "schema:author": "cnn",
  "schema:headline": "What's at stake for Trump in Syria",
  "schema:search": "isis",
  "schema:thumbnailUrl": "https://cdn.cnn.com/cnnnext/dam/assets/180409184118-trump-military-briefing-syria-0409-story-body.jpg"
}

The New York Times


{
  "@type": "schema:NewsArticle",
  "@id": "https://www.nytimes.com/2018/02/28/world/middleeast/syrian-kurds-isis-american-offensive.html",
  "schema:datePublished": "2018-02-28T10:01:14+0000",
  "schema:dateModified": "2018-03-01T13:48:26Z",
  "schema:articleBody": "And we need them to finish this to finish this fight [...]",
  "schema:about": ["http://dbpedia.org/resource/Kurds", "http://dbpedia.org/resource/Ottoman_Empire", "http://dbpedia.org/resource/Syria", "http://dbpedia.org/resource/Islamic_State_of_Iraq_and_the_Levant", "Defense and Military Forces", "United States Defense and Military Forces", "United States International Relations", "Syrian Democratic Forces"],
  "schema:author": "http://dbpedia.org/page/The_New_York_Times",
  "schema:headline": "Amid Turkish Assault, Kurdish Forces Are Drawn Away From U.S. Fight With ISIS",
  "schema:search": "isis",
  "schema:thumbnailUrl": "https://www.nytimes.com/images/2018/02/28/us/28DC-military-alpha/28DC-military-alpha-articleLarge.jpg"
}

AlJazeera


{
  "@type": "schema:NewsArticle",
  "@id": "https://www.aljazeera.com/news/2018/04/putin-erdogan-rouhani-discuss-syrian-crisis-ankara-180403115527779.html",
  "schema:articleBody": "Turkey will host a trilateral meeting on the Syrian crisis between the [...]",
  "schema:author": "Umut Uras",
  "schema:headline": "Is there room for critical thinking in Islam?",
  "schema:search": "isis",
  "schema:thumbnailUrl": "https://www.aljazeera.com/mritems/Images/2017/11/22/4bb5ec00abca46db966d0c49a61e8689_18.jpg"
}

Dabiq magazine


{
  "@type": "schema:Article",
  "@id": "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/Dabiq14-In-the-Words-of-the-Enemy",
  "schema:articleBody": "On the first of Ramadan 1435H, therevival of the Khilafah was [...]",
  "schema:author": "http://dbpedia.org/page/Dabiq_(magazine)",
  "schema:headline": "Khilafah Declared"
}

Rumiyah magazine


{
  "@type": "schema:Article",
  "@id": "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/Rumiyah10-Military-and-Covert-Operations",
  "schema:articleBody": "As the soldiers of the Khilafah continue waging war on the [...]",
  "schema:author": "http://dbpedia.org/page/Rumiyah_(magazine)",
  "schema:headline": "Military and Covert Operations"
}
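
The sample records above share a small set of schema: properties, and schema:about mixes DBpedia URIs with free-text topics. The following is a minimal Python sketch of how a consumer might separate the two. It assumes the record is valid JSON with every URI written as a quoted string; the function name split_about and the trimmed sample record are illustrative and not part of GSI Crawler.

```python
import json

# Trimmed copy of the CNN News sample, with the DBpedia URI quoted so that
# the record parses as JSON (an assumption about the scraper's real output).
RECORD = """
{
  "@type": "schema:NewsArticle",
  "@id": "https://www.cnn.com/2018/04/09/politics/syria-donald-trump/index.html",
  "schema:about": ["http://dbpedia.org/resource/Donald_Trump",
                   "Biological and chemical weapons", "Syria conflict"],
  "schema:headline": "What's at stake for Trump in Syria"
}
"""


def split_about(record):
    """Split schema:about into (linked-data URIs, plain-text topics)."""
    uris, topics = [], []
    for item in record.get("schema:about", []):
        # Treat anything that looks like an HTTP(S) URL as an entity URI.
        if item.startswith(("http://", "https://")):
            uris.append(item)
        else:
            topics.append(item)
    return uris, topics


article = json.loads(RECORD)
entity_uris, topic_labels = split_about(article)
print(entity_uris)
print(topic_labels)
```

Records without a schema:about property, such as the AlJazeera and magazine samples, simply yield two empty lists.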

Senpy Plugins

Senpy plugins provide value-added services for data analysis tasks, and the Senpy architecture makes them easier to implement. Each plugin takes an input and produces a semantically annotated output that is useful for linked data processes. To learn more about Senpy, please see the Senpy documentation.

The following plugins are available in the Senpy Trivalent Playground.

 

Plugin    Workflow
Bing      Text translation tasks.
COGITO    Extracts people, places and organization entities from different sources.

Pipelines

As noted above, GSI Crawler extracts information from online and offline sources, enriches it following linked data principles, stores it and provides visualizations of the results. Because Senpy plugins are easy to integrate, customized pipelines can be built that enrich the data at each step and yield a richer analysis. Here are some examples:

 

  • Translator + Cogito Plugin
  • Translator + Cogito Plugin + ElasticSearch
  • PDF Scraper + Cogito Plugin + ElasticSearch
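
Each of these pipelines can be read as a chain of steps in which every step receives a record and returns an enriched copy. The following is a minimal Python sketch of that idea with stand-in steps. The names run_pipeline, translate and extract_entities and the example: keys are hypothetical; the real steps call the Bing and Cogito Senpy plugins rather than these placeholders.

```python
from functools import reduce


def translate(record):
    """Stand-in for a translation step: a real one would call the Bing plugin."""
    enriched = dict(record)
    # Placeholder "translation": upper-casing the text, purely for illustration.
    enriched["example:translation"] = record["schema:articleBody"].upper()
    return enriched


def extract_entities(record):
    """Stand-in for entity extraction: a real one would call the Cogito plugin."""
    enriched = dict(record)
    text = record["example:translation"]
    # Placeholder logic: every word longer than four characters is an "entity".
    enriched["example:entities"] = [w for w in text.split() if len(w) > 4]
    return enriched


def run_pipeline(record, steps):
    """Apply each step in order, passing the enriched record along."""
    return reduce(lambda current, step: step(current), steps, record)


result = run_pipeline(
    {"schema:articleBody": "turkey hosts a meeting"},
    [translate, extract_entities],
)
print(result["example:entities"])
```

Because each step copies the record before adding to it, a storage step (for example one that writes to ElasticSearch) could be appended to the list without changing the earlier steps.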

Translator + Cogito Plugin

This pipeline takes as input any tweet written in a source language (e.g. Arabic), translates it into a target language (e.g. English) and extracts information such as the people, places and organizations mentioned in it, following linked data annotation principles.

 

 

The following steps illustrate a use case of this pipeline in the Senpy Trivalent playground. The plugins can also be accessed programmatically, using the parameters described on their documentation pages.

  • The text to analyze is submitted to the Bing plugin.
  • A (shortened) response is obtained.
  • The Cogito plugin is then fed with the previous response.
  • The final output is obtained.
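
For programmatic access, a request to a plugin can be built as a plain HTTP GET URL. The sketch below is a hedged illustration only: the host comes from the Playground section of this page, while the /api/ path and the algo and i parameter names follow Senpy's usual HTTP API and are assumptions here. The plugin identifier "bing" is hypothetical, so check each plugin's documentation page for its real name and parameters.

```python
from urllib.parse import urlencode

# Host taken from the Playground section; the "/api/" path and the "algo" and
# "i" parameter names are assumptions based on Senpy's usual HTTP API.
PLAYGROUND_API = "http://trivalent-playground.gsi.upm.es/api/"


def build_analysis_url(plugin, text, **plugin_params):
    """Return the GET URL that asks `plugin` to analyze `text`."""
    query = {"algo": plugin, "i": text}
    query.update(plugin_params)  # plugin-specific options, if any
    return PLAYGROUND_API + "?" + urlencode(query)


# The plugin identifier is hypothetical; check the plugin's documentation.
step_one = build_analysis_url("bing", "text in the source language")
print(step_one)
```

The translated text in the first response would then become the text of a second call aimed at the Cogito plugin. Fetching the URL (for example with urllib.request.urlopen) is left out so that the sketch runs offline.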

 

 

 

Translator + Cogito Plugin + ElasticSearch

Besides extracting valuable data from raw text, GSI Crawler pipelines can persist the analyses they carry out so that they can be reused in later tasks.

 

 

 

Compared with the previous pipeline, this one adds an ElasticSearch layer, so each analysis is stored as a record.
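
As a rough illustration of the storage step, the sketch below serializes analyzed records into the newline-delimited body expected by ElasticSearch's bulk API. How GSI Crawler actually writes to ElasticSearch is not described on this page, so the function name to_bulk_payload, the index name and the use of @id as the document identifier are all assumptions.

```python
import json


def to_bulk_payload(records, index_name):
    """Serialize records as an ElasticSearch bulk-API body (NDJSON)."""
    lines = []
    for record in records:
        # Action line: index the document under its linked-data identifier.
        action = {"index": {"_index": index_name, "_id": record["@id"]}}
        lines.append(json.dumps(action))
        # Source line: the analyzed record itself.
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_payload(
    [{"@id": "http://example.org/tweet/1", "schema:articleBody": "sample text"}],
    index_name="trivalent-analyses",  # hypothetical index name
)
print(payload)
```

Using the record's @id as the document identifier means that analyzing the same source again overwrites the earlier record instead of duplicating it.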

 

PDF Scraper + Cogito Plugin + ElasticSearch

This pipeline extracts textual information from PDF files (for example, issues of Dabiq), enriches the data with the Cogito plugin and stores it in ElasticSearch.
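
The sketch below illustrates only the record-building part of such a scraper, starting from text that has already been extracted from a PDF (the extraction itself needs a PDF library and is out of scope here). The function name build_article is hypothetical, and the identifier pattern is inferred from the Rumiyah sample in the Scrapers section, so it is an assumption rather than the scraper's documented behavior.

```python
import re

# Base URI taken from the Dabiq and Rumiyah samples in the Scrapers section.
RESOURCE_BASE = "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/"


def build_article(magazine, issue, headline, raw_text):
    """Build a schema:Article record from text already extracted from a PDF."""
    # Collapse the line breaks and repeated spaces that PDF extraction leaves.
    body = re.sub(r"\s+", " ", raw_text).strip()
    # Identifier pattern inferred from the samples: <Magazine><issue>-<Title-Words>.
    slug = "-".join(re.findall(r"[A-Za-z0-9]+", headline))
    return {
        "@type": "schema:Article",
        "@id": f"{RESOURCE_BASE}{magazine}{issue}-{slug}",
        "schema:articleBody": body,
        "schema:headline": headline,
    }


article = build_article(
    "Rumiyah", 10, "Military and Covert Operations",
    "As the soldiers of the Khilafah\ncontinue   waging war on the [...]",
)
print(article["@id"])
```

The resulting record has the same shape as the magazine samples, minus schema:author, and could be passed on to the enrichment and storage steps.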

 

Playground

The Trivalent Playground is an NLP tool based on semantic data for processing radicalization data.


You can find the endpoint at:
http://trivalent-playground.gsi.upm.es/

Annotation

Corpus annotations can be made in the Trivalent Annotation Portal.

Publications

  • Tasio Méndez, J. Fernando Sánchez-Rada, Carlos A. Iglesias & Paul Cummings (2018). A Model of Radicalization Growth using Agent-based Social Simulation. In Proceedings of EMAS 2018. Stockholm, Sweden.
  • Oscar Araque, Marco Guerini, Carlo Strapparava & Carlos A. Iglesias (2017). Neural Domain Adaptation of Sentiment Lexicons. In Proceedings of ACII 2017. San Antonio, Texas, USA.

Datasets

Available soon.

Downloads

  • GSI Crawler
  • Senpy plugins

GSI Crawler

GitHub: gsi-crawler, Docker: gsi-crawler

Senpy plugins

GitHub: senpy, Docker: senpy

© Copyright 2021 by Intelligent Systems Group.
