GSI Crawler
GSI Crawler is a system developed by the GSI group that extracts, analyzes, enriches and stores information from online sources, and displays visualizations of the gathered information. These flows of information are called pipelines: they start with scraper modules, followed by analysis and enrichment modules that belong to the Senpy plugin community.
This dashboard shows some of the capabilities of GSI Crawler.
Scrapers
The scrapers extract information from several web sources in the news or social media categories. Currently, the available newspapers are CNN News, The New York Times and AlJazeera. It is also possible to extract information from PDF sources such as Dabiq magazine, which was the official Daesh propaganda magazine for years.
CNN News
{
"@type": "schema:NewsArticle",
"@id": "https://www.cnn.com/2018/04/09/politics/syria-donald-trump/index.html",
"schema:dateModified": "2018-04-09T23:59:23Z",
"schema:articleBody": "A decision by President Donald Trump to use force in retaliation [...]",
"schema:about": ["http://dbpedia.org/resource/Donald_Trump", "Biological and chemical weapons", "Syria conflict",
"Russia meddling investigation"],
"schema:author": "cnn",
"schema:headline": "What's at stake for Trump in Syria",
"schema:search": "isis",
"schema:thumbnailUrl": "https://cdn.cnn.com/cnnnext/dam/assets/180409184118-trump-military-briefing-syria-0409-story-body.jpg"
}
The New York Times
{
"@type": "schema:NewsArticle",
"@id": "https://www.nytimes.com/2018/02/28/world/middleeast/syrian-kurds-isis-american-offensive.html",
"schema:datePublished": "2018-02-28T10:01:14+0000",
"schema:dateModified": "2018-03-01T13:48:26Z",
"schema:articleBody": "And we need them to finish this to finish this fight [...]",
"schema:about": ["http://dbpedia.org/resource/Kurds", "http://dbpedia.org/resource/Ottoman_Empire", "http://dbpedia.org/resource/Syria", "http://dbpedia.org/resource/Islamic_State_of_Iraq_and_the_Levant", "Defense and Military Forces", "United States Defense and Military Forces", "United States International Relations", "Syrian Democratic Forces"],
"schema:author": "http://dbpedia.org/page/The_New_York_Times",
"schema:headline": "Amid Turkish Assault, Kurdish Forces Are Drawn Away From U.S. Fight With ISIS",
"schema:search": "isis",
"schema:thumbnailUrl": "https://www.nytimes.com/images/2018/02/28/us/28DC-military-alpha/28DC-military-alpha-articleLarge.jpg"
}
AlJazeera
{
"@type": "schema:NewsArticle",
"@id": "https://www.aljazeera.com/news/2018/04/putin-erdogan-rouhani-discuss-syrian-crisis-ankara-180403115527779.html",
"schema:articleBody": "Turkey will host a trilateral meeting on the Syrian crisis between the [...]",
"schema:author": "Umut Uras",
"schema:headline": "Is there room for critical thinking in Islam?",
"schema:search": "isis",
"schema:thumbnailUrl": "https://www.aljazeera.com/mritems/Images/2017/11/22/4bb5ec00abca46db966d0c49a61e8689_18.jpg"
}
Dabiq magazine
{
"@type": "schema:Article",
"@id": "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/Dabiq14-In-the-Words-of-the-Enemy",
"schema:articleBody": "On the first of Ramadan 1435H, the revival of the Khilafah was [...]",
"schema:author": "http://dbpedia.org/page/Dabiq_(magazine)",
"schema:headline": "Khilafah Declared"
}
Rumiyah magazine
{
"@type": "schema:Article",
"@id": "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/Rumiyah10-Military-and-Covert-Operations",
"schema:articleBody": "As the soldiers of the Khilafah continue waging war on the [...]",
"schema:author": "http://dbpedia.org/page/Rumiyah_(magazine)",
"schema:headline": "Military and Covert Operations"
}
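Since every scraper emits records with the same schema.org keys, downstream code can handle them uniformly. The following is a minimal sketch in Python, assuming a record is available as a JSON string with the author IRI written as a quoted string so that it parses. The helper name `summarize_record` is hypothetical and not part of GSI Crawler.

```python
import json

# Rumiyah record based on the example above (author IRI quoted for valid JSON).
RECORD = """
{
  "@type": "schema:Article",
  "@id": "http://dashboard-trivalent.cluster.gsi.dit.upm.es/resources/Rumiyah10-Military-and-Covert-Operations",
  "schema:articleBody": "As the soldiers of the Khilafah continue waging war on the [...]",
  "schema:author": "http://dbpedia.org/page/Rumiyah_(magazine)",
  "schema:headline": "Military and Covert Operations"
}
"""


def summarize_record(raw):
    """Hypothetical helper: pick the fields shared by the scraper outputs."""
    doc = json.loads(raw)
    return {
        "id": doc["@id"],
        "type": doc["@type"],
        "headline": doc.get("schema:headline", ""),
        # Optional keys such as schema:about are absent for some sources.
        "topics": doc.get("schema:about", []),
    }


summary = summarize_record(RECORD)
print(summary["headline"])
```

Using `dict.get` with a default keeps the helper tolerant of sources that omit optional keys such as `schema:about` or `schema:datePublished`.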
Senpy Plugins
Senpy plugins provide value-added services for data analysis tasks, and the Senpy architecture makes them easy to implement. Each plugin takes an input and produces a semantically annotated output that is useful for linked data processes. To learn more about Senpy, please visit the Senpy documentation.
The following plugins are available at the Senpy Trivalent Playground.
Plugin | Workflow | Links
Bing | Text translation tasks. |
COGITO | Extracts people, places and organization entities from different sources. |
Pipelines
As described above, GSI Crawler extracts information from online and offline sources, enriches it following linked data principles, stores it and provides visualizations of the results. Thanks to the easy integration of Senpy plugins, it is possible to create customized pipelines that enrich the data at each step, resulting in richer analyses. Here are some examples:
- Translator + Cogito Plugin
- Translator + Cogito Plugin + ElasticSearch
- PDF Scraper + Cogito Plugin + ElasticSearch
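Conceptually, each pipeline is a chain of steps in which every step receives a record and returns an enriched copy. The sketch below illustrates that idea only: the step functions are hypothetical stand-ins, not the actual Senpy plugins or GSI Crawler code.

```python
from functools import reduce


def run_pipeline(record, steps):
    """Apply each step in order, feeding it the previous step's output."""
    return reduce(lambda doc, step: step(doc), steps, record)


# Stand-in steps (hypothetical): a real pipeline would call the translator
# and the Cogito plugin instead of these placeholders.
def fake_translate(doc):
    return {**doc, "translated": doc["text"].upper()}


def fake_extract_entities(doc):
    words = doc["translated"].split()
    return {**doc, "entities": [w for w in words if len(w) > 4]}


result = run_pipeline(
    {"text": "hello from madrid"},
    [fake_translate, fake_extract_entities],
)
print(result["entities"])
```

Each step returns a new dictionary instead of mutating its input, so intermediate results remain available for inspection or storage.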
Translator + Cogito Plugin
This pipeline takes as input any tweet written in a source language (e.g. Arabic), translates it into a target language (e.g. English) and extracts information such as the people, places and organizations mentioned in it, following linked data annotation principles.
The following images illustrate a use case of this pipeline using the Senpy Trivalent Playground. These plugins can also be accessed programmatically using the parameters described on their documentation pages.
- We want to analyze the following text with the Bing plugin
- The following (shortened) response is obtained
- Then, we feed the Cogito plugin with the previous response
- Obtaining the following output
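For programmatic access, a request URL could be built as in the sketch below. It assumes that the playground exposes the usual Senpy HTTP API under `/api/`, with the `i` parameter carrying the input text and `algo` selecting the plugin. The plugin identifier `translator-plugin` is a placeholder; the real names are listed on the plugins' documentation pages.

```python
from urllib.parse import urlencode

# Assumption: the playground serves the standard Senpy API under /api/.
BASE_URL = "http://trivalent-playground.gsi.upm.es/api/"


def build_senpy_url(text, algorithm, base_url=BASE_URL):
    """Build a GET URL using Senpy's `i` (input) and `algo` (plugin) parameters."""
    query = urlencode({"i": text, "algo": algorithm})
    return f"{base_url}?{query}"


# "translator-plugin" is a placeholder, not a confirmed plugin identifier.
url = build_senpy_url("Hola mundo", "translator-plugin")
print(url)
```

The resulting URL can then be fetched with any HTTP client (for instance `urllib.request.urlopen`) and the JSON-LD response parsed with `json.loads` before passing it to the next plugin.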
Translator + Cogito Plugin + ElasticSearch
Apart from extracting valuable data from raw text, GSI Crawler pipelines can persist the analyses they carry out so that they can be reused in later tasks.
This pipeline adds an ElasticSearch layer to the previous one, resulting in the following record.
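As an illustration of the persistence step, the sketch below prepares (without sending) an HTTP request that would store an enriched record through the ElasticSearch document API. The host, the index name `gsicrawler` and the choice of the record's `@id` as document identifier are assumptions made for this example, not GSI Crawler's actual configuration.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_index_request(doc, index="gsicrawler", host="http://localhost:9200"):
    """Prepare (but do not send) a PUT request that stores `doc` under its @id."""
    # The @id is a URL, so it is percent-encoded to fit in the request path.
    doc_id = quote(doc["@id"], safe="")
    return Request(
        url=f"{host}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    {"@id": "http://example.org/tweet/1", "schema:articleBody": "..."}
)
print(request.get_method(), request.full_url)
```

Reusing the `@id` as the document identifier means that indexing the same record twice overwrites the earlier copy instead of creating a duplicate.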
Playground
Trivalent Playground is an NLP tool based on semantic data for the treatment of radicalization data.
You can find the endpoint at: http://trivalent-playground.gsi.upm.es/
Annotation
Corpus annotations can be done in the Trivalent Annotation Portal.
Publications
- Tasio Méndez, J. Fernando Sánchez-Rada, Carlos A. Iglesias & Paul Cummings (2018). A Model of Radicalization Growth using Agent-based Social Simulation. In Proceedings of EMAS 2018. Stockholm, Sweden.
- Oscar Araque, Marco Guerini, Carlo Strapparava & Carlos A. Iglesias (2017). Neural Domain Adaptation of Sentiment Lexicons. In Proceedings of ACII 2017. San Antonio, Texas, USA.