Automated Linking Data with Apache Stanbol

Speakers

Abstract

This talk will introduce the Apache Stanbol project and show how it can be integrated into traditional Enterprise Content Management solutions.

Stanbol is an Open Source project under incubation at the Apache Software Foundation. Its goal is to provide Web and CMS developers with a set of HTTP / RESTful services to help them integrate semantic technologies into their products and web sites.

The following Stanbol services are currently under active development:

  • Enhancement engines: use Natural Language Processing tools such as Apache OpenNLP to extract knowledge (topics, named entities, facts) from unstructured content and link it to unambiguous URIs from reference knowledge bases;

  • Entity Hub: a Linked Data indexing cache built on top of Apache Solr, Clerezza and Jena that comes with precomputed indexes and live connectors to popular knowledge bases such as DBpedia, GeoNames and YAGO;

  • Content Hub: a faceted search engine based on Solr to search for content using the knowledge automatically extracted by the enhancement engines;

  • CMS bridges: lift the structured content of document repositories using the JCR and CMIS (Content Management Interoperability Services) access protocols (via Apache Chemistry) and store the result in a triple store suitable for SPARQL access;

  • Rules engine based on Apache Jena for knowledge refactoring (e.g. convert extracted knowledge into the rich snippet vocabulary for SEO), integrity checks, merging rules, deductive inference...
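
As a rough illustration of how a CMS might call the enhancement engines over HTTP, here is a minimal Python sketch. It assumes a Stanbol instance running locally with its enhancer exposed at http://localhost:8080/enhancer. The host, port, path, response format and the helper names `build_enhancer_request` and `enhance` are assumptions made for this example and are not taken from the talk.

```python
import urllib.request

# Assumption: a Stanbol server running locally; the host, port and path
# below are illustrative and not stated in the talk abstract.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, base_url=STANBOL_ENHANCER_URL,
                           accept="application/rdf+xml"):
    """Build (but do not send) an HTTP POST carrying plain text to enhance."""
    return urllib.request.Request(
        base_url,
        data=text.encode("utf-8"),
        headers={
            # The unstructured content is sent as plain text.
            "Content-Type": "text/plain; charset=utf-8",
            # Assumed: the extracted knowledge comes back as RDF.
            "Accept": accept,
        },
        method="POST",
    )


def enhance(text, **kwargs):
    """Send the request and return the response body as a string."""
    request = build_enhancer_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server running, `enhance("Paris is the capital of France.")` would be expected to return an RDF document describing the detected entities and their candidate URIs. Building the request separately from sending it keeps the sketch easy to inspect without a live server.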

Automatically extracting and post-processing structured knowledge from semi-structured content is a key step towards better interoperability, a better understanding of user intent, and smarter applications. Apache Stanbol aims to make it as easy as possible to achieve that goal.

Attachments

Apache Stanbol overview slides (Stanbol_Overview_2012-04.pdf)