RAINBOW homepage

Web made by Vojtech Svatek

Project description

The project is currently supported by grant no. 201/03/1318 of the Czech Science Foundation (CSF), "Intelligent analysis of web content and structure" (2003-2005).

The goal of the project is to develop a flexible architecture for knowledge-based analysis of the WWW. The Rainbow system employs web service and semantic web technologies to analyse the content and structure of legacy websites and to present the results to a user or computer agent. The analysis of a website is multiway (see paper Sv03d), and the results of the individual analyses are integrated. A distinctive feature of the analysis services is their systematic categorisation along four dimensions: abstract type of task (classification, retrieval, extraction), type of ‘current’ object (e.g. document, hyperlink, image), type of analysed data (e.g. free text, HTML tags, link topology, image data), and problem domain (e.g. bicycle sales). This four-dimensional approach is captured by the so-called ‘task-object-datatype-domain’ (TODD) knowledge-level framework (see paper Sv04c) and by an associated collection of ontologies (see paper La03).
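The four TODD dimensions can be pictured as a descriptor attached to each analysis service. The following is a minimal Python sketch, not the Rainbow ontologies themselves: the class and function names, the "any" wildcard domain, the "URL string" data type and the placement of the three sample services along the dimensions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """Abstract task types named in the TODD framework."""
    CLASSIFICATION = "classification"
    RETRIEVAL = "retrieval"
    EXTRACTION = "extraction"


@dataclass(frozen=True)
class ServiceDescriptor:
    """Position of one analysis service along the four TODD dimensions."""
    name: str
    task: Task
    object_type: str  # e.g. "document", "hyperlink", "image"
    data_type: str    # e.g. "free text", "HTML tags", "link topology"
    domain: str       # e.g. "bicycle sales"; "any" is a hypothetical wildcard


def find_services(registry, task=None, object_type=None, data_type=None, domain=None):
    """Return the descriptors that match every dimension that was specified."""
    wanted = {
        "task": task,
        "object_type": object_type,
        "data_type": data_type,
        "domain": domain,
    }
    return [
        s for s in registry
        if all(v is None or getattr(s, k) == v for k, v in wanted.items())
    ]


# Illustrative registry; the dimension values are guesses, not project data.
REGISTRY = [
    ServiceDescriptor("url_classifier", Task.CLASSIFICATION, "document", "URL string", "any"),
    ServiceDescriptor("hmm_extractor", Task.EXTRACTION, "document", "free text", "bicycle sales"),
    ServiceDescriptor("histogram_analyser", Task.CLASSIFICATION, "image", "image data", "any"),
]
```

A call such as `find_services(REGISTRY, task=Task.CLASSIFICATION)` would then select the services whose task dimension is classification, regardless of the other three dimensions.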

As of summer 2005, the Rainbow system comprises the following analysis services:

  • Information extraction (IE) tool based on Hidden Markov Models trained from pre-classified data (see paper La04b)
  • Link topology analyser for discovery of website navigation structure, based on graph theory (see papers Vo03 and Vo04)
  • Several image analysers, operating on colour histograms (see paper Va02), bitmap layout similarity (see paper La05b) and image dimensions; see paper La05c for an integrated view
  • Rule-based classifier of URL strings
  • Free-text analyser (sentence extractor) based on a collection of ‘important information indicators’ bootstrapped from web directory data (see paper Ka02)
  • Extractor of (selected) META tag content.
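To give a flavour of one of the simpler services listed above, a rule-based classifier of URL strings might look roughly like the sketch below. The rules, category labels and function name are hypothetical and do not reproduce the rule set actually used in Rainbow.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules: (regular expression applied to the URL path, category label).
# Rules are tried in order and the first match wins.
RULES = [
    (re.compile(r"/(contact|kontakt)", re.I), "contact"),
    (re.compile(r"/(product|catalog|shop)", re.I), "catalogue"),
    (re.compile(r"/(about|company)", re.I), "about"),
    (re.compile(r"^/?$"), "homepage"),
]


def classify_url(url, default="unknown"):
    """Return the label of the first rule whose pattern matches the URL path."""
    path = urlparse(url).path
    for pattern, label in RULES:
        if pattern.search(path):
            return label
    return default
```

Ordering the rules from specific to general keeps the behaviour predictable when several patterns could apply to the same URL.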

We also plan to integrate third-party tools.

In addition, the infrastructure includes:

  • Website downloader
  • Multiple pre-processing tools
  • Source data repository
  • Procedural control application, which calls individual services and collects results
  • Repository for extracted results.
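The role of the procedural control application, calling individual services and collecting their results, can be sketched as a simple loop. This is an illustration under stated assumptions rather than the actual control code: services are modelled as plain Python callables instead of web services, and the names `run_analysis`, `count_links` and `failing` are invented for the example.

```python
def run_analysis(resources, services):
    """Call every service on every resource and collect the results.

    `resources` is an iterable of (url, content) pairs, standing in for the
    source data repository. `services` maps a service name to a callable that
    takes (url, content) and returns a list of result items.
    """
    results = []
    for url, content in resources:
        for name, service in services.items():
            try:
                items = service(url, content)
            except Exception as exc:
                # One failing service should not stop the whole run.
                results.append({"url": url, "service": name, "error": str(exc)})
                continue
            for item in items:
                results.append({"url": url, "service": name, "result": item})
    return results


# Toy services, purely for demonstration.
def count_links(url, content):
    """Count opening anchor tags in the page content."""
    return [content.count("<a ")]


def failing(url, content):
    """Simulate a service that cannot handle the resource."""
    raise ValueError("not applicable")
```

The flat list of records returned here plays the part of the result repository; in the real system the collected facts are stored as RDF.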

The source data repository is provided by the full-text & native XML database tool AmphorA, developed by the Amphora Research Group, TU Ostrava (partner in the CSF project). Thanks to sophisticated XML and text indexing, it enables fast XML querying as well as text retrieval (see paper Kr05a).

As the result repository, we use Sesame (by Aduna/Aidministrator, NL), drawing on the expertise of the Knowledge Representation and Reasoning Group at the Vrije Universiteit Amsterdam. The stored RDF facts are retrieved using the SeRQL query language (see the Applications section and paper Sv04b).
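The idea of storing analysis results as RDF facts and retrieving them by pattern can be illustrated without Sesame itself. The sketch below uses plain Python tuples as triples; it is not the Sesame API and not SeRQL syntax, and the `ex:` property names and sample facts are hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every non-None part of the pattern."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Hypothetical facts, as different analysis services might produce them.
FACTS = [
    ("http://example.com/", "ex:pageType", "homepage"),
    ("http://example.com/shop/", "ex:pageType", "catalogue"),
    ("http://example.com/shop/", "ex:mentionsProduct", "bicycle"),
]
```

For instance, `match(FACTS, predicate="ex:pageType")` returns the two page-type facts, and adding `obj="catalogue"` narrows the result to the shop page. A query language such as SeRQL expresses the same kind of pattern declaratively against the repository.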