Information Extraction service

The IE service extracts structured data from web pages. This demo is restricted to the domain of bicycle product advertisements, from which it extracts structured information such as bicycle name, make, price, picture, color, size, and year, as well as a number of components a bike may have (about 12 attributes in total). A GUI for this web service is available at http://eso.vse.cz/~labsky/cgi-bin/client

Input

action (xsd:string)

This can be one of extract, annotate, or display. Setting action to extract returns a list of extracted instances, annotate returns an annotated page, and display returns only the original (unannotated) page.

url (xsd:string)

The absolute URL of the page to be processed. Sample documents can be addressed with the dedicated test:// scheme, e.g. test://h0001.html.

model (xsd:string)

This is the HMM to be used for annotation. Currently, only bikes/all_naive/trn_all_0 is enabled.

format (xsd:string)

Set this to xml to obtain an XML list of instances as output. If not specified, a complete annotated page is returned; if action was extract, an instance table appears at the top of that page.
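As an illustration of how the four input parameters fit together, the following Python sketch assembles a SOAP 1.1 request body for the Annotate operation. The namespace urn:Annot and the parameter names come from the WSDL listed on this page; the function name build_annotate_envelope, omitting format when it is not given, and sending the parameters without explicit xsi:type attributes are assumptions made for the example, not documented behavior of the service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"
SERVICE_NS = "urn:Annot"  # targetNamespace from the WSDL

VALID_ACTIONS = {"extract", "annotate", "display"}
DEFAULT_MODEL = "bikes/all_naive/trn_all_0"  # the only model currently enabled

ET.register_namespace("soap", SOAP_ENV)


def build_annotate_envelope(action, url, model=DEFAULT_MODEL, fmt=None):
    """Return a SOAP request for the Annotate operation as UTF-8 bytes.

    Hypothetical helper: parameters are sent untyped, and `format`
    is left out entirely when `fmt` is None (both are assumptions).
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be one of {sorted(VALID_ACTIONS)}")

    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    envelope.set(f"{{{SOAP_ENV}}}encodingStyle", SOAP_ENC)
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}Annotate")

    # Parameters in the order declared by the doAnnotate message.
    params = [("action", action), ("url", url), ("model", model)]
    if fmt is not None:
        params.append(("format", fmt))
    for name, value in params:
        ET.SubElement(call, name).text = value

    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
```

For example, build_annotate_envelope("extract", "test://h0001.html", fmt="xml") produces a request asking for an XML list of instances from the first sample document.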

Output

return (xsd:base64Binary)

This will contain the Base64-encoded output of the extractor. This is either an annotated HTML page or an XML list of instances.
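A client has to locate the return part in the SOAP response and Base64-decode it. The sketch below shows one way to do that in Python. The name of the wrapper element around return is not specified on this page, so the hypothetical helper extract_payload simply searches for the first element whose local name is return; it also leaves the decoded bytes uninterpreted, since the character encoding of the page is not documented here.

```python
import base64
import xml.etree.ElementTree as ET


def extract_payload(response_xml: bytes) -> bytes:
    """Return the decoded bytes of the first <return> element.

    Hypothetical helper: the element is matched by local name only,
    ignoring whatever namespace or wrapper the server uses.
    """
    root = ET.fromstring(response_xml)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "return":
            return base64.b64decode(elem.text or "")
    raise ValueError("no <return> element found in SOAP response")
```

The result is either an annotated HTML page or an XML list of instances, depending on the format parameter of the request.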

Description

The extractor is based on a Hidden Markov Model (HMM) trained on 90 web pages in which the desired product attributes were manually labeled (these 90 documents are scattered through the first 100 sample documents). The HMM produces a document with annotated attributes. To extract structured instances, we then group attributes that belong together (e.g. a bike name, its price, and its picture) using a simple sequential algorithm. See details.
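The grouping algorithm itself is not described on this page. Purely as an illustration of what a sequential grouping pass could look like, the Python sketch below walks the annotated attributes in document order and starts a new instance whenever an attribute that the current instance already holds appears again. This heuristic, the function name group_attributes, and the (name, value) input format are all assumptions for the example and are not the service's actual method.

```python
def group_attributes(labeled):
    """Group (attribute_name, value) pairs, given in document order,
    into instances represented as dicts.

    Assumed heuristic: a repeated attribute name signals the start
    of the next instance.
    """
    instances = []
    current = {}
    for name, value in labeled:
        if name in current:
            # Attribute already filled: close the current instance.
            instances.append(current)
            current = {}
        current[name] = value
    if current:
        instances.append(current)
    return instances
```

Given a name, price, and picture followed by a second name and price, this yields two instances, the first with three attributes and the second with two.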

WSDL

WSDL: http://rainbow.vse.cz/services/IEService

A nicely formatted WSDL listing will be provided in the future. For now, here is the plain listing.

<?xml version="1.0"?>
<definitions name="Annotate"
             targetNamespace="urn:Annot"
             xmlns:typens="urn:Annot"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
             xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
             xmlns="http://schemas.xmlsoap.org/wsdl/">

  <!-- Messages for IE API -->
  <message name="doAnnotate">
    <part name="action"         type="xsd:string" />
    <part name="url"            type="xsd:string" />
    <part name="model"          type="xsd:string" />
    <part name="format"         type="xsd:string" />
  </message>
  <message name="doAnnotateResponse">
    <part name="return"         type="xsd:base64Binary" />
  </message>

  <!-- Port for IE Web APIs, "Annotate" -->
  <portType name="AnnotatePort">
    <operation name="Annotate">
      <input message="typens:doAnnotate" />
      <output message="typens:doAnnotateResponse" />
    </operation>
  </portType>

  <!-- Binding for IE APIs - RPC, SOAP over HTTP -->
  <binding name="AnnotateBinding" type="typens:AnnotatePort">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http" />
    <operation name="Annotate">
      <soap:operation soapAction="urn:Annot#Annotate" />
      <input>
        <soap:body use="encoded" namespace="urn:Annot" encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" />
      </input>
      <output>
        <soap:body use="encoded" namespace="urn:Annot" encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" />
      </output>
    </operation>
  </binding>

  <!-- Endpoint for IE Web APIs -->
  <service name="AnnotateService">
    <port name="AnnotatePort" binding="typens:AnnotateBinding">
      <soap:address location="http://eso.vse.cz/~labsky/cgi-bin/annotserver_cgi.pl" />
    </port>
  </service>

</definitions>
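The binding above implies a plain HTTP POST with a SOAPAction header. As a minimal sketch, the following Python code prepares such a request with the standard library; the endpoint address and the soapAction value are taken from the WSDL, while the helper name make_soap_request and the text/xml content type (the usual choice for SOAP 1.1) are assumptions of the example.

```python
import urllib.request

ENDPOINT = "http://eso.vse.cz/~labsky/cgi-bin/annotserver_cgi.pl"  # soap:address
SOAP_ACTION = "urn:Annot#Annotate"  # soap:operation soapAction


def make_soap_request(envelope: bytes, endpoint: str = ENDPOINT):
    """Wrap a serialized SOAP envelope in an HTTP POST request object.

    Hypothetical helper: it only builds the request and does not
    contact the server.
    """
    return urllib.request.Request(
        endpoint,
        data=envelope,
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            # SOAP 1.1 expects the action URI as a quoted string.
            "SOAPAction": f'"{SOAP_ACTION}"',
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; the response body is then a SOAP envelope carrying the return part.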