UNIVERSITY OF TWENTE

About INEX Entity Search

What is entity search?

Entity ranking is the task of matching a text query against a predefined set of entities that are usually mentioned in text documents and/or described explicitly. According to the corresponding Wikipedia entry, “an entity is something that has a distinct, separate existence, though it need not be a material existence”. An entity could be, for instance, an organization, a person, an animal, a location or an event. The purpose of the search is to find the group of entities that is semantically defined by a user query (expected to be very short) and, optionally, an entity type. Entity ranking systems are useful for providing a very condensed, mind-map-like overview of a given topic.

Example: query: “dancing pigs”, category: “fictional pigs”. Result:

What entities to search?

We use the INEX Wikipedia XML collection (its 2006 snapshot) as our document set. It differs from the real Wikipedia in some respects (size, document format, category tables), but it is a very realistic approximation. In Wikipedia, wiki-pages correspond to entities, which are organized into (or attached to) categories. References to entities and their categories occur frequently in natural language. Our system searches only for entities that have a wiki-page (although the page may be incomplete or empty) or, in other words, for wiki-pages that describe entities.

How to search?

The system expects a user to provide the following information to describe the information need:

  • Topic: a short query describing needed entities
  • Category id: the ID of the root category that should contain the needed entities
  • Level: the search is limited to descendant categories of the root category up to this level
(the root category is Level 0, its children are Level 1, the children of its children are Level 2, and so on).

Note that there are 113,483 categories in the INEX Wikipedia XML collection and that they are not organized in a strict tree structure: cycles in the category graph are common. An entity that belongs to a given category is generally not assigned directly to that category's ancestors. That is why the explicit expansion of the category set up to a given Level is important. Although expansion along the parent relation is also possible, we expand only along the child relation.
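The level-limited expansion described above can be sketched as a breadth-first traversal that remembers visited categories, so that cycles in the category graph do not cause endless looping. This is only an illustration in Python, not the code used by the system; the function name `expand_categories`, the `children` mapping and the toy category ids are all hypothetical.

```python
from collections import deque


def expand_categories(children, root_id, max_level):
    """Return {category_id: level} for the root and its descendants.

    `children` maps a category id to the ids of its direct child
    categories (a hypothetical in-memory form of the category graph).
    Only the child relation is followed, never the parent relation.
    """
    levels = {root_id: 0}          # the root category is Level 0
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        if levels[current] >= max_level:
            continue               # do not expand beyond the requested level
        for child in children.get(current, ()):
            if child not in levels:    # visited check makes cycles harmless
                levels[child] = levels[current] + 1
                queue.append(child)
    return levels


# Hypothetical toy graph: category 4 points back to the root (a cycle).
toy_children = {1: [2, 3], 2: [4], 3: [4], 4: [1]}

level_1 = expand_categories(toy_children, root_id=1, max_level=1)
level_2 = expand_categories(toy_children, root_id=1, max_level=2)
print(sorted(level_1))  # categories reachable within one level
print(sorted(level_2))  # categories reachable within two levels
```

Because each category is recorded with the level at which it is first reached, a category that is reachable along several paths is counted once, at its shallowest level.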

How does the black box work?

The system currently ranks entities using a trivial approach: entities are ranked by the relevance of their Wikipedia articles. Articles are ranked with respect to a query using the language modeling approach to IR and are subsequently filtered with the provided and expanded list of categories. Both retrieval and filtering are done by the PF/Tijah retrieval system. For details about how to index the INEX collection, see our INEX demo description. Note that we indexed categories and articles as separate collections. The XQuery that we use to generate the entity ranking result is the following:

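     (: paging offset, query text, root category id and expansion depth :)
     let $start := 1
     let $query := "dancing pigs"
     let $id := 9964
     let $level := 2
     let $opt := <TijahOptions returnNumber="300" />
     let $tquery := tijah:tokenize($query)
     (: NEXI query: articles about the tokenized topic :)
     let $nexi := concat("//article[about(.,", $tquery, ")]")
     (: ids of the child categories of the root, limited by level :)
     let $ids := doc("categories.xml")//category[tag/@id =
                        $id]/childs[@level < $level]/child/@id
     (: keep ranked articles tagged with the root or an expanded category :)
     let $matching := for $x in tijah:queryall($nexi, $opt)
       where $x/category/@id = $id or $x/category/@id = $ids return $x
     let $r := count($matching)
     (: return the total number of matches and one page of ten results :)
     return <result total="{$r}"> {
      for $a in subsequence($matching, $start, 10)
      return <item> { $a/name, <tags> { $a/category } </tags> } </item>
     } </result>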
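The filtering and paging logic of the query above can also be read as the following small Python sketch. It is only an illustration under stated assumptions: the ranked article list is a hypothetical in-memory stand-in for the output of the retrieval step, and the function name `filter_and_page` and the dictionary keys are invented for this example.

```python
def filter_and_page(ranked_articles, allowed_ids, start=1, page_size=10):
    """Filter a ranked article list by category and return one result page.

    `ranked_articles` is assumed to be already sorted by relevance; each
    item is a dict with a "name" and a list of "category_ids".
    `start` is 1-based, like the position argument of XQuery's subsequence().
    """
    allowed = set(allowed_ids)
    # Keep an article if at least one of its categories is allowed;
    # the relevance order of the input list is preserved.
    matching = [a for a in ranked_articles if allowed & set(a["category_ids"])]
    page = matching[start - 1 : start - 1 + page_size]
    return len(matching), page


# Hypothetical ranked list, best match first.
ranked = [
    {"name": "A", "category_ids": [9964]},
    {"name": "B", "category_ids": [5]},
    {"name": "C", "category_ids": [7, 12]},
]

# Root category 9964 plus one (hypothetical) expanded child category 12.
total, page = filter_and_page(ranked, {9964, 12}, start=1, page_size=10)
print(total, [a["name"] for a in page])
```

Filtering after ranking means the total can be smaller than the number of retrieved articles, so a retrieval limit such as `returnNumber="300"` bounds how many matching entities can ever be reported.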

References

  • INEX Entity-Ranking Track Guidelines.
  • Djoerd Hiemstra, Henning Rode, Roel van Os and Jan Flokstra. PF/Tijah: text search in an XML database system. In Proceedings of the 2nd International Workshop on Open Source Information Retrieval (OSIR), pages 12-17, 2006.
  • Tsikrika, T., Serdyukov, P., Rode, H., Westerveld, T., Aly, R., Hiemstra, D. and de Vries, A. Structured Document Retrieval, Multimedia Retrieval, and Entity Ranking Using PF/Tijah. In Proceedings of the 6th Initiative on the Evaluation of XML Retrieval (INEX), Dagstuhl, Germany, 2007.
  • James A. Thom, Jovan Pehcevski and Anne-Marie Vercoustre. Use of Wikipedia Categories in Entity Ranking. In Proceedings of the 12th Australasian Document Computing Symposium, Melbourne, Australia, 2007.
Page last modified on June 20, 2008, at 01:35 PM