About INEX Entity Search

What is entity search?

Entity ranking is the task of matching a text query to a predefined set of entities that are usually mentioned in text documents and/or described explicitly. According to the corresponding Wikipedia entry, “an entity is something that has a distinct, separate existence, though it need not be a material existence”. An entity could be, for instance, an organization, a person, an animal, a location, or an event. The purpose of the search is to find a group of entities that is semantically defined by a user query (expected to be very short) and, optionally, by an entity type. Entity ranking systems are useful for providing a very condensed, mind-map-like overview of a given topic.

Example: query: “dancing pigs”, category: “fictional pigs”. Result: [figure: the returned list of entities]

What entities to search?

We use the INEX Wikipedia XML collection (its 2006 snapshot) as our document set. It differs from the real Wikipedia in some respects (size, document format, category tables), but it is a very realistic approximation. In Wikipedia, wiki pages correspond to entities, which are organized into (or attached to) categories. References to entities and their categories occur frequently in natural language. Basically, our system searches only for entities that have a wiki page (although the page may be incomplete or empty), or in other words, for wiki pages that describe entities.

How to search?

The system expects a user to provide the following information to describe the information need:
(Root Category (Level 0) and its Children (Level 1) and Children of its Children (Level 2) -> ...)
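As a rough illustration of the level numbering above, the expansion of a root category can be sketched as a breadth-first walk over child links. This is only a sketch under assumptions: the `children` mapping, the category names, and the function name `expand_categories` are hypothetical, and treating the level bound as inclusive is a choice made here, not something the system description confirms. A visited check is included so that the walk also terminates when the category graph is not a tree.

```python
from collections import deque


def expand_categories(children, root, max_level):
    """Return {category: level} for the root and its descendants.

    children  -- hypothetical mapping: category -> list of child categories
    root      -- the root category (Level 0)
    max_level -- deepest level to include (inclusive; an assumption)
    """
    levels = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if levels[cat] >= max_level:
            continue  # do not descend below the requested level
        for child in children.get(cat, ()):
            if child not in levels:  # already seen: skip, so cycles are harmless
                levels[child] = levels[cat] + 1
                queue.append(child)
    return levels


# Hypothetical toy graph; the last entry links back to the root (a cycle).
children = {
    "pigs": ["fictional pigs", "pig breeds"],
    "fictional pigs": ["cartoon pigs"],
    "cartoon pigs": ["pigs"],
}

level_1 = expand_categories(children, "pigs", 1)
level_2 = expand_categories(children, "pigs", 2)
```

Here `level_1` contains only the root and its direct children, while `level_2` also reaches the grandchild. Because breadth-first order assigns each category the smallest level at which it is reachable, a category linked from several places is counted once.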
Note that there are 113,483 categories in the INEX Wikipedia XML collection and they are not organized in a strict tree structure, so cycles in the category graph are common. An entity that belongs to a given category generally does not belong to its ancestors. That is why the explicit expansion of the category set up to a given level is important. Although expansion along the parent relation is possible, we expand only along the child relation.

How does the black box work?

The system currently ranks entities using a trivial approach: entities are ranked by the relevance of their Wikipedia articles. Articles are ranked with respect to a query using the language-modeling approach to IR and are subsequently filtered with the provided and expanded list of categories. Retrieval and filtering are done by the PF/Tijah retrieval system. For details about how to index the INEX collection, see our INEX demo description. Note that we indexed categories and articles as separate collections. The XQuery that we use to generate the entity ranking result is the following:

let $start := 1
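let $query := "dancing pigs"
(: id of the root category chosen by the user :)
let $id := 9964
(: expansion depth for child categories :)
let $level := 2
let $opt := <TijahOptions returnNumber="300" />
let $tquery := tijah:tokenize($query)
(: NEXI query: articles about the tokenized query terms :)
let $nexi := concat("//article[about(.,", $tquery, ")]")
(: ids of the child categories of $id below the requested level :)
let $ids := doc("categories.xml")//category[tag/@id = $id]/childs[@level < $level]/child/@id
(: keep only ranked articles attached to the root category or to an expanded one :)
let $matching := for $x in tijah:queryall($nexi, $opt)
                 where $x/category/@id = $id or $x/category/@id = $ids
                 return $x
let $r := count($matching)
(: report the total number of matches and the first page of 10 results :)
return <result total="{$r}"> {
    for $a in subsequence($matching, $start, 10)
    return <item> { $a/name, <tags> { $a/category } </tags> } </item>
} </result>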
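
The same rank-then-filter idea can be sketched outside PF/Tijah in a few lines of Python. This is not the system's implementation: the scoring function below is a generic query-likelihood language model with Jelinek-Mercer smoothing (the smoothing method and the value `lam=0.5` are assumptions, since the description above only says "language modeling"), and the article names, tokens, and category ids are made up for illustration.

```python
import math
from collections import Counter


def query_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Log-likelihood of the query under a smoothed document model.

    Jelinek-Mercer smoothing is an assumed choice for this sketch.
    """
    tf = Counter(doc)
    doc_len = max(len(doc), 1)  # avoid division by zero for empty articles
    score = 0.0
    for term in query:
        if coll_tf[term] == 0:
            continue  # term unseen in the collection: identical for all docs
        p = (1 - lam) * tf[term] / doc_len + lam * coll_tf[term] / coll_len
        score += math.log(p)
    return score


def rank_and_filter(query, articles, allowed, start=1, page_size=10):
    """Rank all articles, then keep those in an allowed category.

    articles -- hypothetical mapping: name -> (tokens, category ids)
    allowed  -- set of category ids (root plus expanded children)
    start    -- 1-based offset, like subsequence() in the XQuery above
    """
    coll_tf = Counter(t for tokens, _ in articles.values() for t in tokens)
    coll_len = sum(coll_tf.values())
    ranked = sorted(
        articles,
        key=lambda n: query_likelihood(query, articles[n][0], coll_tf, coll_len),
        reverse=True,
    )
    matching = [n for n in ranked if allowed & set(articles[n][1])]
    return len(matching), matching[start - 1:start - 1 + page_size]


# Made-up toy collection.
articles = {
    "article-a": ("dancing pigs on stage".split(), ["cat-fictional"]),
    "article-b": ("pigs on a farm".split(), ["cat-farm"]),
    "article-c": ("a story about a dancing bear".split(), ["cat-fictional"]),
}
total, page = rank_and_filter(["dancing", "pigs"], articles, {"cat-fictional"})
```

As in the XQuery, filtering happens after ranking, so the category list never influences the relevance scores; it only removes articles from the ranked list before the requested page is cut out.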