Note: A full description of the data, with examples, is given in the French version of this page.
The DEFT (défi fouille de textes) workshop is a text mining competition that has been organized since 2005 and focuses each year on a different research theme. This year, the DEFT workshop is co-organized by the LINA lab and INIST.
For the 12th edition of DEFT, we propose to address the issue of indexing scientific documents in French. The task is to generate sets of keyphrases for bibliographic records from four different research areas (linguistics, information science, archeology and chemistry) for which the reference keyphrase annotations were produced by professional indexers.
As in the 2012 edition of the DEFT workshop, we propose to address the issue of indexing scientific documents in French through keyphrases. While in DEFT-2012 the task was to identify author-assigned keyphrases, this year's task focuses on identifying keyphrases produced by professional indexers. In contrast to author-assigned keyphrases, those assigned by professional indexers follow a standard procedure designed for indexing records in a bibliographic database. Professional indexers rely on both the content of the document and a domain-specific thesaurus to assign coherent and comprehensive keyphrases. Coherence implies that a concept is always represented by the same keyphrase across all documents of a given domain. Using the domain thesaurus is therefore particularly important for identifying keyphrases for controlled indexing. Comprehensiveness, however, implies that professional indexers also assign document-specific keyphrases that do not necessarily appear in the domain thesaurus.
Proposed methods should be able to identify important concepts (keyphrases), whether or not they occur in the domain thesaurus. Four collections of bibliographic records from four different research areas (linguistics, information science, archeology and chemistry) are provided. Documents are already pre-processed (sentence splitting, tokenization and part-of-speech tagging). Participants are invited to indicate whether or not their approach relies on the provided domain-specific thesaurus, and may rely on external resources.
- Registration: from February 17, 2016
- Release of the training set: March 2, 2016
- Testing phase: 3 days, to be chosen between April 11, 2016 and April 17, 2016
- Deadline for paper submission: June 3, 2016
- Workshop: July 4, 2016
Training and development set
From February 22, 2016, registered teams will be able to download the training and development sets. This dataset is composed of bibliographic records (titles and abstracts) in TEI format from four different research areas (linguistics, information science, archeology and chemistry), manually indexed by professional indexers from INIST. About 700 bibliographic records per domain will be available, along with a domain-specific thesaurus and reference keyphrases. Bibliographic records are already pre-processed (sentence splitting, tokenization and part-of-speech tagging). More details about the datasets are given on the website.
The test set will be available from April 11, 2016.
Methods will be evaluated in terms of precision (P), recall (R) and F-measure (F) against the reference keyphrases. Stemming (Porter algorithm) will be applied to reduce the number of mismatches.
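As an illustration only, the measures above can be sketched in Python for a single record. This is not the official evaluation script: the function names, the exact-match criterion on stemmed keyphrases, and the per-record scoring are assumptions, and `toy_stem` is a hypothetical stand-in for a real Porter stemmer.

```python
from typing import Callable, Iterable, Tuple


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a Porter stemmer: strips a final 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def normalize(phrase: str, stem: Callable[[str], str]) -> str:
    """Lowercase a keyphrase and stem each whitespace-separated token."""
    return " ".join(stem(tok) for tok in phrase.lower().split())


def evaluate(
    predicted: Iterable[str],
    reference: Iterable[str],
    stem: Callable[[str], str] = toy_stem,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one record.

    Assumption: a predicted keyphrase is correct when its stemmed form
    exactly matches the stemmed form of a reference keyphrase.
    """
    pred = {normalize(p, stem) for p in predicted}
    ref = {normalize(r, stem) for r in reference}
    correct = len(pred & ref)

    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Invented example keyphrases, not taken from the DEFT data.
    system_output = ["analyse syntaxique", "corpus", "langues romanes"]
    indexer_keyphrases = [
        "Analyse syntaxique",
        "langue romane",
        "traduction automatique",
        "sémantique",
    ]
    p, r, f = evaluate(system_output, indexer_keyphrases)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this invented example, stemming lets "langues romanes" match "langue romane", so two of the three predicted keyphrases count as correct against four reference keyphrases. How scores are aggregated over records and domains is not specified here, so the sketch stops at the per-record level.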