Making Sense of Microposts (#MSM2013)

Big things come in small packages

Concept Extraction Challenge

Theme: Making Sense of Microposts: Big things come in small packages

Award Sponsor: eBay


Existing concept extraction tools are designed for news articles and similar corpora of relatively long documents. The aim of this challenge is to foster research into novel, more accurate concept extraction for (much shorter) Micropost data.

Concepts are defined as 'abstract notions of things'. For the purposes of this challenge we constrain the task to the extraction of entity concepts in Micropost data, characterised by a type and a value. We restrict the classification to four entity types — where the Micropost contains a reference to:

  1. a Person (PER) — full or partial person names
    Data sample:
    "Obama responds to diversity criticism"
    Extracted instance(s):
    PER/Obama;

  2. a Location (LOC) — full or partial (geographical or physical) location names, including: cities, provinces or states, countries, continents and (physical) facilities
    Data sample:
    "Finally on the train to London ahhhh"
    Extracted instance(s):
    LOC/London;

  3. an Organisation (ORG) — full or partial organisation names, including academic, state, governmental, military and business or enterprise organisations
    Data sample:
    "NASA's Donated Spy Telescopes May Aid Dark Energy Search"
    Extracted instance(s):
    ORG/NASA;

  4. Miscellaneous (MISC) — a concept not covered by any of the categories above, but limited to one of the entity types: film/movie, entertainment award event, political event, programming language, sporting event and TV show.
    Data sample:
    "Okay, now this is getting seriously bizarre. Like a Monty Python script gone wrong."
    Extracted instance(s):
    MISC/Monty Python;


Last update: 16 Mar 2013

The (latest version of the) complete dataset (containing training and test data, with a 60% / 40% split) contains 4341 manually annotated microposts, on a variety of topics, including comments on the news and politics, collected from the end of 2010 to the beginning of 2011. We are also maintaining a list of changes to the dataset.

Training Dataset

TSV data with three tab-separated elements per row:

  • Element 1: numeric ID of the micropost
  • Element 2: concepts found within the micropost — as semi-colon separated entity type/instance pairs (e.g. PER/Obama;ORG/NASA)
  • Element 3: micropost content — from which concepts have been detected and extracted

Sample snapshot (concept instances appear in the second column):

      173   ORG/Amazon;   _Mention_ there is no way one can explain this book that will sound reasonable and shame on Amazon !

      1844   PER/Obama;PER/Andy Borowitz;ORG/White House;   "Hu Presents Obama with Counterfeit DVD : Fake news by Andy Borowitz In a moving White House ceremony today , President ... _URL_"

      1846   Hurricane simulator : pay $ 2 to stand in a glass booth and get wind blown on you . This is real .
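The training rows above can be read with a few lines of code. The sketch below is illustrative, not part of the challenge materials: it assumes the three tab-separated elements described above, with an empty concept column when a micropost contains no entities (as in row 1846), and function names are our own.

```python
import csv
from io import StringIO

def parse_training_row(row):
    """Split one training row into (id, [(type, instance), ...], text).

    Each row has three tab-separated elements; the concept column holds
    semicolon-separated TYPE/instance pairs and may be empty when the
    micropost contains no entities.
    """
    micropost_id, concepts_field, text = row[0], row[1], row[2]
    concepts = []
    for pair in concepts_field.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # trailing semicolons leave empty fragments
        # entity types (PER, LOC, ORG, MISC) contain no slash,
        # so splitting on the first "/" is safe
        etype, _, instance = pair.partition("/")
        concepts.append((etype, instance))
    return micropost_id, concepts, text

sample = "173\tORG/Amazon;\t_Mention_ there is no way one can explain this book"
reader = csv.reader(StringIO(sample), delimiter="\t")
print(parse_training_row(next(reader)))
```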

Test Dataset

Unannotated TSV data, with two tab-separated elements per row:

  • Element 1: numeric ID of the micropost
  • Element 2: micropost content — from which concepts are to be detected and extracted

The challenge task is to detect and extract concepts contained in the microposts.

Sample snapshot:

      2573   Politics is the art of preventing people from taking part in affairs which properly concern them . <NEWLINE> - Paul Valery
      2574   "Pork chops , dirty rice , steamed vegetables , & Texas toast

Anonymisation and Special Terms

To ensure anonymity all username mentions in the microposts have been replaced with "_Mention_" and all URLs with "_URL_".

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License

Download the data

back to top

Submission & Evaluation

Deadline (extended from 17 Mar): 20 Mar 2013
  • Intent to submit — 03 Mar 2013: create a new record in EasyChair with the title (prefaced with "MSM2013 IE Challenge"), authors and, if available, a brief abstract (no more than 150 words).

  • Extended abstract: 2 page description of the approach taken, to allow reviewers insight into the quality and applicability of the solution
  • Classification results, as TSV:
    • Element 1: numeric ID of each micropost
    • Element 2: concepts detected within the micropost — as semi-colon separated entity type/instance pairs (e.g. PER/Obama;ORG/NASA)

Sample submission:


      173   ORG/Amazon;

      1844   PER/Obama;PER/Andy Borowitz;ORG/White House;



Written submissions should be prepared according to the ACM SIG Proceedings Template, and should include author names and affiliations, and 3-5 keywords.
Submission will be via the EasyChair Conference System. Each submission should be made as a single, unencrypted zip file containing:

  • a TSV file with the tool/system name (e.g. awesomeo9000.tsv)
  • an extended abstract (2 pages) describing the approach taken, including tuning and/or testing using the training/test data split
  • a plain text file listing the archive contents, if any additional material is included

  • Note — Option to submit multiple runs:
    You may submit up to 3 individual runs that vary your approach by modifying selected parameters for identifying and extracting concepts. If you do so, you must specify clearly in your extended abstract the modifications applied to each labelled run. In this case the submission is as above, AND:
    • the complete solution must be submitted by the deadline
    • the archive should contain up to 3 TSV files named after the tool/system, with "_n" appended to each (e.g. awesomeo9000_1.tsv, awesomeo9000_2.tsv ...)
    • the extended abstract (2 pages) must state clearly which modification matches each labelled run.


Evaluation of accuracy

Accuracy will be judged using the F-measure (with β = 1, weighting precision and recall equally). This will be computed on a per entity type/instance pair basis before averaging across the four entity types. Entity type-specific F-measure values for each team will also be provided by the evaluators, to assess how each approach fares across the entity types.
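As a rough illustration, a per-type F-measure (β = 1) could be computed as below. This is only a sketch, not the official scoring script: it assumes exact string matching of type/instance pairs pooled over the whole test set, which may differ from the evaluators' exact procedure.

```python
from collections import Counter

ENTITY_TYPES = ("PER", "LOC", "ORG", "MISC")

def f1_per_type(gold, predicted, types=ENTITY_TYPES):
    """Per-type F1 over exact (type, instance) pair matches.

    gold, predicted: iterables of (type, instance) pairs pooled over the
    test set. A type absent from both gold and predictions scores 0 here;
    the official averaging may treat that case differently.
    """
    scores = {}
    gold_c, pred_c = Counter(gold), Counter(predicted)
    for t in types:
        g = sum(n for (ty, _), n in gold_c.items() if ty == t)   # gold pairs of type t
        p = sum(n for (ty, _), n in pred_c.items() if ty == t)   # predicted pairs of type t
        tp = sum(min(n, pred_c[k]) for k, n in gold_c.items() if k[0] == t)
        prec = tp / p if p else 0.0
        rec = tp / g if g else 0.0
        scores[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def macro_f1(gold, predicted):
    """Average the per-type F1 scores across the four entity types."""
    scores = f1_per_type(gold, predicted)
    return sum(scores.values()) / len(scores)
```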

For submissions containing multiple runs, each run will be evaluated independently, and the submission judged on its best-performing run.

Each written submission describing the approach taken will also be reviewed, and will contribute to the determination of the winning submission. Accepted abstracts will be published in a separate compendium via CEUR, and will also be available from the workshop website. All accepted submissions will be presented as posters and/or demos, and a selection will be included in the workshop presentations.

back to top

Challenge Award

A prize of US$ 1,500, generously sponsored by eBay, will be awarded to the challenge winner. The Challenge Evaluation Committee will judge the submissions based on the evaluation criteria and the extended abstracts describing the approach taken.

The information extraction challenges posed by eBay item listings, which often have short textual content, are very similar to those posed by other short textual microposts. By teaming up with eBay to make the challenge possible, the MSM workshop organisers wish to highlight this aspect of the micropost extraction research question.

back to top

Challenge Chair

Amparo Elizabeth Cano, KMi, The Open University, UK


Evaluation Committee

Naren Chittar, eBay, USA
Óscar Corcho, Universidad Politécnica de Madrid
Danica Damljanovic, Kuato Studios, London, UK
Anna Lisa Gentile, University of Sheffield, UK
Diana Maynard, The University of Sheffield, UK
Peter Mika, Yahoo! Research, Spain
Enrico Motta, KMi, The Open University, UK
Daniel Preotiuc, The University of Sheffield, UK
Alan Ritter, University of Washington, USA
Giuseppe Rizzo, Eurecom, France
Raphaël Troncy, Eurecom, France
Victoria Uren, Aston Business School, UK
Andrea Varga, The University of Sheffield, UK

Download the detailed challenge call [text]


Contact workshop organisers or challenge chair and/or join the #MSM2013 mailing list

back to top