Semantically Annotated Snapshot of the English Wikipedia (SW v.1)

Hugo Zaragoza (hugoz@yahoo-inc.com), Jordi Atserias, Massimiliano Ciaramita (Yahoo! Research Barcelona) and Giuseppe Attardi (U. Pisa, on sabbatical at Yahoo! Research).

NOTE: If your distribution is not dated December 2007, you are using corrupted data! A bug in the first release of SW1, related to the creation of the “link” column, removed many sentences.

Introduction

We are releasing a snapshot of the English Wikipedia, tagged with the following information:

  • Sentence and token splitting
  • Part of Speech, Named Entities and Semantic tagging.
  • Dependency Parsing.

Furthermore, we are releasing some data structures derived from this data:

  • Entity lists
  • Entity co-occurrence lists
  • Entity Containment Graph

All this data was obtained using Open Source software on the Wikipedia, and therefore it can be replicated at any site. We are providing it here in the current form as a resource for researchers.

The corpus is made of 1,490,688 entries, automatically split into 74,924,392 sentences. It contains 148.8M occurrences of 20.3M unique WSJ named entities (entity string+type).

Please email questions to hugoz@yahoo-inc.com

Licensing

This data is licensed under the GNU Free Documentation License. It copies material from the English Wikipedia (http://en.wikipedia.org). Links back to specific entries used can be found in the data.

Referencing

Please reference this data as follows:

  • Hugo Zaragoza, Jordi Atserias, Massimiliano Ciaramita and Giuseppe Attardi, Semantically Annotated Snapshot of the English Wikipedia v.1 (SW1), http://www.yr-bcn.es/semanticWikipedia, 2007.

Bibtex:

@misc{zaragoza:sw1,
  title = "Semantically Annotated Snapshot of the English Wikipedia v.1 (SW1)",
  author = "H. Zaragoza and J. Atserias and M. Ciaramita and G. Attardi",
  howpublished = "\url{http://www.yr-bcn.es/semanticWikipedia}",
  year = "2007"
}

(Previous snapshots:)

@misc{zaragoza:sw0,
  title = "Semantically Annotated Snapshot of the English Wikipedia v.0 (SW0)",
  author = "H. Zaragoza and J. Atserias and M. Ciaramita and G. Attardi",
  howpublished = "\url{http://www.yr-bcn.es/semanticWikipedia}",
  year = "2007"
}

The tagger used to generate this data is open source: the SuperSense Tagger, available at http://sourceforge.net/projects/supersensetag/. The suggested reference for this tagger is:

  • [Ciaramita+Altun’2006] M. Ciaramita, Y. Altun, “Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), (2006)

The dependency parser used to generate this data is open source and can be found here: http://desr.sourceforge.net/doc. The suggested reference to this parser is:

  • [Attardi+’2007] G. Attardi, F. Dell’Orletta, M. Simi, A. Chanev and M. Ciaramita. Multilingual Dependency Parsing and Domain Adaptation using DeSR. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, Prague, 2007.

The entity graph was built using the fastutil, MG4J and WebGraph open-source libraries from the University of Milan.

TagSets

The different tagsets are explained here.

See the type distribution here.

Quality

All tags (POS, semantic, dependencies) were obtained by automatic processes that make many mistakes. Unfortunately we do not have precise information on the quality or accuracy of the different tags.

There are several sources of indirect data that you can use to decide if the quality of our data is good enough for your needs:

  • quality of the tagger and parser on standard collections: see references above.
  • quality of the tagger in our collection at the sentence level: see
    • Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza and Jordi Atserias, 2008. Learning to Tag and Tagging to Learn: A Case Study on Wikipedia, in IEEE Intelligent Systems, vol.23 n.5 pp.26-33. http://grupoweb.upf.es/hugoz/pdf/mika_ieee08.pdf
  • other papers published with this data: see section below

Other Projects and Publications Using this Data

  • Hugo Zaragoza, Henning Rode, Peter Mika, Jordi Atserias, Massimiliano Ciaramita & Giuseppe Attardi. Ranking Very Many Typed Entities on Wikipedia. In CIKM ’07: Proceedings of the sixteenth ACM international conference on Information and knowledge management, 2007.
  • Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza and Jordi Atserias, 2008. Learning to Tag and Tagging to Learn: A Case Study on Wikipedia, in IEEE Intelligent Systems, vol.23 n.5 pp.26-33.
  • D. Vallet, H. Zaragoza. 2008. Inferring the Most Important Types of a Query: a Semantic Approach. SIGIR’2008 (poster).

Data

The corpus is distributed through Yahoo!’s Webscope program. To request it, please email Kim Capps-Tanaka at research-data-requests@yahoo-inc.com and ask for “Yahoo! Semantically Annotated Snapshot of English Wikipedia SW1”.

Data Source

The SW1 corpus is a snapshot of the English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. To build SW1, we started from the XML-ized Wikipedia dump distributed by the University of Amsterdam (http://ilps.science.uva.nl/WikiXML/). This snapshot of the English Wikipedia contains 1,490,688 entries (excluding redirects). First, the text was extracted from each XML entry and split into sentences using simple heuristics. Then we ran several syntactic and semantic NLP taggers on it and collected their output.

Data format

The multitag format contains all the Wikipedia text plus all the semantic tags. All other data files can be reconstructed from this. The multitag format is described below.

A multitag file contains several Wikipedia entries. The Wikipedia snapshot was cut into 3000 multitag files each containing roughly 500 entries.

Why not XML?


Obtaining the dataset

The corpus is now distributed through Yahoo!’s Webscope program and no longer available for download from this page.

To request it, please email Kim Capps-Tanaka at research-data-requests@yahoo-inc.com and ask for “Yahoo! Semantically Annotated Snapshot of English Wikipedia SW1”.

Multitag format:
# The Semantically Annotated Snapshot of Wikipedia (SW v.1) is a modification
#  of the English Wikipedia.
# 
# More information about SW can be found at: 
#  http://www.yr-bcn.es/dokuwiki/doku.php?id=semantically_annotated_snapshot_of_wikipedia
#
# More information about Wikipedia can be found at: 
#  http://en.wikipedia.org
# 
# SW1 is licensed under the GNU Free Documentation License 
#  (http://www.gnu.org/copyleft/fdl.html). It copies material from the English
#  Wikipedia (http://en.wikipedia.org). Links back to specific entries used can
#  be found in the data.
# 
# Hugo Zaragoza (hugoz@yaho-inc.com) & Jordi Atserias, 10th of December 2007.
# 
FILENAME wiki816
token   POS     lemma   CONL    WNSS    WSJ     ana     head    deplabel     link
%%#DOC  wiki816.24176   
%%#PAGE Pablo_Picasso 
.....
%%#SEN 22476 wx10  
Pablo   NNP     pablo   B-PER   B-noun.person   B-E:PERSON      0       2       NMOD    0
Picasso NNP     picasso I-PER   I-noun.person   I-E:PERSON      0       14      SBJ     0
(       (       (       0       0       0       0       4       P       0
October NNP     october 0       B-noun.time     B-T:DATE:DATE   0       2       PRN     B-/wiki/October_25
25      CD      25      0       B-adj.all       I-T:DATE:DATE   0       4       NMOD    I-/wiki/October_25
,       ,       ,       0       0       I-T:DATE:DATE   0       4       P       0
1881    CD      1881    0       0       I-T:DATE:DATE   0       9       NMOD    B-/wiki/1881
–       NNP     –       0       0       I-T:DATE:DATE   0       9       NMOD    0
April   NNP     april   0       B-noun.time     I-T:DATE:DATE   0       4       NMOD    B-/wiki/April_8
8       CD      8       0       0       I-T:DATE:DATE   0       9       NMOD    I-/wiki/April_8
,       ,       ,       0       0       I-T:DATE:DATE   0       4       P       0
1973    CD      1973    0       0       I-T:DATE:DATE   0       4       NMOD    B-/wiki/1973
)       )       )       0       0       0       0       4       P       0
was     VBD     be      0       B-verb.stative  0       0       0       ROOT    0
a       DT      a       0       0       0       0       18      NMOD    0
Spanish JJ      spanish B-MISC  B-adj.pert      B-E:NORP:NATIONALITY    0       18      NMOD    B-/wiki/Spain
painter NN      painter 0       B-noun.person   B-E:PER_DESC    0       18      COORD   B-/wiki/Painter
and     CC      and     0       0       0       0       14      VMOD    0
sculptor        NN      sculptor        0       B-noun.person   B-E:PER_DESC    0       18      COORD   B-/wiki/Sculpture
.       .       .       0       0       0       0       14      P       0
%%#SEN 22477 wx11
One     CD      one     0       0       B-N:CARDINAL    0       13      ADV     0
of      IN      of      0       0       0       0       1       NMOD    0
the     DT      the     0       0       0       0       6       NMOD    0
most    RBS     most    0       B-adv.all       0       0       5       AMOD    0
recognized      VBN     recognize       0       B-adj.all       0       0       6       NMOD    0
figures NNS     figure  0       B-noun.quantity 0       0       2       PMOD    0
in      IN      in      0       0       0       0       6       NMOD    0
20th    JJ      20th    0       B-adj.all       B-T:DATE:DATE   0       10      NMOD    0
century NN      century 0       B-noun.time     I-T:DATE:DATE   0       10      NMOD    0
art     NN      art     0       B-noun.artifact 0       0       7       PMOD    B-/wiki/Art
,       ,       ,       0       0       0       0       13      P       0
he      PRP     he      0       0       0       Pablo   Picasso 13      0
is      VBZ     be      0       B-verb.stative  0       0       0       ROOT    0
best    RB      best    0       B-adv.all       0       0       13      ADV     0
known   VBN     know    0       B-adj.all       0       0       13      VC      0
as      IN      as      0       0       0       0       15      ADV     0
the     DT      the     0       0       0       0       18      NMOD    0
co-founder      NN      cofounder       0       B-noun.person   B-E:PER_DESC    0       16      PMOD    0
,       ,       ,       0       0       0       0       15      P       0
along   IN      along   0       0       0       0       15      ADV     0
with    IN      with    0       0       0       0       20      PMOD    0
Georges NNP     george  B-PER   B-noun.person   B-E:PERSON      0       23      NMOD    B-/wiki/Georges_Braque
Braque  NNP     braque  I-PER   I-noun.person   I-E:PERSON      0       21      PMOD    I-/wiki/Georges_Braque
...
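
As a rough illustration of how a file like the sample above might be read, here is a minimal Python sketch. It rests on several assumptions that are ours, not part of the distribution: content lines are tab separated, each content line has the ten columns named in the header line, and every line before the first %%#DOC marker is file header. The name read_multitag is hypothetical. The sketch treats any line starting with %%# as a boundary between groups of token lines and does not try to interpret the sentence numbers.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Column names as they appear in the header line of the sample.
COLUMNS = ["token", "POS", "lemma", "CONL", "WNSS", "WSJ",
           "ana", "head", "deplabel", "link"]

Row = Dict[str, str]


def read_multitag(lines: Iterable[str]) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (document_id, rows) for each group of token lines.

    Assumption: everything before the first %%#DOC line is file header
    (comments, FILENAME line, column header) and is skipped.
    """
    doc_id: Optional[str] = None
    rows: List[Row] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("%%#"):
            # A marker line closes the group of token lines collected so far.
            if rows:
                yield doc_id, rows
                rows = []
            parts = line.split()
            if parts[0] == "%%#DOC" and len(parts) > 1:
                doc_id = parts[1]
            continue
        if doc_id is None or not line.strip():
            continue  # still in the file header, or an empty line
        fields = line.split("\t")
        # Rows with fewer fields than columns simply lack the last keys.
        rows.append(dict(zip(COLUMNS, fields)))
    if rows:
        yield doc_id, rows
```

Passing an open file object (for example one opened on wiki816 in text mode) to read_multitag yields one list of row dictionaries per sentence; the text encoding to use when opening the file is not stated on this page and would need to be checked against the data.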

  • Each file starts with a number of # comment lines, followed by a line indicating the filename and a line indicating the column headers.
  • Each document then starts with the line %%#DOC <documentID>, followed by a %%#PAGE line giving the page title (see the sample above).
  • Each sentence ends with the line %%#SEN <sentence_number>.
  • Content lines are tab separated; there is one column for the tokens plus one column per tagset.
  • TAG columns are (see tagsets for detailed description and references)
    • token : token
    • POS : part of speech
    • lemma : token lemma
    • CONL : CoNLL named-entity tag
    • WNSS : WordNetSuperSense tag
    • WSJ: Wall Street Journal Entities tag
    • ana: anaphora resolution (very basic algorithm: last person mentioned)
    • head: head of dependency
    • deplabel : dependency label
    • link: href or wikipedia link in the original document
  • Entities can encompass several tags. Starting and ending of entities are indicated as follows:
    • 0 : no tag
    • B-TAG : starting of a tagged entity
    • I-TAG : continuation of a tagged entity
    • (Therefore, an entity should start with a token marked B-TAG, contain zero or more further tokens marked I-TAG, and end just before a token marked 0 or B-TAG, or at the end of a line. However, this is not enforced, and you will find some sentences where I- tags appear without a starting B-.)
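
The B-/I- convention described above can be decoded into entity spans with a few lines of code. The following Python sketch is only one possible reading of that convention: the function name bio_spans is hypothetical, and the choice to treat an I- tag without a preceding B- as the start of a new entity is our assumption for coping with the unenforced cases mentioned above.

```python
from typing import List, Optional, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) spans for one tag column.

    `tags` holds the values of a single column (e.g. WSJ) for the
    tokens of one sentence, in order.
    """
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # continuation of the entity that is currently open
        # Any other tag (0, B-..., or a non-matching I-...) closes it.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: an orphan I- tag is treated as an entity start.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans
```

Applied to the WSJ column of the first sample sentence, this gives one E:PERSON span covering "Pablo Picasso" and one T:DATE:DATE span covering the tokens from "October" to "1973". The same function works on the CONL, WNSS and link columns, since they follow the same prefix convention.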

Files, Entries and Documents: A “document” is a Wikipedia entry (a Wikipedia page). The tagged documents were concatenated into several large files for ease of manipulation. Each file contains several thousand entries (xxx); entries are never split across files.

Document IDs: The DOCUMENT ID xxx is a string which corresponds to xxx. Wikipedia does not provide mappings from these IDs to current Wikipedia URLs, because these URLs may change over time. However, for almost all entries you can reach the current URL by searching on Wikipedia with the entry title (first sentence), concatenating its tokens with “%20”.
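
A small Python sketch of the title lookup just described: it joins the title tokens with spaces and percent-encodes the result, so that each space becomes “%20”. The search endpoint used here (Special:Search on en.wikipedia.org) and the function name wikipedia_search_url are our assumptions; any Wikipedia search entry point should serve the same purpose.

```python
from typing import List
from urllib.parse import quote

# Assumed search entry point; not specified in the dataset documentation.
SEARCH_BASE = "https://en.wikipedia.org/wiki/Special:Search?search="


def wikipedia_search_url(title_tokens: List[str]) -> str:
    """Build a search URL from the tokens of an entry title."""
    query = " ".join(title_tokens)
    # quote() percent-encodes the query; spaces become %20.
    return SEARCH_BASE + quote(query)
```

For the tokens "Pablo" and "Picasso" the query part of the resulting URL is Pablo%20Picasso.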

Dictionaries and Graph Files

The entities.dump file contains information about which entity appears in which sentence, for the entire collection. Use this file if you would like to build containment graphs and you don’t care about the actual Wikipedia text, only the entities.

Binary Graph Files

Sorry, we are not providing these for now... if you build them and you would like us to distribute them, we’ll be happy to.

Previous Releases

  • SW0 : We no longer distribute SW0 as it was an alpha release and contained bugs. If you need it for some weird reason, contact us.
 
semantically_annotated_snapshot_of_wikipedia.txt · Last modified: 2010/04/08 15:12 by hugoz