wikiedits

Commits

List of commits on branch master.
098e5a8f7bfa635d4ff60fdb93ba8feb8de43aa5

modified to output to stdout edit metadata, changed some var names

davidefiocco committed 7 years ago

6283d6ff7bdd96100860be30626fb277c6b84ae8

Add new links to the WikEd Error Corpus

snukky committed 8 years ago

0e16b484e79695c57c45ae2d074a982a19d139e5

Add conversion scripts

snukky committed 8 years ago

e89a15a92b92dae6c74b36f543729fb427e13689

Delete trailing spaces

snukky committed 9 years ago

0f7264ee19f2ff4d9f70488fd89226f95c8680b1

Fix handling lxml exceptions

snukky committed 9 years ago

d5b08fc51961a68ca27bbd43fc72cf1e7d5fbb05

Add option for output file

snukky committed 9 years ago

README

Wiki Edits 2.0

A collection of scripts for automatically extracting edited sentences from text revision histories, such as Wikipedia revisions. It was used to create the WikEd Error Corpus, a corpus of corrective Wikipedia edits, published in:

@inproceedings{wiked2014,
    author = {Roman Grundkiewicz and Marcin Junczys-Dowmunt},
    title = {The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and its Application to Grammatical Error Correction},
    booktitle = {Advances in Natural Language Processing -- Lecture Notes in Computer Science},
    editor = {Adam Przepiórkowski and Maciej Ogrodniczuk},
    publisher = {Springer},
    year = {2014},
    volume = {8686},
    pages = {478--490},
    url = {http://emjotde.github.io/publications/pdf/mjd.poltal2014.draft.pdf}
}

WikEd Error Corpus

The corpus has been prepared for two languages.

The repository also contains conversion scripts for the WikEd Error Corpus that work independently of Wiki Edits. They can be found in the bin directory.

Requirements

This is a new version of the library and it is not compatible with the old version! Go back to commit 163d771 if you need the old scripts.

This package is tested on Ubuntu with Python 2.7.

Required Python packages:

Optional packages:

Run the tests by typing nosetests from the main directory.

Installation

All requirements can be installed via the Makefile if you have pip installed:

sudo apt-get install python-pip
sudo make all

Usage

Example usage from the main directory:

./bin/txt_edits.py tests/data/lorem_ipsum.old.txt tests/data/lorem_ipsum.new.txt
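
To illustrate what extracting edited sentences from two versions of a text involves, here is a minimal sketch using only the Python standard library. It is not the implementation used by txt_edits.py: the function names, the naive regex sentence splitter (a stand-in for the NLTK punkt tokenizer) and the min_ratio threshold are all assumptions made for this example.

```python
import difflib
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: stand-in for NLTK punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def edited_sentence_pairs(old_text, new_text, min_ratio=0.6):
    """Return (old, new) sentence pairs that differ but remain similar.

    min_ratio is a hypothetical similarity threshold used to discard
    pairs that are rewrites rather than small corrections.
    """
    old_sents = split_sentences(old_text)
    new_sents = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # skip unchanged, inserted and deleted sentences
        # Pair sentences position by position inside the replaced block.
        for old, new in zip(old_sents[i1:i2], new_sents[j1:j2]):
            ratio = difflib.SequenceMatcher(a=old, b=new, autojunk=False).ratio()
            if old != new and ratio >= min_ratio:
                pairs.append((old, new))
    return pairs
```

Pairing by position inside a replaced block is the simplest possible alignment; a real extractor would likely use a more careful sentence alignment and edit-distance filtering.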

And with a Wikipedia dump file:

zcat tests/data/enwiki-20140102.tiny.xml.gz | ./bin/wiki_edits.py
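
The command above streams an uncompressed XML dump into the script. As a rough sketch of how such a stream can be consumed incrementally, the following uses xml.etree.ElementTree.iterparse from the standard library. The repository itself appears to rely on lxml (see the commit log), so this is an illustration under that substitution, not the project's parser; iter_revisions and local_name are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that MediaWiki dumps put on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (page_title, revision_id, comment, text) from a dump stream.

    `stream` is any binary file-like object, for example sys.stdin.buffer
    when the dump is piped in with zcat.
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            # Only direct children are read, so the contributor's nested
            # <id> does not overwrite the revision's own <id>.
            fields = {local_name(child.tag): child.text for child in elem}
            yield (title, fields.get("id"),
                   fields.get("comment") or "", fields.get("text") or "")
            elem.clear()  # free the revision text once it has been used
        elif name == "page":
            elem.clear()


# Usage sketch: for title, rev_id, comment, text in iter_revisions(sys.stdin.buffer): ...
```

Consecutive revisions of the same page yielded by this generator are the natural input for a sentence-level comparison of old and new text.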

The last script in the bin directory can be run with a list of dump files or URLs:

./bin/collect_wiki_edits.py -w /path/to/work/dir dumplist.txt
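
The exact format of dumplist.txt is not described here. Assuming one local path or URL per line, a list like this could be read and classified as follows. The function name, the one-entry-per-line layout and the example URL in the usage note are assumptions for illustration only; the sketch targets Python 3.

```python
from urllib.parse import urlparse


def read_dump_list(lines):
    """Classify each non-empty line as a remote URL or a local file path.

    Assumption: one entry per line; blank lines are ignored.
    """
    entries = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        scheme = urlparse(entry).scheme
        kind = "url" if scheme in ("http", "https", "ftp") else "file"
        entries.append((kind, entry))
    return entries
```

For example, read_dump_list(open("dumplist.txt")) would return pairs such as ("url", "https://example.org/dump.xml.gz") or ("file", "/data/dump.xml.gz"), which a collector could then download or open directly.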

Language-specific options

The scripts are mostly language-independent. A few components need to be updated to run them for non-English languages:

  • model for NLTK punkt tokenizer, see: https://github.com/nltk/nltk_data/tree/gh-pages
  • regular expressions for filtering reverted revisions, see file: wikiedits/wiki/__init__.py
  • list of supported languages, see file: wikiedits/__init__.py
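
As an illustration of the second item, revert filtering can be pictured as a per-language regular expression applied to revision comments. The patterns, dictionary keys and function name below are hypothetical examples written for this sketch (Python 3); the actual expressions live in wikiedits/wiki/__init__.py and may differ.

```python
import re

# Hypothetical per-language patterns for comments that indicate a revert.
REVERT_PATTERNS = {
    "english": re.compile(r"\b(revert(ed|ing)?|undid|rv[vt]?)\b", re.IGNORECASE),
    "polish": re.compile(r"\b(wycofan\w*|anulowan\w*)\b", re.IGNORECASE),
}


def is_revert(comment, language="english"):
    """Return True if the revision comment looks like a revert."""
    return bool(REVERT_PATTERNS[language].search(comment or ""))
```

Adding a language in this scheme would mean adding one more entry to the dictionary, alongside a matching punkt model and an entry in the list of supported languages.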

Currently supported languages: English, Polish.