Subject-verb order experiments

To use our scripts, clone this repository and then install the required libraries with

pip install -r requirements.txt

All relevant scripts have a help section, which you can display with the -h option, for instance:

python add_frequencies_to_df.py -h

Models

We use spaCy's recent (December 2021) state-of-the-art models, specifically the nl_udv25_dutchalpino_trf model, which is in part described here.

Before using our scripts, you should install it with the following command (or install from the requirements file):

python -m pip install https://huggingface.co/explosion/nl_udv25_dutchalpino_trf/resolve/main/nl_udv25_dutchalpino_trf-any-py3-none-any.whl

Data

In our research, we calculated frequencies on the SONAR corpus and limited ourselves to components that were written-to-be-read and published (WRP-). However, we excluded the WRPEA component, which contains data from discussion forums. Its data is riddled with non-standard, colloquial language, slang, and internet language, which not only falls outside the scope of our research objectives, but also makes the parser's job very difficult (and its results unpredictable).
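The component selection described above can be sketched as a small filter. This is a minimal illustration, not the code used in this repository: the function name `is_included_component` is hypothetical, and the sketch assumes component names follow a pattern such as `WR-P-E-A_discussion_lists`, so hyphens are stripped before matching against the WRP prefix.

```python
def is_included_component(name: str) -> bool:
    """Return True for written-to-be-read, published components, except WRPEA."""
    # Assumed naming scheme: "WR-P-E-A_discussion_lists" is reduced to "WRPEA".
    code = name.split("_", 1)[0].replace("-", "").upper()
    # Keep every WRP component, but skip the discussion forum data (WRPEA).
    return code.startswith("WRP") and code != "WRPEA"


# Hypothetical component names, used only to illustrate the filter.
components = ["WR-P-E-A_discussion_lists", "WR-P-P-B_books", "WR-U-E-A_chats"]
selected = [c for c in components if is_included_component(c)]
```

With these example names, only the books component is kept: the forum component is excluded explicitly, and the chat component is not a WRP component.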

Sentences shorter than three words (e.g. enumerations like "1 .") or longer than 32 words were excluded. The upper limit was imposed for computational feasibility.
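The length restriction can be illustrated with a short filter. This is a sketch under assumptions, not the repository's implementation: it assumes sentences are already tokenized and space-separated, and that punctuation tokens count as words (consistent with "1 ." being treated as shorter than three words). The names `MIN_WORDS`, `MAX_WORDS` and `has_valid_length` are hypothetical.

```python
MIN_WORDS = 3   # sentences shorter than this are excluded
MAX_WORDS = 32  # sentences longer than this are excluded


def has_valid_length(sentence: str) -> bool:
    """Check whether a space-tokenized sentence has between 3 and 32 words."""
    # Assumption: tokens are separated by whitespace, punctuation included.
    n_words = len(sentence.split())
    return MIN_WORDS <= n_words <= MAX_WORDS


# Illustrative sentences: an enumeration, a regular sentence, an overlong one.
sentences = ["1 .", "Morgen komt hij naar huis .", " ".join(["woord"] * 33)]
kept = [s for s in sentences if has_valid_length(s)]
```

Here only the second sentence survives: the first has two tokens and the third has 33.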
