Subject-verb order experiments

To use our scripts, clone this repository and then install the required libraries with

pip install -r requirements.txt

All relevant scripts have a help section, which you can call with the -h option, for instance

python add_frequencies_to_df.py -h

Models

We use spaCy's recent (December 2021) state-of-the-art models, specifically the nl_udv25_dutchalpino_trf model, which is described in part here.

Before using our scripts, you should install it with the following command (or install from the requirements file):

python -m pip install https://huggingface.co/explosion/nl_udv25_dutchalpino_trf/resolve/main/nl_udv25_dutchalpino_trf-any-py3-none-any.whl

Data

In our research, we calculated frequencies on the SONAR corpus and limited ourselves to components that were written-to-be-read and published (WRP-). However, we excluded the WRPEA component, which contains data from discussion forums. Its data is full of non-standard and colloquial language, slang, and internet language, which not only falls outside the scope of our research objectives but also makes the parser's job very difficult (and its results unpredictable).
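The component selection described above could be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function name `is_included_component` is hypothetical, and it assumes that component identifiers may be written with or without hyphens (e.g. WRPEA or WR-P-E-A). The example identifiers other than WRPEA are placeholders.

```python
def is_included_component(component: str) -> bool:
    """Return True for written-to-be-read, published components (WRP-),
    except the discussion-forum component WRPEA.

    Assumption: identifiers may contain hyphens, which are ignored here.
    """
    normalized = component.replace("-", "").upper()
    return normalized.startswith("WRP") and normalized != "WRPEA"


# Hypothetical list of component identifiers, for illustration only.
components = ["WRPEA", "WR-P-P-G", "WR-U-E-A"]
selected = [c for c in components if is_included_component(c)]
```

Here `selected` keeps only the identifier that starts with WRP and is not WRPEA.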

Sentences shorter than three words (e.g. enumerations like "1 .") or longer than 32 words were excluded. The latter restriction was imposed for computational feasibility.
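As a minimal sketch of this length filter, assuming that words are counted by splitting pre-tokenized text on whitespace (the actual scripts may count tokens differently), a sentence is kept only if it has between 3 and 32 words inclusive. The name `keep_sentence` and its parameters are hypothetical.

```python
def keep_sentence(sentence: str, min_words: int = 3, max_words: int = 32) -> bool:
    """Return True if the sentence has between min_words and max_words words.

    Assumption: the sentence is already tokenized, with tokens separated
    by whitespace, so splitting on whitespace yields the word count.
    """
    n_words = len(sentence.split())
    return min_words <= n_words <= max_words


sentences = [
    "1 .",                          # two tokens: too short, dropped
    "Morgen komt zij naar huis .",  # six tokens: kept
    " ".join(["woord"] * 33),       # 33 tokens: too long, dropped
]
kept = [s for s in sentences if keep_sentence(s)]
```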
