sv-order-2021


Commits

List of commits on branch main.
73e797d766291c1b74b736a23c5756aeb80d4df1

Create add_lemma_col.py

committed 3 years ago
ad69cbb22c591934d8e8e97ff7ed0c18a7eee8f7

Create add_frequencies_to_df.py

committed 3 years ago
60f80d44f68a7012cf77ef07d4c9ecb0b8830b78

save sents/toks processed and allow min/max

committed 3 years ago
16904ef9b039a988760cfce018ef16f8163f5a11

re-add torch/cupy check

committed 3 years ago
a8a22815bed0fb8a0a031edaa68d8fe3ce3d0018

use spawn as starting method to make cupy work in multiprocesses

committed 3 years ago
11cc212028367ce703df99a69264c8f3ae3252ad

move everything cupy-related to worker processes

committed 3 years ago

README


Subject-verb order experiments

To use our scripts, clone this repository and then install the required libraries with

pip install -r requirements.txt

All relevant scripts have a help section, which you can display with the -h option, for instance:

python add_frequencies_to_df.py -h

Models

We use the recent (December 2021) state-of-the-art models by spaCy, specifically the nl_udv25_dutchalpino_trf model, which is described in part here.

Before using our scripts, you should install it with the following command (or install from the requirements file):

python -m pip install https://huggingface.co/explosion/nl_udv25_dutchalpino_trf/resolve/main/nl_udv25_dutchalpino_trf-any-py3-none-any.whl

Data

In our research, we calculated frequencies on the SONAR corpus and limited ourselves to the components that were written to be read and published (WRP-). However, we excluded the WRPEA component, which contains data from discussion forums. That data is riddled with non-standard, colloquial, slang and internet language, which not only falls outside the scope of our research objectives but also makes the parser's job very difficult (and its results unpredictable).
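The component selection described above could be sketched roughly as follows. This is only an illustration, not the code used in this repository: the function name `is_included_component` and the example component names are hypothetical, and the sketch assumes that component identifiers may be written with or without hyphens (e.g. "WR-P-E-A" or "WRPEA"), so it strips hyphens before comparing.

```python
def is_included_component(name: str) -> bool:
    """Return True for written-to-be-read, published components, except WRPEA.

    Hypothetical helper: hyphens are removed first so that "WR-P-E-A" and
    "WRPEA" are treated as the same component (an assumption about naming).
    """
    normalized = name.replace("-", "").upper()
    return normalized.startswith("WRP") and normalized != "WRPEA"


# Example component names, chosen for illustration only.
components = ["WR-P-E-A", "WR-P-P-B", "WR-U-E-A", "WRPEA"]
selected = [c for c in components if is_included_component(c)]
print(selected)
```

Normalizing the name before the prefix check keeps the filter independent of how the corpus directories happen to be spelled on disk.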

Sentences shorter than three words (e.g. enumerations like "1 .") or longer than 32 words were excluded. The upper limit was imposed for computational feasibility.
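As a minimal sketch of this length restriction, the filter below keeps only sentences with between 3 and 32 words. It assumes that sentences are already tokenized and whitespace-separated (as in the "1 ." example); the function name `filter_by_length` and its parameters are hypothetical and not taken from the scripts in this repository.

```python
from typing import Iterable, Iterator


def filter_by_length(
    sentences: Iterable[str], min_tokens: int = 3, max_tokens: int = 32
) -> Iterator[str]:
    """Yield sentences whose whitespace-token count lies in [min_tokens, max_tokens]."""
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if min_tokens <= n_tokens <= max_tokens:
            yield sentence


examples = [
    "1 .",                   # 2 tokens: too short, dropped
    "De kat slaapt .",       # 4 tokens: kept
    " ".join(["w"] * 32),    # 32 tokens: kept (upper bound is inclusive)
    " ".join(["w"] * 33),    # 33 tokens: too long, dropped
]
kept = list(filter_by_length(examples))
print(len(kept))
```

Both bounds are inclusive here, matching "shorter than three" and "longer than 32" as the exclusion criteria.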