sv-order-2021


Commits

List of commits on branch main.
Unverified
0436a2fac7c2d247c5a7deb37b623b6ce8c121c6

fix tqdm to stick in one place

committed 3 years ago
Unverified
ccc07026986303e011efb404defc808596261782

Better calculation for n batches

committed 3 years ago
Verified
44712e6a9e89dd8f1e64e362902d3e798441c657

Merge pull request #1 from BramVanroy/multi-gpu

committed 3 years ago
Unverified
8c7dba7a09ba1bba904d035490bb4a83bbf30b56

Create extract_frequencies_from_corpus.py

committed 3 years ago
Unverified
9fd3e42be36628a21562107ac172c634292d8718

Delete extract_frequencies.py

committed 3 years ago
Unverified
62d288c81c095566d6941fb0b0b3f52fdaaf1fc3

Delete displacy.ipynb

committed 3 years ago

README

The README file for this repository.

Subject-verb order experiments

To use our scripts, clone this repository and then install the required libraries with:

pip install -r requirements.txt

All relevant scripts have a help section, which you can call with the -h option, for instance:

python add_frequencies_to_df.py -h

Models

We use spaCy's recent (December 2021) state-of-the-art models, specifically the nl_udv25_dutchalpino_trf model, which is in part described here.

Before using our scripts, you should install it with the following command (or install from the requirements file):

python -m pip install https://huggingface.co/explosion/nl_udv25_dutchalpino_trf/resolve/main/nl_udv25_dutchalpino_trf-any-py3-none-any.whl
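Once the model is installed, it can be loaded by name like any other spaCy pipeline. The snippet below is a minimal sketch and not part of our scripts: the helper names `model_is_installed` and `parse` are hypothetical, and the check via `importlib` is only there to give a clear error when the wheel above has not been installed yet.

```python
from importlib.util import find_spec

# Name of the spaCy pipeline installed from the wheel above.
MODEL_NAME = "nl_udv25_dutchalpino_trf"


def model_is_installed(name: str = MODEL_NAME) -> bool:
    """Return True if a package with this name can be imported."""
    return find_spec(name) is not None


def parse(text: str, model_name: str = MODEL_NAME):
    """Parse `text` and return (token, dependency label, head) triples.

    Hypothetical helper for illustration; it reloads the pipeline on
    every call, so real code would load the pipeline once and reuse it.
    """
    if not model_is_installed(model_name):
        raise RuntimeError(f"Model {model_name!r} is not installed.")

    import spacy  # imported lazily so the check above works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]
```

With the model installed, `parse("Morgen komt hij naar huis.")` returns one triple per token, from which the subject and verb positions can be read off.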

Data

In our research, we calculated frequencies on the SoNaR corpus and limited ourselves to components that were written-to-be-read and published (WRP-). However, we excluded the WRPEA component, which contains data from discussion forums. Its data is riddled with non-standard and colloquial language, slang, and internet language, which not only falls outside the scope of our research objectives but also makes the parser's job very difficult (and its results unpredictable).
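The component selection described above can be sketched as a simple name filter. This is an illustration under assumptions, not the code we ran: the function name `is_included_component` is hypothetical, and we assume component identifiers may be written with or without hyphens (for example `WR-P-E-A` or `WRPEA`), so hyphens are stripped before comparing.

```python
def is_included_component(name: str) -> bool:
    """Keep written-to-be-read, published (WRP-) components, except WRPEA.

    Assumption: identifiers may contain hyphens, which are ignored here.
    """
    normalized = name.replace("-", "").upper()
    if normalized.startswith("WRPEA"):
        # Discussion forum data is excluded.
        return False
    return normalized.startswith("WRP")


# Example identifiers, used only to illustrate the filter.
components = ["WR-P-P-G", "WR-P-E-A", "WR-U-E-A", "WS-U-E-A"]
selected = [c for c in components if is_included_component(c)]
```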

Sentences shorter than three words (e.g. enumerations like "1 .") or longer than 32 words were excluded. The latter restriction was imposed for computational feasibility.
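The length restriction amounts to keeping sentences of 3 to 32 words inclusive. The sketch below assumes that the text is already tokenized and that tokens are separated by whitespace, so punctuation counts as a word (as in "1 ."); the name `keep_sentence` is hypothetical and our scripts may count words differently.

```python
MIN_WORDS = 3   # sentences shorter than this are excluded
MAX_WORDS = 32  # sentences longer than this are excluded


def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence length falls within the allowed range.

    Assumption: tokens are whitespace-separated, punctuation included.
    """
    n_words = len(sentence.split())
    return MIN_WORDS <= n_words <= MAX_WORDS


sentences = ["1 .", "De kat slaapt ."]
kept = [s for s in sentences if keep_sentence(s)]
```

Here only the second sentence is kept, since "1 ." consists of two tokens.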