TR-NLP-workshop

Tr-NLP Workshop

Açık Seminer 2020 - Turkish NLP Seminar and Workshop

This repo contains the notebooks and slides for the Turkish Natural Language Processing workshop. The implemented modules are:

  • Text preprocessing
  • Named Entity Recognition with SpaCy
  • Unsupervised text clustering with K-Means
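
As a rough illustration of the text preprocessing module, here is a minimal sketch that uses only the Python standard library. The README does not list the exact steps the notebooks apply, so the pipeline below (Turkish-aware lowercasing, letter-only tokenization, stop-word removal) is an assumption, and `STOP_WORDS`, `turkish_lower` and `preprocess` are hypothetical names chosen for this example.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"ve", "bir", "bu", "de", "da", "ile", "için"}


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for the letters I and İ."""
    # str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so both letters are replaced explicitly before lowercasing the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(text: str) -> List[str]:
    """Lowercase, keep letter-only tokens, and drop stop words."""
    lowered = turkish_lower(text)
    # [^\W\d_]+ matches runs of Unicode letters, so ç, ğ, ı, ö, ş, ü are kept
    # while digits, punctuation and underscores act as separators.
    tokens = re.findall(r"[^\W\d_]+", lowered)
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("Galatasaray ve Fenerbahçe bu akşam İstanbul'da karşılaşıyor!")
print(tokens)
```

The explicit replacement step matters because Turkish distinguishes dotted and dotless i, which generic lowercasing does not handle.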

Dataset

TWNERTC (Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset) by Sahin et al. is used for Named Entity Recognition. The dataset contains approximately 300K named entities across 77 domains, with more than 1,000 fine-grained entity types. A subset of the dataset (the astronomy domain) is provided in the repo, and the full, cleaned version of the dataset in JSON format can be downloaded here.

JSON schema

[ {
    TOPIC_1: {
        SENTENCE_1: {
            "entities": [
                [
                    START_INDEX,
                    END_INDEX,
                    ENTITY_LABEL
                ], ...
            ]
        },
        SENTENCE_2: {...}
    },
    TOPIC_2: {...}
} ]
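
The schema above can be flattened into `(text, annotations)` pairs with a few lines of Python. This is a minimal sketch, not the loader used in the notebooks. It assumes that each `SENTENCE_n` key is the sentence text itself and that the indices are character offsets with an exclusive end. The function name `load_ner_examples`, the sample sentence and the label `celestial_object` are made up for illustration.

```python
import json


def load_ner_examples(json_text):
    """Flatten topic -> sentence -> entities into (text, annotations) pairs."""
    examples = []
    for topic_block in json.loads(json_text):               # outer list
        for _topic, sentences in topic_block.items():       # TOPIC_n
            for sentence, annotation in sentences.items():  # SENTENCE_n
                # Convert each [start, end, label] list into a tuple.
                spans = [(start, end, label)
                         for start, end, label in annotation["entities"]]
                examples.append((sentence, {"entities": spans}))
    return examples


# Hypothetical sample that follows the schema; not taken from TWNERTC.
sample = json.dumps([{
    "astronomy": {
        "Halley bir kuyruklu yıldızdır.": {
            "entities": [[0, 6, "celestial_object"]]
        }
    }
}], ensure_ascii=False)

examples = load_ner_examples(sample)
text, annotations = examples[0]
start, end, label = annotations["entities"][0]
print(text[start:end], label)
```

Slicing the sentence with the stored offsets is a quick sanity check that the annotations line up with the text before any training is attempted.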

A small Turkish news dataset crawled from various news websites is used for text clustering. It contains news articles in five categories (economy, arts, politics, sports, technology), with 100 samples per category.
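
The clustering step could be sketched as follows. The README does not state how the notebooks vectorize the text or which library they use, so TF-IDF features and scikit-learn are assumptions here. The four toy documents and the helper `cluster_texts` are invented for the example. With the news dataset described above, `n_clusters` would presumably be 5.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for this sketch: two about sports, two about economy.
docs = [
    "takım maçı kazandı gol attı",
    "takım maçı kaybetti gol yedi",
    "merkez bankası faiz oranını artırdı",
    "merkez bankası faiz oranını düşürdü",
]


def cluster_texts(texts, n_clusters, seed=0):
    """Vectorize texts with TF-IDF and assign each one to a K-Means cluster."""
    vectors = TfidfVectorizer().fit_transform(texts)
    # n_init=10 reruns K-Means from ten initializations and keeps the best fit.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(vectors)


labels = cluster_texts(docs, n_clusters=2)
print(labels)
```

Cluster ids are arbitrary, so only the grouping is meaningful: documents that share a label were placed in the same cluster.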

Notebooks

Clone the repo and install the requirements before running the notebooks:

git clone https://github.com/alaradirik/TR-NLP-workshop.git

cd TR-NLP-workshop

pip install -r requirements.txt