
Part-of-Speech-Tagger

A Part-of-Speech tagger implemented with a Support Vector Machine (SVM), a Hidden Markov Model (HMM) and a Bi-directional Long Short-Term Memory network (Bi-LSTM).

All results, a detailed error analysis, the strengths and weaknesses of the models, and references are provided in Report.pdf.

Support Vector Machine

  • Instructions for running the code

    SVM_PoS_tagger.ipynb contains the SVM implementation.

    Run the code from https://colab.research.google.com/drive/178a6M4J3lt-twzGr1Nv-Y2EliW77NEt9?usp=sharing

    It can also be run as a Python script via svm.py in the Support Vector Machine directory.

    No dependencies need to be installed when running from the Colab notebook. A simplified sketch of the approach is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | SVM   | 83.25             | 83.36              |
    • Feature Engineering

      | Features Selected                                                      | Accuracy (%) |
      |------------------------------------------------------------------------|--------------|
      | Word length, capitalisation, upper-case, lower-case, isNumeric         | 40           |
      | Prefix and suffix for nouns, verbs, adjectives and adverbs             | 55           |
      | Word stems using PorterStemmer                                         | 60           |
      | Tag of previous word                                                   | 65           |
      | Pre-trained GloVe word embeddings (100,000 50-dimensional vectors)     | 83           |
      | Pre-trained word2vec embeddings (100 million 300-dimensional vectors)  | 90           |
      | Including the features of the previous 3 words and following 3 words   | 95           |
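
  • Simplified sketch (illustrative)

    The snippet below is a minimal sketch of how a feature-based SVM tagger of this kind could be assembled with scikit-learn's DictVectorizer and LinearSVC. The feature functions, context window and data slice are simplified stand-ins for the features listed above; it is not the actual code in svm.py or SVM_PoS_tagger.ipynb.

    ```python
    # Illustrative sketch only (assumption): a feature-based SVM tagger built with
    # scikit-learn, not the actual code from svm.py / SVM_PoS_tagger.ipynb.
    import nltk
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    nltk.download('brown')
    nltk.download('universal_tagset')

    def word_features(words, i):
        """Hand-crafted features for the i-th word of a sentence (simplified)."""
        w = words[i]
        return {
            'word.lower': w.lower(),
            'word.istitle': w.istitle(),
            'word.isupper': w.isupper(),
            'word.isdigit': w.isdigit(),
            'word.len': len(w),
            'prefix3': w[:3],
            'suffix3': w[-3:],
            'prev.word': words[i - 1].lower() if i > 0 else '<s>',
            'next.word': words[i + 1].lower() if i < len(words) - 1 else '</s>',
        }

    # Build word-level training examples from a small slice of the Brown corpus.
    X, y = [], []
    for sent in nltk.corpus.brown.tagged_sents(tagset='universal')[:2000]:
        words = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            X.append(word_features(words, i))
            y.append(tag)

    # DictVectorizer one-hot encodes the feature dicts; LinearSVC is the classifier.
    model = Pipeline([('vec', DictVectorizer(sparse=True)), ('svm', LinearSVC())])
    split = int(0.8 * len(X))
    model.fit(X[:split], y[:split])
    print('held-out accuracy:', model.score(X[split:], y[split:]))
    ```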

Hidden Markov Model

  • Instructions for running the code

    The file main.ipynb contains the HMM implementation. Open it with Jupyter Notebook.

    The following packages need to be installed: nltk, pandas, seaborn, matplotlib, scikit-learn, tqdm

    Also make sure to run the following commands in the notebook:

    import nltk
    nltk.download('brown')
    nltk.download('universal_tagset')


    The Hidden Markov Model directory contains a copy of all the images, plots and data produced by main.ipynb. A simplified sketch of the HMM tagger is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | HMM   | 96.01             | 97.35              |
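
  • Simplified sketch (illustrative)

    As a rough sketch of the approach, a supervised bigram HMM tagger can be trained on the Brown corpus with nltk's built-in HiddenMarkovModelTrainer, which estimates transition and emission probabilities and decodes with Viterbi. The smoothing choice and corpus split here are assumptions for illustration; main.ipynb may differ in these details.

    ```python
    # Illustrative sketch only (assumption): a supervised bigram HMM tagger using
    # nltk's built-in trainer, not the actual code from main.ipynb.
    import nltk
    from nltk.probability import LidstoneProbDist
    from nltk.tag import hmm

    nltk.download('brown')
    nltk.download('universal_tagset')

    tagged_sents = list(nltk.corpus.brown.tagged_sents(tagset='universal'))
    split = int(0.9 * len(tagged_sents))
    train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

    # Estimates transition P(tag_i | tag_{i-1}) and emission P(word | tag) from the
    # training data; Lidstone smoothing avoids zero probabilities for rare events.
    trainer = hmm.HiddenMarkovModelTrainer()
    tagger = trainer.train_supervised(
        train_sents,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
    )

    # Tagging uses Viterbi decoding; evaluate on a small slice to keep the demo quick.
    print('test accuracy:', tagger.accuracy(test_sents[:500]))
    print(tagger.tag(['The', 'dog', 'barked', 'loudly']))
    ```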

Bi-LSTM

  • Baseline Model Architecture

    | Layer (type)          | Output Shape     | # Params   |
    |-----------------------|------------------|------------|
    | embedding (Embedding) | (None, 180, 300) | 14,944,800 |
    | bidirectional         | (None, 180, 64)  | 85,248     |
    | time_distributed      | (None, 180, 13)  | 845        |

    Total params: 15,030,893
    Trainable params: 86,093
    Non-trainable params: 14,944,800

  • Instructions for running the code

    BiLSTMBaseline.ipynb contains the implementation of the baseline Bi-LSTM model.

    Run the baseline code from https://colab.research.google.com/drive/1lhBd-gxsXNVeQJtBXoLZ7HABI5yYttrw?usp=sharing

    The CNN-based code is available in two formats:

    1. CNNBiLSTMCRF.ipynb, a Jupyter notebook, and
    2. main.py, which can be run as a Python script.

    The following packages need to be installed for the CNN-based code: pytorch, nltk, torchvision, numpy, seaborn, pandas, matplotlib, tqdm, scikit-learn

    Also download the GloVe embeddings from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip and extract them in the source directory.

    Execute the following commands for the CNN-based code:

    import nltk
    nltk.download('brown')
    nltk.download('universal_tagset')


    The results of running the CNN-based code are stored in the Bi-LSTM folder. A simplified Keras sketch of the baseline architecture is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model            | Test Accuracy (%) | Train Accuracy (%) |
      |------------------|-------------------|--------------------|
      | Bi-LSTM Baseline | 79.38             | 78.39              |
      | Bi-LSTM-CNN      | 87.15             | 87.19              |
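
  • Simplified sketch (illustrative)

    The snippet below is a minimal Keras sketch that reproduces the baseline layer shapes from the architecture table above (sequence length 180, frozen 300-dimensional embeddings, a 32-unit Bi-LSTM and a 13-tag softmax). The vocabulary size is inferred from the parameter count, and the zero-filled embedding matrix is a placeholder for the GloVe vectors; it is not the actual code in BiLSTMBaseline.ipynb.

    ```python
    # Illustrative sketch only (assumption): a Keras model matching the baseline
    # layer shapes above, not the actual code from BiLSTMBaseline.ipynb. The real
    # embedding matrix would be filled with the GloVe vectors mentioned earlier.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import Input, Sequential
    from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM, TimeDistributed

    MAX_LEN = 180        # padded sentence length
    VOCAB_SIZE = 49816   # inferred from 14,944,800 / 300 embedding parameters
    EMBED_DIM = 300      # GloVe vector size
    NUM_TAGS = 13        # universal tagset plus a padding tag

    # Placeholder embedding matrix; in practice load the GloVe vectors into it.
    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype='float32')

    model = Sequential([
        Input(shape=(MAX_LEN,)),
        Embedding(VOCAB_SIZE, EMBED_DIM,
                  embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                  trainable=False),                      # frozen -> non-trainable params
        Bidirectional(LSTM(32, return_sequences=True)),  # 2 x 32 units = 64-dim outputs
        TimeDistributed(Dense(NUM_TAGS, activation='softmax')),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()  # shapes and parameter counts should match the table above
    ```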

References

  1. Léon Bottou, "Une approche théorique de l'apprentissage connexionniste et applications à la reconnaissance de la parole" (A theoretical approach to connectionist learning with applications to speech recognition), PhD thesis (1991)
  2. Xuezhe Ma and Eduard Hovy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (2016)
  3. Jesús Giménez and Lluís Màrquez, "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited" (2003)
  4. Mathieu Blondel, Akinori Fujino and Naonori Ueda, "Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex" (2014)