bert-japanese-aozora

Japanese BERT trained on Aozora Bunko and Wikipedia

This is a repository of Japanese BERT trained on Aozora Bunko and Wikipedia.

Features

  • We provide models trained on Aozora Bunko. We used works written both in contemporary Japanese kana spelling and in classical Japanese kana spelling.
  • Models trained on Aozora Bunko and Wikipedia are also available.
  • We trained models by applying different pre-tokenization methods (MeCab with UniDic and SudachiPy).
  • All models are trained with the same configuration as bert-japanese, except for tokenization: bert-japanese uses a SentencePiece unigram language model without pre-tokenization.
  • We provide models with 2M training steps.

Pretrained models

If you want to use models with 🤗 Transformers, see Converting Tensorflow Checkpoints.

When you use a model, you have to pre-tokenize your datasets with the same morphological analyzer and dictionary that were used to pretrain it.

For fine-tuning tasks, you may need to modify the official BERT code or the Transformers code. The page BERT日本語Pretrainedモデル (Japanese BERT pretrained models) - KUROHASHI-KAWAHARA LAB is a helpful reference.

BERT-base

After pre-tokenization, texts are split into subwords with subword-nmt. The final vocabulary size is 32k.

Trained on Aozora Bunko (6M sentences)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Trained on Aozora Bunko (6M) and Japanese Wikipedia (1.5M)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Trained on Aozora Bunko (6M) and Japanese Wikipedia (3M)

Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603

Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603

Details of corpora

  • Aozora Bunko: Git repository as of 2019-04-21
    • git clone https://github.com/aozorabunko/aozorabunko and git checkout 1e3295f447ff9b82f60f4133636a73cf8998aeee.
    • We removed text files marked 作品著作権フラグ = あり (work copyright flag = yes) in index_pages/list_person_all_extended_utf8.zip.
  • Wikipedia (Japanese): XML dump as of 2018-12-20
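
The copyright filter on the Aozora Bunko index can be sketched as follows. This is a minimal illustration, not the script used for this repository: only the 作品著作権フラグ column and the value あり come from the description above, while the other columns, the sample rows, and the function name are hypothetical. Unzipping and reading the real index file is left out.

```python
import csv
import io

COPYRIGHT_COLUMN = "作品著作権フラグ"  # column name given in this README


def drop_copyrighted(rows):
    """Keep only the rows whose copyright flag is not あり."""
    return [row for row in rows if row[COPYRIGHT_COLUMN] != "あり"]


# Hypothetical excerpt: the real index has many more columns, and the
# 作品ID / 作品名 columns and the なし value are assumptions for this sketch.
SAMPLE_CSV = (
    "作品ID,作品名,作品著作権フラグ\n"
    "000001,Work A,なし\n"
    "000002,Work B,あり\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = drop_copyrighted(rows)
print([row["作品名"] for row in kept])
```

Only the rows that survive this filter would then be mapped to text files in the checked-out repository.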

Details of pretraining

Pre-tokenization

For each document, we identify the kana spelling method and then pre-tokenize the document using a morphological analyzer with the dictionary associated with that spelling: unidic-cwj or SudachiDict-core for contemporary kana spelling, and unidic-qkana for classical kana spelling.
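
As a rough sketch of this selection rule, the function below maps a model variant and a document's kana spelling to an analyzer and dictionary. The analyzer and dictionary names come from this README; the function name and the string labels ("mecab", "sudachipy", "contemporary", "classical") are hypothetical and chosen only for illustration.

```python
def choose_analyzer(variant, spelling):
    """Return (analyzer, dictionary) for a model variant and a kana spelling.

    variant:  "mecab" or "sudachipy" (hypothetical labels for the two model families)
    spelling: "contemporary" or "classical" (hypothetical labels)
    """
    if variant not in ("mecab", "sudachipy"):
        raise ValueError(f"unknown variant: {variant!r}")
    if spelling == "classical":
        # Both model families use MeCab with unidic-qkana for classical kana spelling.
        return ("mecab", "unidic-qkana")
    if spelling == "contemporary":
        if variant == "mecab":
            return ("mecab", "unidic-cwj")
        return ("sudachipy", "SudachiDict-core")
    raise ValueError(f"unknown spelling: {spelling!r}")


print(choose_analyzer("sudachipy", "contemporary"))
print(choose_analyzer("sudachipy", "classical"))
```

The same choice has to be repeated at fine-tuning time so that the input is segmented the way the pretraining data was.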

With SudachiPy, we use split mode A ($ sudachipy -m A -a file) because it is equivalent to the short unit word (SUW) in UniDic, and unidic-cwj and unidic-qkana provide only SUW segmentation.
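
The command above prints morphological details for every token, so the surface forms still have to be extracted to get whitespace-separated text. The sketch below shows one way to do that, assuming the output has one tab-separated morpheme per line with the surface form first and an EOS line after each sentence. The sample output is hand-written and shortened to three fields, and the function name is hypothetical.

```python
def surfaces_from_sudachipy_output(output):
    """Convert SudachiPy CLI output into one space-joined string per sentence."""
    sentences = []
    tokens = []
    for line in output.splitlines():
        if line == "EOS":
            # EOS closes the current sentence.
            sentences.append(" ".join(tokens))
            tokens = []
        elif line:
            # The surface form is the first tab-separated field.
            tokens.append(line.split("\t", 1)[0])
    if tokens:
        # Tolerate output that lacks a final EOS line.
        sentences.append(" ".join(tokens))
    return sentences


# Hand-written sample in the assumed format (not captured from a real run).
SAMPLE_OUTPUT = (
    "国家\t名詞,普通名詞,一般,*,*,*\t国家\n"
    "公務\t名詞,普通名詞,一般,*,*,*\t公務\n"
    "員\t接尾辞,名詞的,一般,*,*,*\t員\n"
    "EOS\n"
)

print(surfaces_from_sudachipy_output(SAMPLE_OUTPUT))
```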

After pre-tokenization, we concatenate the Aozora Bunko texts with randomly sampled Wikipedia texts (or use Aozora Bunko alone) and build the vocabulary with subword-nmt.
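
The concatenation step can be sketched as below. This is an illustration under assumptions rather than the actual script: it assumes sampling is done per sentence without replacement, and the function name, the seed, and the toy data are hypothetical. Running subword-nmt on the result is not shown.

```python
import random


def build_vocab_corpus(aozora_sentences, wikipedia_sentences, n_wikipedia, seed=0):
    """Return all Aozora sentences followed by a random sample of Wikipedia sentences."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(list(wikipedia_sentences), n_wikipedia)
    return list(aozora_sentences) + sampled


# Toy pre-tokenized sentences standing in for the real corpora.
aozora = ["吾輩 は 猫 で ある", "名前 は まだ 無い"]
wiki = ["東京 は 日本 の 首都", "富士 山 は 高い", "猫 は 動物"]

corpus = build_vocab_corpus(aozora, wiki, n_wikipedia=2)
print(len(corpus))
```

Passing n_wikipedia=0 gives the Aozora-only corpus used for the models trained without Wikipedia.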

Identifying kana spelling

Wikipedia

We assume that contemporary kana spelling is used.

Aozora Bunko

index_pages/list_person_all_extended_utf8.zip has a 文字遣い種別 (character usage type) column that records the kanji form (旧字, old, or 新字, new) and the kana spelling (旧仮名, classical, or 新仮名, contemporary). We use only the kana spelling information.
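
A minimal sketch of reading the kana spelling from that column is shown below. It assumes each value contains either 旧仮名 or 新仮名 as a substring (for example 新字旧仮名); the function name, the returned labels, and the example values are hypothetical.

```python
def kana_spelling(moji_type):
    """Map a 文字遣い種別 value to a kana spelling label, ignoring the kanji part."""
    if "旧仮名" in moji_type:
        return "classical"
    if "新仮名" in moji_type:
        return "contemporary"
    return None  # value carries no kana spelling information


for value in ["新字新仮名", "新字旧仮名", "旧字旧仮名"]:
    print(value, kana_spelling(value))
```

Rows for which the function returns None would need separate handling, which this sketch does not cover.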