GitXplorerGitXplorer
h

hanzi_char_featurizer

public
287 stars
56 forks
3 issues

Commits

List of commits on branch master.
Unverified
023bab252ed7b45136e14875a85abfcc2ccac276

fix: syntax error

hhowl-anderson committed 5 years ago
Unverified
5d68854f06b47dd71a24606cef51d53bf3222f20

build: add helper

hhowl-anderson committed 5 years ago
Unverified
e8038a771f4e57f3ca050d464696b3ce8dc11898

fix: missing dependency

hhowl-anderson committed 5 years ago
Verified
54d14b40e6df570bd0aeacb38071d2b7b7f97a90

Update README.md

hhowl-anderson committed 5 years ago
Verified
a044ede11123aefe6105c474f4105974e33e3a9e

Update README.md

hhowl-anderson committed 5 years ago
Unverified
4e1b81ea0f00e66746b5aa9ac51cc7fbd401d5d2

Update image

hhowl-anderson committed 6 years ago

README

The README file for this repository.

汉字字符特征提取器(featurizer)

在深度学习中,很多场合需要提取汉字的特征(发音特征、字形特征)。本项目提供了一个通用的字符特征提取框架,并内建了 拼音字形(四角编码) 和 部首拆解 的特征。

特征提取器

  • 拼音特征提取器:提取汉字的拼音作为特征,发音相似的字在编码上应该相似。示例: -> ->
  • 字形(四角编码)提取器:提取中文的外形作为特征,相似的汉字在编码上应该相近。示例: -> 37001 -> 37101
  • 部首拆解提取器:提取汉字的偏旁部首拆解作为特征,相似的汉字在编码上应该相近。示例: -> ['门', '一'] -> ['门', '三']

使用

from hanzi_char_featurizer import Featurizor

featurizor = Featurizor()
result = featurizor.featurize('明天')
print(result)

输出

([['m'], ['t']], [['ing'], ['ian']], [['2'], ['1']], ('6', '1'), ('7', '0'), ('0', '8'), ('2', '0'), ('0', '4'))

结构解析

输出到 TensorFlow 作为 Tensor

import tensorflow as tf

import hanzi_char_featurizer

feature = hanzi_char_featurizer.featurize_as_tensor('./usage/data.txt')

with tf.Session() as sess:
    sess.run(tf.initializers.tables_initializer())
    for _ in range(1):
        print('+' * 20)
        data = sess.run(feature)
        print(data)

输出

++++++++++++++++++++
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

在使用 hanzi_char_featurizer 的公司列表



TODO

  • 增加 Unicode 的 IDS 表征,来自 爱奇艺 FASPell 模型