南京大学学报(自然科学版) (Journal of Nanjing University (Natural Science)), 2023, Vol. 59, Issue 2: 302-312. doi: 10.13232/j.cnki.jnju.2023.02.013


基于ChineseBert的中文拼写纠错方法

崔凡, 强继朋(), 朱毅, 李云   

  1. 扬州大学信息工程学院,扬州,225127
  • 收稿日期:2022-11-14 出版日期:2023-03-31 发布日期:2023-04-07
  • 通讯作者: 强继朋 E-mail:jpqiang@yzu.edu.cn
  • 基金资助 (Supported by):
    National Natural Science Foundation of China (62076217); the "Qinglan Project" of Yangzhou University

Chinese spelling correction method based on ChineseBert

Fan Cui, Jipeng Qiang(), Yi Zhu, Yun Li   

  1. School of Information Engineering,Yangzhou University,Yangzhou,225127,China
  • Received:2022-11-14 Online:2023-03-31 Published:2023-04-07
  • Contact: Jipeng Qiang E-mail:jpqiang@yzu.edu.cn

摘要 (Abstract, translated from Chinese):

Chinese spelling errors mainly concentrate on two kinds of similarity: pinyin (pronunciation) similarity and glyph (shape) similarity, yet general pretrained language models consider only the semantic information of the text and ignore the pinyin and glyph features of Chinese. The latest Chinese Spelling Correction (CSC) methods build additional networks on top of a pretrained model to incorporate pinyin and glyph features, but compared with directly fine-tuning the pretrained model, these improved models do not raise performance significantly, because the pinyin and glyph features, trained on small-scale spelling-task corpora, suffer a severe information asymmetry against the rich semantic features the pretrained model acquires. This paper applies the multimodal pretrained language model ChineseBert to the CSC problem: since ChineseBert injects pinyin and glyph information already at the pretraining stage, a ChineseBert-based CSC method requires no additional networks and also resolves the information asymmetry. Because CSC methods based on pretrained models generally handle consecutive errors poorly, the paper further proposes the SepSpell method: a probing network first detects potentially erroneous characters; for those characters the pinyin and glyph features are retained while the corresponding semantic information is masked for prediction, which reduces the interference of erroneous characters during prediction and handles consecutive errors better. Both proposed methods achieve very good results on three official evaluation datasets.

关键词 (Keywords): Chinese spelling correction, Bert, ChineseBert, multimodal language model

Abstract:

Chinese spelling errors mainly involve two kinds of similarity: phonetic and glyph. General pretrained language models consider only the semantic information of the text and ignore Chinese phonetic and glyph features. The latest Chinese Spelling Correction (CSC) methods incorporate pinyin and glyph features via additional networks built on top of pretrained language models. Compared with directly fine-tuning the pretrained model, however, these improved models do not significantly raise CSC performance, because the phonetic and glyph features, trained on small-scale spelling-task corpora, suffer a serious information asymmetry against the rich semantic features obtained during pretraining. To resolve this information asymmetry, this paper applies the multimodal pretrained language model ChineseBert to the CSC problem. Since ChineseBert incorporates phonetic and glyph information at the pretraining stage, CSC based on ChineseBert neither requires additional networks nor suffers the information asymmetry. Furthermore, CSC methods based on pretrained models generally cannot handle consecutive errors well. We therefore propose a novel method, SepSpell, which first uses a probing network to detect potentially incorrect characters, then preserves the phonetic and glyph features of those characters while masking their semantic information for prediction. SepSpell reduces the interference caused by incorrect characters during prediction and thus handles consecutive errors better. Evaluation on three official benchmark datasets shows that both methods achieve very good results.
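The two-stage procedure described above can be sketched in plain Python. Everything here (the `detect` and `build_correction_input` functions, the threshold name `err`, the toy confusion list) is a hypothetical illustration of the pipeline, not the authors' implementation: a probing network scores each character, and characters scoring above the threshold keep their pinyin/glyph features while their semantic (character-identity) token is masked before the correction pass.

```python
from typing import List, Tuple

MASK = "[MASK]"

def detect(sentence: str) -> List[float]:
    """Stand-in for the probing network: returns a per-character error
    probability. Here it is faked with a tiny confusion list."""
    suspicious = set("郎凭田")  # characters the toy detector distrusts
    return [0.9 if ch in suspicious else 0.1 for ch in sentence]

def build_correction_input(sentence: str, err: float = 0.5) -> List[Tuple[str, str]]:
    """For each character, always keep the pinyin/glyph source, but mask the
    semantic token whenever the detector's score exceeds the threshold err."""
    scores = detect(sentence)
    inputs = []
    for ch, p in zip(sentence, scores):
        semantic = MASK if p > err else ch   # mask the identity, not the position
        inputs.append((semantic, ch))        # (semantic token, pinyin/glyph source)
    return inputs

# The suspicious character 郎 is masked semantically, yet still supplies
# its pinyin/glyph features to the correction model.
pairs = build_correction_input("开郎地度过")
```

Because the erroneous character's identity never reaches the correction pass, one wrong character cannot mislead the prediction of its neighbor, which is how SepSpell targets consecutive errors.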

Key words: Chinese Spelling Correction, Bert, ChineseBert, multimodal pretrained language model

CLC number (中图分类号): TP391.1

Figure 1

Comparison of CSC model architectures: (a) existing methods add extra phonetic and visual extraction networks to obtain multimodal character information; (b) Chinese spelling correction using only a multimodal pretrained model

Figure 2

Framework of the ChineseBert model

Figure 3

Framework of the SepSpell method

Table 1

Statistics of the datasets used in the experiments

| Train Set | #Sent | Avg. Length | #Errors |
| --- | --- | --- | --- |
| (Wang) | 271329 | 44.4 | 271329 |
| SIGHAN2013 | 700 | 49.2 | 350 |
| SIGHAN2014 | 3435 | 49.7 | 3432 |
| SIGHAN2015 | 2339 | 30.0 | 2339 |

| Test Set | #Sent | Avg. Length | #Errors |
| --- | --- | --- | --- |
| SIGHAN2013 | 1000 | 74.1 | 996 |
| SIGHAN2014 | 1062 | 50.1 | 529 |
| SIGHAN2015 | 1100 | 30.5 | 550 |

Table 2

Experimental results of each algorithm on the SIGHAN2013, SIGHAN2014, and SIGHAN2015 test sets

All values are percentages; "(-)" marks numbers not reported for that method. Character-level columns give P/R/F; sentence-level columns give Acc/P/R/F.

SIGHAN2013

| Method | Char. Detection (P/R/F) | Char. Correction (P/R/F) | Sent. Detection (Acc/P/R/F) | Sent. Correction (Acc/P/R/F) |
| --- | --- | --- | --- | --- |
| SpellGCN | 82.6/88.9/85.7 | 98.4/88.4/93.1 | (-)/80.1/74.4/77.2 | (-)/78.3/72.7/75.4 |
| REALISE | (-) | (-) | 82.7/88.6/82.5/85.4 | 81.4/87.2/81.2/84.1 |
| DCN | (-) | (-) | (-)/86.8/79.6/83.0 | (-)/84.7/77.7/81.0 |
| Roberta | 80.5/88.0/84.1 | 98.0/86.5/91.9 | 77.3/85.1/76.9/80.8 | 75.6/83.6/76.0/79.6 |
| ChineseBert | 79.4/91.2/84.9 | 98.1/95.3/96.7 | 81.4/85.6/81.3/83.4 | 80.0/84.1/79.9/81.9 |
| SepSpell | 78.9/91.4/84.7 | 98.4/95.4/96.9 | 83.9/88.5/84.0/86.2 | 82.7/87.2/82.8/84.9 |

SIGHAN2014

| Method | Char. Detection (P/R/F) | Char. Correction (P/R/F) | Sent. Detection (Acc/P/R/F) | Sent. Correction (Acc/P/R/F) |
| --- | --- | --- | --- | --- |
| SpellGCN | 83.6/78.6/81.0 | 97.2/76.4/85.5 | (-)/65.1/69.5/67.2 | (-)/63.1/67.2/65.3 |
| REALISE | (-) | (-) | 78.4/67.8/71.5/69.6 | 77.7/66.3/70.0/68.1 |
| DCN | (-) | (-) | (-)/67.4/70.4/68.9 | (-)/65.8/68.7/67.2 |
| Roberta | 82.6/78.0/80.2 | 96.9/75.9/85.1 | 74.1/61.2/67.3/64.1 | 73.6/60.3/66.4/63.2 |
| ChineseBert | 80.3/79.4/79.8 | 97.1/88.4/92.5 | 77.1/66.0/68.1/67.1 | 76.4/64.6/66.5/65.5 |
| SepSpell | 79.9/79.6/79.8 | 98.0/89.2/93.4 | 78.3/67.2/71.2/69.1 | 77.5/65.5/69.4/67.4 |

SIGHAN2015

| Method | Char. Detection (P/R/F) | Char. Correction (P/R/F) | Sent. Detection (Acc/P/R/F) | Sent. Correction (Acc/P/R/F) |
| --- | --- | --- | --- | --- |
| SpellGCN | 88.9/87.7/88.3 | 95.7/83.9/89.4 | (-)/74.8/80.7/77.7 | (-)/72.1/77.7/75.9 |
| REALISE | (-) | (-) | 84.7/77.3/81.3/79.3 | 84.0/75.9/79.9/77.8 |
| DCN | (-) | (-) | (-)/77.1/80.9/79.0 | (-)/74.5/78.2/76.3 |
| Roberta | 86.9/87.3/87.1 | 95.1/82.0/88.1 | 82.9/73.2/80.4/76.7 | 81.7/71.0/78.0/74.5 |
| ChineseBert | 87.5/87.6/87.5 | 96.1/92.1/94.0 | 84.9/77.1/81.3/79.1 | 83.8/75.0/79.1/77.0 |
| SepSpell | 87.0/86.5/86.7 | 97.3/92.4/94.8 | 86.6/81.7/80.6/81.1 | 85.6/79.6/78.6/79.1 |
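Sentence-level detection/correction precision, recall, and F1 like those reported above are conventionally computed by treating a sentence as a detection hit when the predicted error positions exactly match the gold ones, and as a correction hit when the fully corrected sentence matches the gold sentence. The sketch below follows that common CSC convention as an assumption; it is not the paper's exact evaluation script, and it assumes substitution-only errors (equal-length sentences):

```python
def sentence_metrics(srcs, golds, preds):
    """Sentence-level detection/correction P, R, F1 for spelling correction.
    Assumes src, gold, and pred sentences have equal length (substitutions
    only). Detection credits an exact match of error positions; correction
    additionally requires the predicted sentence to equal the gold one."""
    det_tp = cor_tp = pred_pos = gold_pos = 0
    for src, gold, pred in zip(srcs, golds, preds):
        gold_err = {i for i, (a, b) in enumerate(zip(src, gold)) if a != b}
        pred_err = {i for i, (a, b) in enumerate(zip(src, pred)) if a != b}
        if pred_err:
            pred_pos += 1          # model flagged this sentence
        if gold_err:
            gold_pos += 1          # sentence actually contains errors
            if pred_err == gold_err:
                det_tp += 1
                if pred == gold:
                    cor_tp += 1

    def prf(tp):
        p = tp / pred_pos if pred_pos else 0.0
        r = tp / gold_pos if gold_pos else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    return prf(det_tp), prf(cor_tp)

# Toy check: one sentence fully corrected, one missed entirely.
det, cor = sentence_metrics(
    srcs=["开郎地度过", "普京人为"],
    golds=["开朗地度过", "普京认为"],
    preds=["开朗地度过", "普京人为"],
)
```

In this toy run the model flags only the first sentence and corrects it exactly, so precision is perfect while recall is halved, which mirrors how cautious detectors trade recall for precision in the tables.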

Table 3

Performance on SIGHAN2015 evaluated with the official tool

| Method | Detection Acc | P | R | F | Correction Acc | P | R | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SpellGCN | 83.7% | 85.9% | 80.6% | 83.1% | 82.2% | 85.4% | 77.6% | 81.3% |
| DCN | 84.6% | 88.0% | 80.2% | 83.9% | 83.2% | 87.6% | 77.3% | 82.1% |
| Roberta | 82.9% | 84.1% | 80.4% | 82.2% | 81.7% | 83.7% | 78.0% | 80.8% |
| ChineseBert | 84.9% | 87.1% | 81.3% | 84.1% | 83.8% | 86.8% | 79.1% | 82.8% |
| SepSpell | 86.6% | 91.2% | 80.6% | 85.6% | 85.6% | 91.0% | 78.6% | 84.3% |

Figure 4

Effect of the threshold Err on the detection model's performance

Figure 5

Effect of the threshold Err on the correction model's performance

Table 4

Case analysis

Pronunciation: píng → píng
Src: …就主观立场来凭断任何一句人事物,….
Roberta: …就主观立场来判断任何一句人事物,….
ChineseBert: …就主观立场来评断任何一句人事物,….
SepSpell: …就主观立场来评断任何一句人事物,….

Shape: 田 → 由
Src: …的文章田来经常都是下了很大的功夫…
Roberta: …的文章出来经常都是下了很大的功夫…
ChineseBert: …的文章由来经常都是下了很大的功夫…
SepSpell: …的文章由来经常都是下了很大的功夫…

Both: láng → lǎng; 郎 → 朗
Src: …不好的事情也要开郎地度过…
Roberta: …不好的事情也要开心地度过…
ChineseBert: …好的事情也要开朗地度过…
SepSpell: …不好的事情也要开朗地度过…

Table 5

Examples from the consecutive-error dataset

几折1日从广东省湛江市中级人民法院获悉,… (consecutive error: 几折 should be 记者, "reporter")
普京人为,针对平民的恐怖主义行为没有任何道理… (人为 should be 认为, "believes")

Table 6

Performance of the SepSpell method on consecutive errors

| Method | Detection Acc | P | R | F | Correction Acc | P | R | F |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Roberta | 87.3% | 91.3% | 87.3% | 89.3% | 72.7% | 76.0% | 72.7% | 74.3% |
| ChineseBert | 85.2% | 88.8% | 85.2% | 86.9% | 75.5% | 78.6% | 75.5% | 77.0% |
| SepSpell | 88.8% | 92.8% | 88.8% | 90.8% | 81.6% | 85.2% | 81.6% | 83.4% |

Table 7

Inference speed of the models (ms per sentence)

| Test Set | FASPell | Roberta | ChineseBert | SepSpell |
| --- | --- | --- | --- | --- |
| SIGHAN2013 | 446 | 160.3 | 164.2 | 335.0 |
| SIGHAN2014 | 284 | 132.5 | 138.0 | 266.2 |
| SIGHAN2015 | 177 | 81.8 | 83.7 | 179.3 |
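Per-sentence latency figures like those above can be gathered with a simple wall-clock loop. The `ms_per_sentence` helper below is a generic measurement sketch, not the paper's benchmarking code; `correct` stands in for any of the models, and the warm-up pass plus averaging over repeats are the usual precautions against cold-start and timer noise:

```python
import time

def ms_per_sentence(correct, sentences, repeats=3):
    """Average wall-clock milliseconds per sentence for a correction function."""
    correct(sentences[0])              # warm-up (caches, lazy initialization)
    start = time.perf_counter()
    for _ in range(repeats):
        for s in sentences:
            correct(s)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(sentences))

# Example with a dummy "model" that returns the input unchanged.
latency = ms_per_sentence(lambda s: s, ["开郎地度过", "文章由来"])
```

Note that SepSpell runs detection and correction as two passes, which is consistent with its roughly doubled latency relative to ChineseBert in the table.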
1 Gao J F, Li X L, Micol D, et al. A large scale ranker-based system for search query spelling correction∥Proceedings of the 23rd International Conference on Computational Linguistics. Beijing, China: Association for Computational Linguistics, 2010: 358-366.
2 Hong Y Z, Yu X G, He N, et al. FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm∥Proceedings of the 5th Workshop on Noisy User-generated Text. Hong Kong, China: Association for Computational Linguistics, 2019: 160-169.
3 Burstein J, Chodorow M. Automated essay scoring for nonnative English speakers∥Proceedings of a Symposium on Computer Mediated Language Assessment and Evaluation in Natural Language Processing. College Park, MD, USA: Association for Computational Linguistics, 1999: 68-75.
4 Xie W J, Huang P J, Zhang X R, et al. Chinese spelling check system based on N-gram model∥Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing. Beijing, China: Association for Computational Linguistics, 2015: 128-136.
5 Yeh J F, Li S F, Wu M R, et al. Chinese word spelling correction based on n-gram ranked inverted index list∥Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. Nagoya, Japan: Asian Federation of Natural Language Processing, 2013: 43-48.
6 Yu J J, Li Z H. Chinese spelling error detection and correction based on language model, pronunciation, and shape∥Proceedings of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing. Wuhan, China: Association for Computational Linguistics, 2014: 220-223.
7 Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding∥Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, MN, USA: Association for Computational Linguistics, 2019: 4171-4186.
8 Cui Y M, Che W X, Liu T, et al. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
9 Liu C L, Lai M H, Tien K W, et al. Visually and phonologically similar characters in incorrect Chinese words: Analyses, identification, and applications. ACM Transactions on Asian Language Information Processing, 2011, 10(2): 10.
10 Cheng X Y, Xu W D, Chen K L, et al. SpellGCN: Incorporating phonological and visual similarities into language models for Chinese spelling check∥Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Seattle, WA, USA: Association for Computational Linguistics, 2020: 871-881.
11 Xu H D, Li Z L, Zhou Q Y, et al. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking∥Findings of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, 2021: 716-728.
12 Huang L, Li J J, Jiang W W, et al. PHMOSpell: Phonological and morphological knowledge guided Chinese spelling check∥Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2021: 5958-5967.
13 Sun Z J, Li X Y, Sun X F, et al. ChineseBERT: Chinese pretraining enhanced by glyph and pinyin information∥Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2021: 2065-2075.
14 Wang B X, Che W X, Wu D Y, et al. Dynamic connected networks for Chinese spelling check∥Findings of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, 2021: 2437-2446.
15 Clark K, Luong M T, Le Q V, et al. Electra: Pre-training text encoders as discriminators rather than generators∥The 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: OpenReview.net, 2020.
16 Wu S H, Liu C L, Lee L H. Chinese spelling check evaluation at SIGHAN bake-off 2013∥Proceedings of the 7th SIGHAN Workshop on Chinese Language Processing. Nagoya, Japan: Asian Federation of Natural Language Processing, 2013: 35-42.
17 Yu L C, Lee L H, Tseng Y H, et al. Overview of SIGHAN 2014 bake-off for Chinese spelling check∥Proceedings of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing. Wuhan, China: Association for Computational Linguistics, 2014: 126-132.
18 Tseng Y H, Lee L H, Chang L P, et al. Introduction to SIGHAN 2015 bake-off for Chinese spelling check∥Proceedings of the 8th SIGHAN Workshop on Chinese Language Processing. Beijing, China: Association for Computational Linguistics, 2015: 32-37.
19 Rao G Q, Gong Q, Zhang B L, et al. Overview of NLPTEA-2018 share task Chinese grammatical error diagnosis∥Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications. Melbourne, Australia: Association for Computational Linguistics, 2018: 42-51.
20 Fu R J, Pei Z Q, Gong J F, et al. Chinese grammatical error diagnosis using statistical and prior knowledge driven features with probabilistic ensemble enhancement∥Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications. Melbourne, Australia: Association for Computational Linguistics, 2018: 52-59.
21 Wang D M, Song Y, Li J, et al. A hybrid approach to automatic corpus generation for Chinese spelling check∥Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018: 2517-2527.
22 Wang D M, Tay Y, Zhong L. Confusionset-guided pointer networks for Chinese spelling check∥Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019: 5780-5785.
23 Zhang S H, Huang H R, Liu J C, et al. Spelling error correction with soft-masked BERT∥Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Seattle, WA, USA: Association for Computational Linguistics, 2020: 882-890.
24 Guo Z, Ni Y, Wang K Q, et al. Global attention decoder for Chinese spelling error correction∥Findings of the Association for Computational Linguistics. Bangkok, Thailand: Association for Computational Linguistics, 2021: 1419-1428.
25 Wang S, Shang L. Improve Chinese spelling check by reevaluation∥The 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2022: 237-248.