Journal of Nanjing University (Natural Science), 2019, Vol. 55, Issue 6: 942–951. doi: 10.13232/j.cnki.jnju.2019.06.007


Named entity recognition from Chinese medical records based on deep learning and multi-feature fusion

Pu Han (1, 2), Yizhuo Liu (3), Xiaoyan Li (4)

  1. School of Management, Nanjing University of Posts & Telecommunications, Nanjing, 210023, China
  2. Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing University, Nanjing, 210023, China
  3. School of Information Management, Nanjing University, Nanjing, 210023, China
  4. School of Basic Medical Sciences, Nanjing Medical University, Nanjing, 211166, China
  • Received: 2019-08-13 Online: 2019-11-30 Published: 2019-11-29
  • Contact: Pu Han E-mail: hanpu@njupt.edu.cn
  • Supported by: National Social Science Fund of China (17CTQ022)

Abstract:

Named entity recognition in electronic medical records is a key foundational task for medical artificial intelligence and medical information services. To exploit the semantic knowledge of entities in electronic medical records more fully and so improve Chinese medical entity recognition, a BiLSTM-withfea model is proposed: a BiLSTM (Bidirectional Long Short-Term Memory) model that incorporates external semantic features. First, the model uses word2vec to generate character-level vectors with semantic features from large-scale unlabeled medical text. Second, a medical entity and feature library is built by integrating existing medical semantic resources and analyzing the boundary characteristics of different entity types. The library is then converted into vectors and concatenated with the character-level vectors to better exploit sequence information. Finally, an improved Voting algorithm integrates the deep learning output with the output of a CRF (Conditional Random Fields) model to correct label bias. Experiments show that the F-measure of the improved model with external semantic features reaches 94.06%, which is 1.55 percentage points higher than CRF. The parameter settings that give the model its best performance are also reported.

Key words: medical entity recognition, deep learning, electronic medical records, artificial neural network

CLC Number: TP391.1

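Figure 1. Recurrent neural network

Figure 2. RNN neuron

Figure 3. Bidirectional recurrent neural network

Figure 4. LSTM neuron

Figure 5. Improved BiLSTM-CRF model

Figure 6. Overall experimental workflow

Figure 7. Skip-Gram model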

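Table 1. Chinese medical entity and feature library

Category                        Count  Examples
Tissues and organs              349    直肠 (rectum), 甲状腺 (thyroid gland), 双侧股动脉 (bilateral femoral arteries)
Medicines and surgery suffixes  32062  脉管复康片, 锋泰灵 (drug names); 全切术 (total resection), 固定术 (fixation), 切除 (excision)
Disease suffixes                175    病 (disease), 癌 (cancer)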

Figure 8. Feature concatenation workflow
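The feature concatenation step can be illustrated with a minimal sketch. It assumes, hypothetically, that every character already has an embedding vector and that lexicon matches are encoded as one binary flag per lexicon category. The actual feature encoding is not given on this page, and the names `LEXICON`, `lexicon_flags`, and `concat_features` are illustrative only.

```python
from typing import Dict, List

# Hypothetical mini-lexicon using a few of the examples listed in Table 1.
LEXICON: Dict[str, List[str]] = {
    "organ": ["直肠", "甲状腺"],
    "disease_suffix": ["病", "癌"],
}
CATEGORIES = sorted(LEXICON)  # fixed column order: ["disease_suffix", "organ"]


def lexicon_flags(text: str) -> List[List[float]]:
    """Return, for each character, one 0/1 flag per lexicon category."""
    flags = [[0.0] * len(CATEGORIES) for _ in text]
    for col, category in enumerate(CATEGORIES):
        for term in LEXICON[category]:
            start = text.find(term)
            while start != -1:
                # Mark every character covered by this lexicon match.
                for i in range(start, start + len(term)):
                    flags[i][col] = 1.0
                start = text.find(term, start + 1)
    return flags


def concat_features(text: str, embeddings: Dict[str, List[float]]) -> List[List[float]]:
    """Append the lexicon flags to each character's embedding vector."""
    return [embeddings[ch] + flag for ch, flag in zip(text, lexicon_flags(text))]


# Toy 2-dimensional embeddings; real ones would come from word2vec.
toy_embeddings = {ch: [0.1 * i, -0.1 * i] for i, ch in enumerate("甲状腺癌")}
features = concat_features("甲状腺癌", toy_embeddings)
```

Each row of `features` is the character embedding followed by the lexicon flags, so the sequence model would see both kinds of information at every position.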

Table 2. Entity counts

Entity type  Symptom  Disease  Organ  Treatment  Examination  All entities
Count        11676    2418     5162   1429       3620         24305

Table 3. Experimental results of each model (%)

Model           Precision  Recall  F-measure
CRF             94.09      90.98   92.51
BiLSTM-CRF      94.50      92.35   93.41
BiLSTM-withfea  96.04      92.16   94.06
Voting          95.53      92.38   93.93
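The F-measure values are consistent with the balanced F-measure, the harmonic mean of precision and recall. That the authors used exactly this formula is an assumption, although it reproduces all four rows when rounded to two decimals. A minimal check (`f_measure` and `table3` are illustrative names):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs in percent, copied from the results table.
table3 = {
    "CRF": (94.09, 90.98),
    "BiLSTM-CRF": (94.50, 92.35),
    "BiLSTM-withfea": (96.04, 92.16),
    "Voting": (95.53, 92.38),
}
f_scores = {name: round(f_measure(p, r), 2) for name, (p, r) in table3.items()}
```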

Table 4. F-measure for each entity type (%)

Model           Symptom  Disease  Organ  Treatment  Examination  Overall
CRF             92.95    96.82    95.94  89.18      87.45        92.51
BiLSTM-CRF      94.10    96.38    95.80  90.21      90.23        93.41
BiLSTM-withfea  94.37    96.80    97.18  90.30      89.21        94.06
Voting          94.21    96.58    95.94  93.44      90.85        93.93

Table 5. Number and proportion of biased entities in each model

Model           Biased entities  Recognized entities  Bias rate
CRF             0                2410                 0%
BiLSTM-CRF      156              2477                 6.29%
BiLSTM-withfea  104              2502                 4.15%
Voting          0                2389                 0%
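The bias rate appears to be the number of biased entities divided by the number of recognized entities. The printed percentages match truncation to two decimals rather than rounding (156/2477 is about 6.298%); both the formula and the truncation are inferences from the numbers, not something stated on this page. A small sketch, with `bias_rate_percent` as an illustrative name:

```python
import math


def bias_rate_percent(biased: int, recognized: int) -> float:
    """Percentage of recognized entities that are biased, truncated to 2 decimals."""
    return math.floor(biased / recognized * 10000) / 100


# Counts copied from the table of biased entities.
rates = {
    "CRF": bias_rate_percent(0, 2410),
    "BiLSTM-CRF": bias_rate_percent(156, 2477),
    "BiLSTM-withfea": bias_rate_percent(104, 2502),
    "Voting": bias_rate_percent(0, 2389),
}
```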

Table 6. F-measure for different word-vector models and dimensions (%)

Model      Dimension 100  Dimension 300
Skip-Gram  93.22          93.62
CBOW       92.86          93.04
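The two word2vec variants compared here differ in how training examples are formed: Skip-Gram predicts each context item from the center item, while CBOW predicts the center item from its surrounding context. The sketch below only builds character-level training examples and does not train any vectors. The window size and the function names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(chars: str, window: int) -> List[Tuple[str, str]]:
    """(center, context) pairs: Skip-Gram predicts each context from the center."""
    pairs = []
    for i, center in enumerate(chars):
        lo, hi = max(0, i - window), min(len(chars), i + window + 1)
        pairs.extend((center, chars[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(chars: str, window: int) -> List[Tuple[Tuple[str, ...], str]]:
    """(context tuple, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(chars):
        lo, hi = max(0, i - window), min(len(chars), i + window + 1)
        context = tuple(chars[j] for j in range(lo, hi) if j != i)
        examples.append((context, center))
    return examples


sg = skipgram_pairs("甲状腺", window=1)
cb = cbow_examples("甲状腺", window=1)
```

For the three-character string above, Skip-Gram yields four (center, context) pairs, while CBOW yields one example per character.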

Table 7. F-measure for different deep learning parameters (%)

Learning rate  Batch size 64  Batch size 128  Batch size 256
0.01           92.67          92.89           92.73
0.001          93.31          93.62           93.40
0.0001         -              93.41           93.01
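Selecting the best setting from this grid is a simple arg-max over the reported cells. The sketch below copies the values from the table and skips the cell reported as "-"; the dictionary layout and variable names are illustrative.

```python
# F-measure (%) keyed by (learning_rate, batch_size), copied from the table.
grid_f = {
    (0.01, 64): 92.67, (0.01, 128): 92.89, (0.01, 256): 92.73,
    (0.001, 64): 93.31, (0.001, 128): 93.62, (0.001, 256): 93.40,
    (0.0001, 64): None,  # reported as "-" in the table
    (0.0001, 128): 93.41, (0.0001, 256): 93.01,
}

# Ignore the missing cell, then pick the configuration with the highest F-measure.
reported = {cfg: f for cfg, f in grid_f.items() if f is not None}
best_cfg = max(reported, key=reported.get)
best_f = reported[best_cfg]
```

The selected pair is learning rate 0.001 with batch size 128, the cell with the highest reported F-measure (93.62).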

Table 8. Optimal deep learning parameters

Language model  Vector size  Learning rate  Batch size
Skip-Gram       300          0.001          128