South China Journal of Preventive Medicine (华南预防医学), 2019, Vol. 45, Issue (1): 26-31. doi: 10.13217/j.scjpm.2019.0026

• Original Article •

Risk assessment of dengue fever based on random forest model

HUANG Yu-lin¹, ZHAO Yong-qian¹, CAO Zheng², LIU Tao³, DENG Ai-ping⁴, XIAO Jian-peng³, ZHANG Bing³, ZHU Guang-hu³, PENG Zhi-qiang⁴, MA Wen-jun³

  1. Jinan University Faculty of Medical Science, Guangzhou 510632, Guangdong, China;
  2. School of Geographical Sciences, Guangzhou University;
  3. Guangdong Provincial Institute of Public Health, Guangdong Provincial Center for Disease Control and Prevention;
  4. Guangdong Provincial Center for Disease Control and Prevention
  • Received: 2018-09-12 Published: 2019-04-19
  • Corresponding authors: PENG Zhi-qiang, E-mail: 674699776@qq.com; MA Wen-jun, E-mail: mawj@gdiph.org.cn
  • About the authors: HUANG Yu-lin (born 1993), female, master's student; research interest: disease prevention and control. ZHAO Yong-qian (born 1991), female, master's degree, physician; research interest: disease prevention and control. HUANG Yu-lin and ZHAO Yong-qian are co-first authors.
  • Funding:
    National Key Research and Development Program of China (2018YFB0505500, 2018YFB0505503); Guangdong Provincial Science and Technology Program (2014A040401041); National Natural Science Foundation of China (81773497)

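Abstract (translated from the Chinese original): Objective To build a small-spatial-scale dengue fever risk assessment tool based on random forest regression models, providing evidence for dengue prevention and control. Methods Dengue case data and related factors from January 2012 to September 2014 served as the training set, on which separate random forest regression models were built for three risk indicators: epidemic frequency, duration, and intensity. Dengue case data and related factors from October 2014 to December 2015 served as the validation set, on which the models were evaluated. Results The frequency, duration, and intensity indicators each had a correlation coefficient above 0.7 with the case-count indicator. The random forest regression models for frequency, duration, and intensity built on the training set explained 96.72%, 91.98%, and 90.1% of the variance, respectively, indicating good fit; under cross-validation the mean squared errors of the models were 0.0019, 1.4246, and 1.8811, respectively, all at a low level. In a comparison of the accuracy of random forest regression, support vector regression, generalized linear models, and generalized additive models, the machine learning models (random forest regression and support vector regression) had far lower mean squared errors than the generalized linear and generalized additive models. Conclusion Random forest regression models with dengue frequency, duration, and intensity as outcome variables and meteorological, environmental, and socioeconomic characteristics as predictors showed good accuracy and can serve as a risk assessment tool in support of dengue prevention and control.

Keywords (translated from the Chinese original): dengue fever, random forest regression, risk assessment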

Abstract: Objective To construct a small-spatial-scale dengue risk assessment tool based on the random forest model, so as to provide a scientific basis for the prevention and control of dengue fever. Methods Data on dengue cases and related factors from February 2012 to September 2014 were used as the training set, and random forest regression (RFR) models were constructed separately for the frequency, duration, and intensity of dengue fever. Data on dengue cases and related factors from October 2014 to March 2015 were used as the testing set to verify the accuracy of the models. Results The correlation coefficients between incidence and the frequency, duration, and intensity of dengue fever were all higher than 0.7. Based on the training set, the pseudo R-squared values of the frequency, duration, and intensity models were 96.72%, 91.98%, and 90.1%, respectively; the cross-validated mean square errors (MSEs) of the models were 0.0019, 1.4246, and 1.8811, respectively. In a comparison of the accuracy of RFR, support vector regression (SVR), the generalized linear model (GLM), and the generalized additive model (GAM), the MSEs of RFR and SVR were much lower than those of GLM and GAM. Conclusion The RFR models constructed using the frequency, duration, and intensity of dengue fever as outcome variables and meteorological, environmental, and socioeconomic characteristics as predictors showed good accuracy and can be used as a risk assessment tool for the prevention and control of dengue fever outbreaks.

Key words: Dengue, Random forest regression, Risk assessment

CLC number (Chinese Library Classification):

  • R183.5
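
The workflow summarized in the abstract (fit random forest regression on an earlier period, report the share of variance explained, then check the mean squared error on a later period) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration and not the authors' code or data: the predictors (`temp`, `rain`, `dens`), the synthetic outcome, the tree settings (`n_trees`, `mtry`, `max_depth`, `min_leaf`), and the 80/40 chronological split are all hypothetical. The pseudo R-squared is computed as 1 minus the out-of-bag MSE divided by the outcome variance, which is one common definition and may differ from the one used in the paper; the paper's cross-validation step is not reproduced here.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, rng, mtry, depth=0, max_depth=4, min_leaf=5):
    """Grow one regression tree, testing `mtry` random features per split."""
    leaf = sum(y) / len(y)
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return leaf
    _, j, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, mtry, depth + 1, max_depth, min_leaf)
    return (j, t, grow(left), grow(right))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        j, t, low, high = node
        node = low if row[j] <= t else high
    return node

def fit_forest(X, y, n_trees=30, mtry=2, seed=0):
    """Fit bagged trees; also return out-of-bag (OOB) predictions per row."""
    rng = random.Random(seed)
    n = len(y)
    trees, oob_sum, oob_cnt = [], [0.0] * n, [0] * n
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = build_tree([X[i] for i in boot], [y[i] for i in boot], rng, mtry)
        trees.append(tree)
        for i in set(range(n)) - set(boot):  # rows this tree never saw
            oob_sum[i] += predict_tree(tree, X[i])
            oob_cnt[i] += 1
    oob = [s / c if c else None for s, c in zip(oob_sum, oob_cnt)]
    return trees, oob

def predict_forest(trees, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, row) for t in trees) / len(trees)

def mse(y_true, y_pred):
    """Mean squared error, skipping rows without a prediction."""
    pairs = [(a, b) for a, b in zip(y_true, y_pred) if b is not None]
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

def variance(values):
    """Population variance."""
    return sse(values) / len(values)

# Synthetic, time-ordered rows (hypothetical predictors and outcome).
data_rng = random.Random(42)
X, y = [], []
for _ in range(120):
    temp = data_rng.uniform(15, 35)   # assumed temperature-like predictor
    rain = data_rng.uniform(0, 300)   # assumed rainfall-like predictor (no effect)
    dens = data_rng.uniform(0, 1)     # assumed density-like predictor
    X.append([temp, rain, dens])
    y.append((5.0 if temp > 25 else 0.0) + 2.0 * dens + data_rng.gauss(0, 0.3))

# Chronological split: earlier rows train the model, later rows test it.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

trees, oob = fit_forest(X_train, y_train)
pseudo_r2 = 1.0 - mse(y_train, oob) / variance(y_train)
test_mse = mse(y_test, [predict_forest(trees, row) for row in X_test])
test_var = variance(y_test)

print(f"OOB pseudo R-squared: {pseudo_r2:.3f}")
print(f"Hold-out MSE: {test_mse:.3f} (outcome variance {test_var:.3f})")
```

Comparing the hold-out MSE with the outcome variance gives a rough sense of scale: a model that always predicted the mean would have an MSE close to that variance. In practice a maintained library implementation would normally replace this hand-written forest.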