South China Journal of Preventive Medicine ›› 2026, Vol. 52 ›› Issue (5): 547-553. doi: 10.12183/j.scjpm.2026.0547

• Original Article •

A risk prediction model for lung cancer incidence using electronic health records from a health screening cohort

Guo Qian, Liu Rongmei, Song Yao, Zhang Mengjiao, Feng Yueliang

  1. Beijing Chest Hospital, Capital Medical University, Beijing 101149, China
  • Received: 2025-12-31 Online: 2026-05-20 Published: 2026-06-05
  • Corresponding author: Feng Yueliang, E-mail: Fenglueliang000@163.com
  • Author profile: Guo Qian (born 1989), female, bachelor's degree, nurse-in-charge; research interest: medical oncology

Abstract: Objective To develop and validate a risk prediction model for lung cancer incidence in a health screening population using electronic medical records and imaging data. Methods Data were retrospectively collected for 2 567 health screening participants who underwent computed tomography (CT) at Beijing Chest Hospital between January 2022 and June 2024, had pulmonary nodules, and were suspected of having lung cancer. The cohort was randomly allocated in a 7:3 ratio into a training set (n=1 797) and a validation set (n=770). Within the training set, participants were divided by lung cancer outcome into an incidence group (n=247) and a non-incidence group (n=1 550). Univariate analysis and LASSO-logistic regression were used to screen predictors, from which a nomogram was constructed and validated. Results Multivariate analysis identified seven factors independently associated with lung cancer incidence (P<0.05): age ≥60 years (OR=5.081), smoking history (OR=6.026), family history of lung cancer (OR=3.669), maximum nodule diameter of 5 to <30 mm (OR=1.613 and 6.330), nodule type (solid, OR=0.368; part-solid, OR=2.548), nodule margin (rough, OR=2.526; spiculated, OR=10.175), and serum carcinoembryonic antigen (CEA) ≥5.0 ng/mL (OR=5.044). The concordance index (C-index) was 0.891 (95% CI: 0.867-0.915) in the training set and 0.865 (95% CI: 0.825-0.905) in the validation set; the area under the receiver operating characteristic curve (AUC) was 0.875 (95% CI: 0.851-0.899) and 0.863 (95% CI: 0.823-0.902), respectively. Calibration curves and decision curve analysis indicated good calibration and clinical utility. When participants were stratified by risk score, lung cancer incidence in the high-risk group (n=102) was significantly higher than in the intermediate-risk (n=509) and low-risk (n=1 186) groups (P<0.01). Conclusion The nomogram, built on the seven factors above, shows good discrimination, calibration, and net clinical benefit, and offers an effective tool for the early identification and stratified management of individuals at high risk of lung cancer.
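To illustrate how odds ratios like those reported in the abstract translate into nomogram axes, the following is a minimal Python sketch, not the authors' implementation. It assumes the reported ORs are adjusted estimates from the final logistic model, converts each to a coefficient (β = ln OR), and rescales so that the predictor with the widest coefficient range spans 0 to 100 points, a common nomogram convention that the abstract does not confirm. The level labels `reference`, `category 1`, and `category 2` are placeholders, because the abstract gives neither the reference categories nor the diameter cut-offs. The model intercept is not reported either, so only relative points can be derived, not absolute predicted probabilities.

```python
import math

# Odds ratios as reported in the abstract. "reference" (OR = 1) and the
# diameter labels "category 1"/"category 2" are placeholder names: the
# abstract does not state the reference levels or the diameter cut-offs.
ODDS_RATIOS = {
    "age": {"reference": 1.0, ">=60 years": 5.081},
    "smoking history": {"reference": 1.0, "yes": 6.026},
    "family history of lung cancer": {"reference": 1.0, "yes": 3.669},
    "maximum diameter": {"reference": 1.0, "category 1": 1.613, "category 2": 6.330},
    "nodule type": {"reference": 1.0, "solid": 0.368, "part-solid": 2.548},
    "nodule margin": {"reference": 1.0, "rough": 2.526, "spiculated": 10.175},
    "serum CEA": {"reference": 1.0, ">=5.0 ng/mL": 5.044},
}


def nomogram_points(odds_ratios, max_points=100.0):
    """Map each predictor level to nomogram-style points.

    Each OR becomes a log-odds coefficient (beta = ln OR). Within a predictor,
    the level with the smallest beta gets 0 points; all predictors share one
    scale on which the widest beta range equals max_points.
    """
    betas = {
        var: {lvl: math.log(odds) for lvl, odds in levels.items()}
        for var, levels in odds_ratios.items()
    }
    widest = max(max(b.values()) - min(b.values()) for b in betas.values())
    return {
        var: {
            lvl: max_points * (beta - min(b.values())) / widest
            for lvl, beta in b.items()
        }
        for var, b in betas.items()
    }


def total_points(points, profile):
    """Sum the points for one individual's predictor levels."""
    return sum(points[var][lvl] for var, lvl in profile.items())


POINTS = nomogram_points(ODDS_RATIOS)

for var, levels in POINTS.items():
    for lvl, pts in levels.items():
        print(f"{var:32s}{lvl:14s}{pts:6.1f}")

# Hypothetical individual, for illustration only.
example = {
    "age": ">=60 years",
    "smoking history": "yes",
    "family history of lung cancer": "reference",
    "maximum diameter": "category 2",
    "nodule type": "part-solid",
    "nodule margin": "spiculated",
    "serum CEA": "reference",
}
print("total points:", round(total_points(POINTS, example), 1))
```

Under these assumptions the nodule margin sets the scale, since its largest OR (10.175) gives the widest coefficient range. Because the solid nodule type has an OR below 1, it anchors the zero of its axis and the reference type scores above it. Mapping total points to a predicted probability would require the fitted intercept from the full paper.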

Key words: Lung neoplasms; Risk factors; Nomograms; Tomography, spiral computed; Carcinoembryonic antigen; Smoking

CLC number:

  • R73-31