South China Journal of Preventive Medicine ›› 2026, Vol. 52 ›› Issue (5): 547-553. doi: 10.12183/j.scjpm.2026.0547

• Original Article •

A risk prediction model for lung cancer incidence using electronic health records from a health screening cohort

Guo Qian, Liu Rongmei, Song Yao, Zhang Mengjiao, Feng Yueliang   

  1. Beijing Chest Hospital, Capital Medical University, Beijing 101149, China
  • Received: 2025-12-31 Online: 2026-05-20 Published: 2026-06-05

Abstract: Objective To develop and validate a risk prediction model for lung cancer incidence in a health screening cohort using electronic medical records and imaging data. Methods A retrospective study was conducted on 2 567 individuals who underwent computed tomography (CT) at Beijing Chest Hospital between January 2022 and June 2024, all of whom had pulmonary nodules and were suspected of having lung cancer. The cohort was randomly split 7:3 into a training set (n=1 797) and a validation set (n=770). Within the training set, participants were classified as cancer-positive (n=247) or cancer-negative (n=1 550) according to the final diagnosis of lung cancer. Univariate analysis and LASSO-logistic regression were used to select significant predictors, which were then used to construct a nomogram, and the model's performance was validated. Results Multivariate analysis identified seven independent risk factors for lung cancer incidence (P<0.05): age ≥60 years (OR=5.081), history of smoking (OR=6.026), family history of lung cancer (OR=3.669), maximum nodule diameter of 5 to <30 mm (OR=1.613 and 6.330), nodule type (OR=0.368 and 2.548), nodule margin (OR=2.526 and 10.175), and serum carcinoembryonic antigen (CEA) level ≥5.0 ng/mL (OR=5.044). The model showed strong discrimination, with a concordance index (C-index) of 0.891 (95% CI: 0.867-0.915) in the training set and 0.865 (95% CI: 0.825-0.905) in the validation set. The area under the receiver operating characteristic curve (AUC) was 0.875 (95% CI: 0.851-0.899) in the training set and 0.863 (95% CI: 0.823-0.902) in the validation set. Calibration curves and decision curve analysis indicated good model fit and clinical utility. Stratification by risk score showed that the incidence of lung cancer in the high-risk group (n=102) was significantly higher than in the intermediate-risk (n=509) and low-risk (n=1 186) groups (P<0.01). Conclusion The nomogram based on these seven risk factors shows excellent discrimination, calibration, and net clinical benefit. It can serve as an effective tool for the early identification and stratified management of individuals at high risk for lung cancer.
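
The reported odds ratios can be related to nomogram points with a short calculation. The sketch below is illustrative only: it assumes the common convention of scaling each logistic coefficient (β = ln OR) so that the largest maps to 100 points, which the abstract does not confirm as the authors' method. It uses only the four binary predictors, because the abstract does not state which category each OR of the multi-level predictors (diameter, nodule type, margin) refers to; the resulting points therefore will not match the published nomogram. The names `ODDS_RATIOS`, `log_odds_coefficients`, and `nomogram_points` are hypothetical.

```python
import math

# Odds ratios for the binary predictors, copied from the abstract.
# Multi-level predictors are omitted (category-to-OR mapping not reported).
ODDS_RATIOS = {
    "age_ge_60": 5.081,
    "smoking_history": 6.026,
    "family_history": 3.669,
    "cea_ge_5": 5.044,
}


def log_odds_coefficients(odds_ratios):
    """Return the logistic coefficient beta = ln(OR) for each predictor."""
    return {name: math.log(value) for name, value in odds_ratios.items()}


def nomogram_points(odds_ratios, max_points=100.0):
    """Scale coefficients so the largest absolute one maps to max_points.

    Assumed convention for illustration, not the authors' stated procedure.
    """
    betas = log_odds_coefficients(odds_ratios)
    largest = max(abs(beta) for beta in betas.values())
    return {name: max_points * beta / largest for name, beta in betas.items()}


if __name__ == "__main__":
    for name, points in nomogram_points(ODDS_RATIOS).items():
        print(f"{name}: {points:.1f} points")
```

Converting the total points to a predicted probability would additionally require the model intercept, which the abstract does not report.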

Key words: Lung neoplasms; Risk factors; Nomograms; Tomography, spiral computed; Carcinoembryonic antigen; Smoking

CLC Number: R73-31