Comparative Assessment of Machine Learning Approaches for Early Lung Cancer Diagnosis
Lung cancer, a leading cause of cancer-related mortality worldwide, often escapes early detection because its initial stages produce no distinct symptoms. This work investigates how Machine Learning (ML) might improve early diagnosis by analyzing Electronic Health Record (EHR) data. Multiple ML models were developed and evaluated on a synthetic dataset built to replicate real-world patient characteristics, which allows controlled experimentation while safeguarding privacy. The models were tuned with both conventional optimization methods and nature-inspired (metaheuristic) approaches, with the aim of balancing predictive accuracy and computational efficiency. On the synthetic dataset, ensemble learners optimized with metaheuristic techniques reached accuracy approaching 99 percent while remaining computationally efficient, and they generally outperformed simpler baselines. The contribution of this work is an exploration of GFO and WOA for feature selection and hyperparameter tuning of XGBoost, combined with a soft-voting ensemble. This approach offers an experimental pathway for improving predictive performance under computational constraints. Because the dataset is synthetic, however, the conclusions remain experimental; validation against clinical records will be essential before translation into practice.
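To make the soft-voting idea mentioned above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each base model outputs a class-probability vector for one patient. The function names (`soft_vote`, `hard_vote`, `argmax`) and the probability values are hypothetical and chosen only for illustration. Soft voting averages the probabilities and then picks the most likely class, whereas hard voting counts each model's top choice.

```python
from collections import Counter
from typing import List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def soft_vote(probas: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted average of per-model class-probability vectors."""
    if not probas:
        raise ValueError("need probabilities from at least one model")
    if weights is None:
        weights = [1.0] * len(probas)  # equal weight per model
    if len(weights) != len(probas):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]


def hard_vote(probas: Sequence[Sequence[float]]) -> int:
    """Majority vote over each model's most likely class."""
    votes = Counter(argmax(p) for p in probas)
    return votes.most_common(1)[0][0]


# Illustrative (made-up) outputs of three models for one patient,
# as [P(no cancer), P(cancer)].
example = [
    [0.45, 0.55],  # model A leans slightly towards class 1
    [0.40, 0.60],  # model B leans slightly towards class 1
    [0.95, 0.05],  # model C is very confident in class 0
]

averaged = soft_vote(example)
soft_label = argmax(averaged)
hard_label = hard_vote(example)
print(averaged, soft_label, hard_label)
```

In this made-up case the two schemes disagree: two of three models narrowly prefer class 1, so hard voting returns 1, while the averaged probabilities favor class 0 because one model is far more confident. That sensitivity to confidence is the usual reason for preferring soft voting when the base models produce reasonably calibrated probabilities.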
This work (including HTML and PDF files) is licensed under a Creative Commons Attribution 4.0 International License.



















