Improving the Quality Indicators of Multilevel Data Sampling Processing Models Based on Unsupervised Clustering

Ilya S. Lebedev, Mikhail E. Sukhoparov


This paper presents an approach to building and implementing data processing models and experimentally evaluates ways of improving ensemble methods based on multilevel data processing models. The study proposes a model that reduces the cost of retraining when data properties change. The research objective is to improve the quality indicators of machine learning models in classification problems. The novelty is a method that uses a multilevel architecture of data processing models to determine the current data properties in segments at different levels and to assign the algorithms with the best quality indicators to those segments. The method differs from known ones in using several model levels that analyze data properties and assign the best models to individual segments of data and training. The improvement consists of applying unsupervised clustering to the data samples: the resulting clusters serve as separate subsamples to which the best machine learning models and algorithms are assigned. Experimental values of quality indicators were obtained for different classifiers on the whole sample and on different segments. The findings show that unsupervised clustering with multilevel models can significantly improve the quality indicators of “weak” classifiers. The quality indicators of individual classifiers improve as the number of data clusters increases, up to a certain threshold. The results are applicable to classification tasks when developing machine learning models and methods. The proposed method improved classification quality indicators by 2–9% through segmentation and the assignment of the best-performing models to individual segments.
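The idea of clustering a sample and assigning the best model to each cluster can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-means routine with farthest-point initialization, the two toy classifiers, the even/odd hold-out split, and every function name (`kmeans`, `fit_majority`, `fit_nearest_centroid`, `fit_segmented`) are assumptions chosen only to keep the example self-contained.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: dist2(x, centroids[c]))

def kmeans(X, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization (assumed)."""
    centroids = [X[0]]
    while len(centroids) < k:
        centroids.append(max(X, key=lambda x: min(dist2(x, c) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            groups[nearest(x, centroids)].append(x)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Two hypothetical "weak" classifiers; each fit_* returns a predict function.
def fit_majority(X, y):
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_nearest_centroid(X, y):
    cents = {c: mean([x for x, t in zip(X, y) if t == c]) for c in sorted(set(y))}
    return lambda x: min(cents, key=lambda c: dist2(x, cents[c]))

CANDIDATES = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def fit_segmented(X, y, k, candidates=CANDIDATES):
    """Cluster the sample, then keep the best-scoring candidate per cluster."""
    centroids = kmeans(X, k)
    segments = [[] for _ in range(k)]
    for i, x in enumerate(X):
        segments[nearest(x, centroids)].append(i)
    chosen = {}
    for c, idx in enumerate(segments):
        idx = idx or list(range(len(X)))          # empty cluster: use whole sample
        train, val = idx[::2], idx[1::2] or idx[::2]
        scored = []
        for name, fit in candidates.items():
            model = fit([X[i] for i in train], [y[i] for i in train])
            score = accuracy(model, [X[i] for i in val], [y[i] for i in val])
            scored.append((score, name, model))
        chosen[c] = max(scored, key=lambda s: s[0])  # ties keep the first candidate
    predict = lambda x: chosen[nearest(x, centroids)][2](x)
    return predict, {c: name for c, (_, name, _) in chosen.items()}

# Synthetic data: segment A has a single class, segment B mixes two classes.
A = [(i / 10, j / 10) for i in range(6) for j in range(6)]
B = [(10 + i / 10, j / 10) for i in range(6) for j in range(6)]
X = A + B
y = [0] * len(A) + [1 if x[0] > 10.25 else 0 for x in B]

predict, names = fit_segmented(X, y, k=2)
print(names)
```

On this synthetic sample the homogeneous segment keeps the trivial majority rule, while the mixed segment is assigned the nearest-centroid classifier, which is the per-segment selection the abstract describes in miniature.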


DOI: 10.28991/ESJ-2024-08-01-025



Keywords: Regression; Data Structure; Prediction; Simulation.




Copyright (c) 2024 Ilya S. Lebedev, Mikhail E. Sukhoparov