Enhancing Small Language Models for Code Generation via Strategic Decomposition and Filtering
This study addresses the challenge of enhancing Small Language Models (SLMs) for complex code generation tasks that require structured planning, with which current models struggle due to their monolithic, single-pass generation approach. A three-stage pipeline is proposed that decouples strategic planning from implementation: (1) an SLM generates diverse natural-language strategies at high temperature, (2) a filtering mechanism selects high-quality strategies while removing noise, and (3) the refined strategies guide a specialized coding model for the final implementation. The approach was evaluated on the ClassEval benchmark for class-level code generation. The pipeline enabled a 1.5B-parameter model to achieve a 13% class-level success rate, a 30% relative improvement over direct generation (10%) and performance competitive with models 5–8 times larger. Critically, effective strategy filtering proved more important than strategy diversity, with simple pattern-based filters successfully mitigating SLM artifacts such as few-shot contamination. This work demonstrates that structured, inference-time computation offers an efficient alternative to parameter scaling, with strategic noise reduction being the key driver of performance gains in resource-constrained models.
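A minimal sketch of the three-stage pipeline described above is given below. The model wrappers, prompt wording, filter patterns, and thresholds are illustrative assumptions, not the authors' implementation; the planner and coder are passed in as generic callables so the control flow stays self-contained.

```python
import re
from typing import Callable, List

# Assumed patterns for common SLM artifacts; few-shot contamination is
# modeled here as leaked exemplar markers or embedded code fences.
NOISE_PATTERNS = [
    re.compile(r"```"),                    # strategy should be prose, not code
    re.compile(r"(?im)^example\s*\d*:"),   # leaked few-shot exemplars
    re.compile(r"(?i)as an ai"),           # boilerplate disclaimers
]

def generate_strategies(planner: Callable[[str, float], str],
                        task: str, n: int = 8,
                        temperature: float = 1.0) -> List[str]:
    """Stage 1: sample n diverse natural-language strategies at high temperature."""
    prompt = f"Outline a step-by-step strategy for:\n{task}"
    return [planner(prompt, temperature) for _ in range(n)]

def filter_strategies(strategies: List[str], min_len: int = 40) -> List[str]:
    """Stage 2: drop noisy or degenerate strategies via simple pattern checks."""
    kept = []
    for s in strategies:
        if len(s) < min_len:                           # degenerate output
            continue
        if any(p.search(s) for p in NOISE_PATTERNS):   # SLM artifact detected
            continue
        kept.append(s)
    return kept

def run_pipeline(planner: Callable[[str, float], str],
                 coder: Callable[[str], str], task: str) -> str:
    """Stage 3: guide a specialized coding model with a surviving strategy."""
    candidates = filter_strategies(generate_strategies(planner, task))
    strategy = candidates[0] if candidates else ""     # fall back to direct generation
    prompt = f"{task}\n\nFollow this plan:\n{strategy}" if strategy else task
    return coder(prompt)
```

Taking the first surviving candidate is one plausible selection rule; a scoring or self-consistency step over the filtered set would slot in at the same point without changing the overall structure.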
- This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.