Enhancing Small Language Models for Code Generation via Strategic Decomposition and Filtering
This study addresses the challenge of enhancing Small Language Models (SLMs) for complex code generation tasks that require structured planning, with which current models struggle due to their monolithic, single-pass generation approach. A three-stage pipeline is proposed that decouples strategic planning from implementation: (1) an SLM generates diverse natural-language strategies at high temperature, (2) a filtering mechanism selects high-quality strategies while removing noise, and (3) the refined strategies guide a specialized coding model for the final implementation. The approach was evaluated on the ClassEval benchmark for class-level code generation. The pipeline enabled a 1.5B-parameter model to achieve a 13% class-level success rate, a 30% relative improvement over direct generation (10%) and performance competitive with models 5–8 times larger. Critically, effective strategy filtering proved more important than strategy diversity, with simple pattern-based filters successfully mitigating SLM artifacts such as few-shot contamination. This work demonstrates that structured, inference-time computation offers an efficient alternative to parameter scaling, with strategic noise reduction being the key driver of performance gains in resource-constrained models.
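A minimal sketch of the three-stage pipeline described above is given below. The model wrappers, prompt wording, filter patterns, and thresholds are illustrative assumptions, not the authors' implementation; the planner and coder are passed in as generic callables so the control flow stays self-contained.

```python
import re
from typing import Callable, List

# Assumed patterns for common SLM artifacts; few-shot contamination is
# modeled here as leaked exemplar markers or embedded code fences.
NOISE_PATTERNS = [
    re.compile(r"```"),                    # strategy should be prose, not code
    re.compile(r"(?im)^example\s*\d*:"),   # leaked few-shot exemplars
    re.compile(r"(?i)as an ai"),           # boilerplate disclaimers
]

def generate_strategies(planner: Callable[[str, float], str],
                        task: str, n: int = 8,
                        temperature: float = 1.0) -> List[str]:
    """Stage 1: sample n diverse natural-language strategies at high temperature."""
    prompt = f"Outline a step-by-step strategy for:\n{task}"
    return [planner(prompt, temperature) for _ in range(n)]

def filter_strategies(strategies: List[str], min_len: int = 40) -> List[str]:
    """Stage 2: drop noisy or degenerate strategies via simple pattern checks."""
    kept = []
    for s in strategies:
        if len(s) < min_len:                           # degenerate output
            continue
        if any(p.search(s) for p in NOISE_PATTERNS):   # SLM artifact detected
            continue
        kept.append(s)
    return kept

def run_pipeline(planner: Callable[[str, float], str],
                 coder: Callable[[str], str], task: str) -> str:
    """Stage 3: guide a specialized coding model with a surviving strategy."""
    candidates = filter_strategies(generate_strategies(planner, task))
    strategy = candidates[0] if candidates else ""     # fall back to direct generation
    prompt = f"{task}\n\nFollow this plan:\n{strategy}" if strategy else task
    return coder(prompt)
```

Taking the first surviving candidate is one plausible selection rule; a scoring or self-consistency step over the filtered set would slot in at the same point without changing the overall structure.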
- This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.