End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Models Using an Integrated Approach
H.M.Shadman Tabib and Jaber Ahmed Deedar
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Abstract. This work introduces a systematic approach for enhancing large language models (LLMs) to address Bangla AI mathematical challenges. Through the assessment of diverse LLM configurations, fine-tuning with specific datasets, and the implementation of Retrieval-Augmented Generation (RAG), we enhanced the model's reasoning precision in a multilingual setting. Key findings indicate that customized prompting, dataset augmentation, and iterative reasoning improve the model's performance on Olympiad-level mathematical challenges.
Keywords: Large Language Models (LLMs), Fine-Tuning, Bangla AI, Mathematical Reasoning, Retrieval-Augmented Generation (RAG), Multilingual Setting, Customized Prompting, Dataset Augmentation, Iterative Reasoning, Olympiad-Level Challenges
1 Introduction
In recent years, large language models (LLMs) have exhibited tremendous potential in tackling hard mathematics problems, including ones of Olympiad-level complexity. Omnimath is a global benchmark that covers several sub-domains and difficulty levels [1], and the Odyssey math dataset measures mathematical problem-solving ability in LLMs [2]. These are just a few of the numerous methodologies and standards that researchers have devised to evaluate and enhance the effectiveness of these models in mathematical reasoning.
A significant method is the self-critique pipeline, implemented in the ChatGLM-Math model. This approach trains a Math-Critique model based on the LLM to provide feedback signals. The LLM's own outputs then undergo rejective fine-tuning and direct preference optimization for data acquisition. Experiments with the ChatGLM3-32B model showed that this pipeline greatly enhances the ability to solve mathematical problems while maintaining language skills, outperforming larger LLMs [3].
Another new technique involves the creation of MetaMath, a refined language model focused on mathematical reasoning. The MetaMath methodology begins by formulating mathematical inquiries from several viewpoints, culminating in a novel dataset termed MetaMathQA. This dataset was used to fine-tune the LLaMA-2 models. Studies using benchmarks like GSM8K and MATH show that MetaMath outperforms a number of open-source LLMs. For example, the MetaMath-7B model achieved a GSM8K accuracy of $66.5\%$ and a MATH accuracy of $19.8\%$, far better than the performance of models of the same size [4].
New benchmarks have also been established to assess the mathematical reasoning capabilities of LLMs, in addition to model-specific improvements. Mathador-LM is a dynamic benchmark derived from the Mathador game, in which the goal is to reach a target number through fundamental arithmetic operations applied to a specified set of base numbers while adhering to simple rules. This test integrates ruleset interpretation, planning, and problem-solving, offering a thorough evaluation of LLMs' mathematical reasoning ability. Research indicates that modern models perform poorly on Mathador-LM, achieving scores well below those of average third graders, underscoring the need for improvement in this domain [5].
Additionally, Concept Math is a multilingual benchmark (English and Chinese) that evaluates concept-specific mathematical reasoning in LLMs. Concept Math organizes math problems so that they can be tested at different levels of detail with concept-specific accuracy, unlike traditional benchmarks that only measure general mathematical reasoning with average accuracy. Evaluations have shown that existing LLMs, while achieving high average accuracies on common metrics, exhibit significant performance variation across different mathematical concepts and may even struggle with basic ones [6].
Another paper [7] introduces the MCT Self-Refine (MCTSr) algorithm, a novel approach integrating Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to improve performance on complex mathematical reasoning tasks. The MCTSr algorithm addresses the challenges of accuracy and reliability in LLM-based reasoning by employing systematic exploration and heuristic self-refinement mechanisms. Another work presents OlympiadBench, a bilingual multimodal benchmark consisting of 8,476 Olympiad-level mathematics and physics tasks, intended to assess the advanced reasoning abilities of LLMs and Large Multimodal Models (LMMs). Preliminary assessments indicate considerable difficulty: the top-performing model, GPT-4V, attains merely 17.97% accuracy, highlighting the benchmark's stringency and its potential as a tool for furthering AGI research [8].
These advancements highlight the continuous endeavors to improve the mathematical problem-solving capabilities of LLMs via novel training methodologies and thorough assessment standards. Still, it remains difficult to ensure LLMs can operate across several languages and problem types, especially for challenging tasks in non-English settings. Efforts such as multilingual machine translation [9] and optimized frameworks like BEATS, which strengthens mathematical reasoning through back-verification and adaptive disambiguation [10], emphasize the continuous improvement of these models. This study builds on that prior research by concentrating on fine-tuning LLMs to solve Bangla math problems, using a structured strategy of dataset enhancement, retrieval-augmented generation (RAG) [16], and personalized prompting to obtain the best results on Bangla AI math problems.
2 Methodology
2.1 Model Selection
We first considered a model from the AI Mathematical Olympiad organized by AIMO. This model performs moderately well on our dataset, but falls short of the expected level due to the language difference.
We then focused on the Qwen2.5-7B-Instruct, Coder, and Math-Instruct models [13]. Our Bangla evaluations of the Qwen2.5-7B-Math-Instruct model on mathematical exercises were not satisfactory. Qwen2.5-7B-Coder, which is specialized for coding, also failed to generate code for most Bangla problems. Manual inference on selected problems revealed that these models could not solve even trivial problems.
However, Qwen2.5-7B-Instruct understood and solved Bangla math problems well. The model's versatility and Bangla-language adaptability made it well suited for our purpose; after extensive evaluation, it offered the best balance between mathematical reasoning and Bangla-language adaptation. Next, we tested Qwen2.5-32B-Instruct-AWQ without fine-tuning or RAG, which achieved the best accuracy of all, scoring 77 out of 100 on the test set. This suggests that models trained with more parameters deliver better performance.
2.2 Datasets
– BDMO Dataset
This dataset is provided by the Bangladesh Math Olympiad (BDMO). It originally contains 209 Bangla Math Olympiad problems with their numerical solutions. We extended the dataset by adding step-by-step TIR solutions, and we also translated the problem statements into English for further testing. There are also 100 test problems available in the associated Kaggle contest, on which the accuracy score is calculated.
– Translated Numina CoT Dataset:
This dataset comprises 0.8M problems formatted using the Chain-of-Thought (CoT) reasoning technique, focusing on the step-by-step solutions that were used to train NuminaMath-7B-CoT in the AI-MO competition. For our use case, these were translated into Bangla using the gemini-1.5-flash model for fine-tuning. The dataset supports the training of models capable of multi-step reasoning, and it contains both the problem statement and the step-by-step solution description in Bangla.
– Translated Numina TIR Dataset:
A foundational dataset with around 72K math competition problems, each solution generated by GPT-4 with tool-integrated reasoning (TIR) [15]. These were likewise translated into Bangla for our convenience.
– Synthetic Problems Dataset:
This dataset includes a variety of synthetic mathematical problems generated with the GPT-4o API based on the BDMO dataset. Each original problem is paraphrased into multiple similar variants, ensuring diversity within each category during training.
These datasets can be found in the Dataset Availability section.
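The batch-translation step used to produce the Bangla versions of the Numina datasets can be sketched as follows. This is an illustrative outline only: the prompt wording is an assumption, and `translate` is a hypothetical stand-in for a real gemini-1.5-flash API call, stubbed here so the pairing logic is self-contained.

```python
# Sketch of translating problem/solution pairs into Bangla while keeping
# each problem aligned with its solution. The prompt text and the
# `translate` helper are illustrative assumptions, not the paper's code.

TRANSLATE_PROMPT = (
    "Translate this math text into Bangla, keeping all numbers, "
    "variable names, and LaTeX unchanged:\n\n{text}"
)

def translate(text):
    # Placeholder for an actual gemini-1.5-flash request.
    return "[bn] " + text

def translate_dataset(records):
    """Translate each record's problem and solution, preserving pairing."""
    out = []
    for r in records:
        out.append({
            "problem": translate(TRANSLATE_PROMPT.format(text=r["problem"])),
            "solution": translate(TRANSLATE_PROMPT.format(text=r["solution"])),
        })
    return out

bn = translate_dataset([{"problem": "Find x if 2x = 10.", "solution": "x = 5"}])
```

In practice the stub would be replaced by a rate-limited API call, with retries for failed translations.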
2.3 Preprocessing
We applied a comprehensive data preprocessing method that integrated Retrieval-Augmented Generation (RAG) with keyword search to address the challenges of the Bangla Math Olympiad. We first manually identified significant keywords for each mathematical problem, focusing on terms that capture the essential concepts and categories of the problems. Using these keywords, we conducted similarity searches within our problem dataset to identify problems with analogous structures or themes. The retrieved similar problems and their Tool-Integrated Reasoning solutions were then incorporated as few-shot examples in the model's prompts, providing contextual assistance and enhancing the model's problem-solving capabilities. By integrating RAG with keyword-based retrieval, we enhanced the model's ability to comprehend and address Bangla Math Olympiad problems effectively; associating contextual examples with each problem aids the model's effectiveness in solving math problems.
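The keyword-based retrieval and few-shot prompt construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the dataset entries, keyword lists, and overlap-count scoring are illustrative, not the paper's actual retrieval implementation.

```python
# Sketch of keyword-overlap retrieval of similar problems, whose TIR
# solutions are then packed into the prompt as few-shot examples.
# Dataset entries and keywords below are illustrative placeholders.

def keyword_overlap(query_keywords, problem_keywords):
    """Score a candidate problem by how many keywords it shares with the query."""
    return len(set(query_keywords) & set(problem_keywords))

def retrieve_similar(query_keywords, dataset, k=3):
    """Return the top-k problems whose manually assigned keywords best match."""
    ranked = sorted(
        dataset,
        key=lambda p: keyword_overlap(query_keywords, p["keywords"]),
        reverse=True,
    )
    return ranked[:k]

def build_few_shot_prompt(question, examples):
    """Prepend retrieved (problem, TIR solution) pairs as few-shot context."""
    parts = [f"Problem: {ex['problem']}\nSolution: {ex['solution']}"
             for ex in examples]
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)

dataset = [
    {"problem": "P1", "solution": "S1", "keywords": ["modular", "remainder"]},
    {"problem": "P2", "solution": "S2", "keywords": ["geometry", "circle"]},
    {"problem": "P3", "solution": "S3", "keywords": ["modular", "prime"]},
]
top = retrieve_similar(["modular", "prime"], dataset, k=2)
prompt = build_few_shot_prompt("Q", top)
```

A production version would combine this keyword score with dense-embedding similarity, but the prompt-assembly step is the same.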
2.4 Augmentation
To enhance the capabilities of our model against the challenges presented by the Bangla Math Olympiad, we expanded and diversified our training dataset using data augmentation techniques. With the help of OpenAI's GPT-4o, we generated five paraphrased, similar versions of each original problem, ensuring that the complexities and nuances of the originals were maintained. This modification introduced heterogeneity into the dataset and involved manual categorization of the problem sets. By exposing the model to a greater variety of problem statements, we improved its flexibility and robustness in addressing a variety of mathematical challenges in Bangla.
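The paraphrase-augmentation loop above can be sketched as follows. The prompt wording is an assumption, and `call_llm` is a hypothetical stand-in for the GPT-4o API request, stubbed here so the loop is self-contained; only the five-variants-per-problem structure follows the text.

```python
# Sketch of the paraphrase-augmentation loop: each problem is rephrased
# several times while its numerical answer is kept fixed.
# `call_llm` is a hypothetical placeholder for a GPT-4o API call.

PARAPHRASE_PROMPT = (
    "Rephrase the following Bangla math problem without changing its "
    "numbers, conditions, or answer. Return only the rephrased problem.\n\n"
    "{problem}"
)

def call_llm(prompt):
    # Placeholder for an actual GPT-4o request.
    return "<paraphrased> " + prompt.rsplit("\n", 1)[-1]

def augment(problems, n_variants=5):
    """Generate n_variants paraphrases per problem, pairing each with the
    original answer so the augmented set stays label-consistent."""
    augmented = []
    for p in problems:
        for _ in range(n_variants):
            variant = call_llm(PARAPHRASE_PROMPT.format(problem=p["statement"]))
            augmented.append({"statement": variant, "answer": p["answer"]})
    return augmented

data = augment([{"statement": "প্রশ্ন ১", "answer": 42}], n_variants=5)
```

Keeping the answer attached to every variant is the important design point: paraphrasing only touches the surface form, never the label.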
2.5 Fine-tuning
Fine-tuning of the Qwen-2.5-7B-Instruct model was performed on a single NVIDIA H100 NVL GPU for efficient training. The model underwent three phases to improve its reasoning on Bangla AI mathematical problems and solutions. The first phase fine-tuned the model on the translated Numina TIR dataset, which had around 72,000 problems with step-by-step Python solutions. The second stage used the Numina CoT dataset, which includes around 800,000 CoT problem-solution pairs; for our use case, we used 250,000 of those. We then used the GPT-4o API to create synthetic datasets to increase the model's adaptability and versatility. The final stage focused on paraphrasing and augmentation, rephrasing problem sets to simulate different mathematical presentations. The model was evaluated on the BDMO dataset and the test dataset at each stage to examine accuracy improvements and the effect of each fine-tuning stage. Several rounds of training improved the model's ability to follow complex reasoning processes and strengthened its Bangla AI mathematics comprehension and response formulation. All the models fine-tuned at each step can be found in the Model Availability section.
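The staged assembly of the fine-tuning data can be sketched as follows. The chat-message record layout is a common supervised fine-tuning format, assumed here rather than taken from the paper; the dataset contents are placeholders, with only the stage ordering (TIR first, then a CoT subset) following the text.

```python
# Sketch of assembling the staged fine-tuning corpora: each stage yields
# (problem, solution) pairs formatted as chat-style records, a layout
# commonly used for instruction tuning (assumed, not the paper's exact one).

def to_record(problem, solution):
    """Format one problem/solution pair as a two-turn chat record."""
    return {"messages": [
        {"role": "user", "content": problem},
        {"role": "assistant", "content": solution},
    ]}

def build_stage(pairs, limit=None):
    """Build one training stage, optionally truncated (e.g. 250K of 800K CoT)."""
    pairs = pairs[:limit] if limit is not None else pairs
    return [to_record(p, s) for p, s in pairs]

# Illustrative stand-ins for the translated datasets.
tir_pairs = [("TIR problem", "step-by-step python solution")]
cot_pairs = [("CoT problem", "chain-of-thought solution")] * 3

stages = [
    build_stage(tir_pairs),            # stage 1: translated Numina TIR
    build_stage(cot_pairs, limit=2),   # stage 2: subset of translated Numina CoT
]
```

Each stage's records would then be fed to a standard supervised fine-tuning trainer, resuming from the previous stage's checkpoint.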
2.6 Model Architecture and Flow
Figure 1 illustrates the architecture of our system developed to analyze Bangla mathematical problems. The system employs a Large Language Model (LLM) in conjunction with TIR agents and self-consistency [14]. The TIR agents iteratively generate Python code until a correct solution is found, then execute it in a REPL (Read-Eval-Print Loop). Following self-consistency prompting, the final answer is chosen by majority voting among the independently generated solutions.
The procedure starts by accepting Bangla mathematics questions as input. Keywords are derived by analyzing the training data, and a Retrieval-Augmented Generation (RAG) technique is employed to identify analogous instances. Retrieved problems matching these keywords are categorized as similar problems and used as few-shot prompts. These examples assist the LLM in producing methodical solutions. Multiple TIR agents develop distinct solution paths to investigate several methods for addressing the challenge, subsequently testing them in a Python code execution environment.
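The TIR-execution and self-consistency voting steps above can be sketched as follows. The in-process `exec` sandbox and the `answer`-variable convention are illustrative simplifications of a real REPL; the generated snippets are placeholders standing in for agent outputs.

```python
# Sketch of the TIR + self-consistency loop: each agent's generated Python
# snippet is run in a REPL-like namespace, and the final answer is the
# majority vote over the answers from snippets that executed successfully.
from collections import Counter

def run_snippet(code):
    """Execute a generated snippet and read back its `answer` variable.
    A failing run returns None (in practice, the agent would retry)."""
    ns = {}
    try:
        exec(code, ns)  # stand-in for a sandboxed REPL execution
        return ns.get("answer")
    except Exception:
        return None

def self_consistent_answer(snippets):
    """Majority vote over the successful executions (self-consistency)."""
    answers = [a for a in (run_snippet(c) for c in snippets) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Three illustrative "agent" outputs for the same problem; one takes a wrong path.
snippets = [
    "answer = sum(range(1, 11))",   # 55
    "answer = 10 * 11 // 2",        # 55, via a different derivation
    "answer = 10 * 11 / 3",         # incorrect solution path
]
final = self_consistent_answer(snippets)
```

A production system would run each snippet in an isolated subprocess with a timeout rather than a bare `exec`, but the voting logic is unchanged.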