End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Models Using an Integrated Approach
H.M.Shadman Tabib and Jaber Ahmed Deedar
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Abstract. This work introduces a systematic approach for enhancing large language models (LLMs) to address Bangla AI mathematical challenges. Through the assessment of diverse LLM configurations, fine-tuning with specific datasets, and the implementation of Retrieval-Augmented Generation (RAG), we enhanced the model's reasoning precision in a multilingual setting. Key findings indicate that customized prompting, dataset augmentation, and iterative reasoning improve the model's efficiency on Olympiad-level mathematical challenges.
Keywords: Large Language Models (LLMs), Fine-Tuning, Bangla AI, Mathematical Reasoning, Retrieval-Augmented Generation (RAG), Multilingual Setting, Customized Prompting, Dataset Augmentation, Iterative Reasoning, Olympiad-Level Challenges
1 Introduction
In recent years, large language models (LLMs) have exhibited tremendous potential in tackling hard mathematics problems, including ones at Olympiad-level complexity. OmniMath is a global benchmark that covers several sub-domains and difficulty levels [1], and the Odyssey math dataset measures mathematical problem-solving ability in LLMs [2]. These are just a few of the methodologies and standards researchers have devised to evaluate and enhance model reasoning.
A significant method is the self-critique pipeline, implemented in the ChatGLM-Math model. This approach entails training a Math-Critique model based on the LLM to provide feedback signals. The LLM's own outputs then undergo rejective fine-tuning and direct preference optimization for data acquisition. Experiments with the ChatGLM3-32B model showed that this pipeline greatly enhances the ability to solve mathematical problems while maintaining language skills, outperforming larger LLMs [3].
Another new technique involves the creation of MetaMath, a refined language model focused on mathematical reasoning. The MetaMath methodology begins by formulating mathematical inquiries from several viewpoints, culminating in a novel dataset termed MetaMathQA. This dataset was used to fine-tune the LLaMA-2 models. Studies using benchmarks like GSM8K and MATH show that MetaMath performs better than a number of open-source LLMs. For example, the MetaMath-7B model achieved a GSM8K accuracy of 66.5% and a MATH accuracy of 19.8%, which is much better than the performance of models of the same size [4].
New standards have been established to assess the mathematical reasoning capabilities of LLMs, in addition to model-specific improvements. Mathador-LM is a dynamic benchmark derived from the Mathador game, wherein the goal is to get a target number through fundamental arithmetic operations applied to a specified set of base numbers, adhering to straightforward criteria. This test integrates ruleset interpretation, planning, and problem-solving, offering a thorough evaluation of LLMs’ mathematical reasoning ability. Research indicates that modern models do poorly on Mathador-LM, achieving scores much below those of average third graders, underscoring the necessity for enhancements in this domain.[5]
Additionally, Concept Math is a multilingual benchmark (English and Chinese) that evaluates concept-specific mathematical reasoning in LLMs. Concept Math organizes math problems in a way that lets them be tested at different levels of detail with concept-specific accuracy. This is different from traditional benchmarks that only measure general mathematical reasoning with average accuracy. Evaluations have shown that existing LLMs, while achieving high average accuracies on common metrics, exhibit significant performance variations among different mathematical ideas and may even struggle with basic ones.[6]
Another paper [7] introduces the MCT Self-Refine (MCTSr) algorithm, a novel approach integrating Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to improve performance in complex mathematical reasoning tasks. The MCTSr algorithm addresses the challenges of accuracy and reliability in LLM-based reasoning by employing systematic exploration and heuristic self-refinement mechanisms. Another work presents Olympiad Bench, a bilingual multimodal benchmark consisting of 8,476 Olympiad-level mathematics and physics tasks, intended to assess the advanced reasoning abilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs). Preliminary assessments indicate considerable difficulties, with the top-performing model, GPT-4V, attaining merely 17.97% accuracy, highlighting the benchmark’s stringency and its potential as a tool for furthering AGI research.[8]
These advancements highlight the continuous endeavors to improve the mathematical problem-solving capabilities of LLMs via novel training methodologies and thorough assessment standards. Still, it remains difficult to ensure that LLMs can operate across several languages and problem types, especially for challenging tasks in non-English environments. Efforts such as multilingual machine translation [9] and optimized frameworks like BEATS, which strengthens mathematical reasoning through back-verification and adaptive disambiguation [10], emphasize the continuous improvement of these models. This study builds on that research by concentrating on fine-tuning LLMs to solve Bangla math problems, using a structured strategy of dataset enhancement, retrieval-augmented generation (RAG) [16], and personalized prompting to obtain the best results on Bangla AI math problems.
2 Methodology
2.1 Model Selection
We first evaluated the Numina-7B-TIR model from the AI Mathematical Olympiad organized by AIMO. This model performs moderately well on our dataset, but not up to the expected level due to the language difference.
We then focused on the Qwen2.5-7B Instruct, Coder, and Math-Instruct models [13]. Our Bangla evaluations of the Qwen2.5-7B-Math-Instruct model on mathematical exercises were not satisfactory. Qwen2.5-7B-Coder, which is specialized for coding, also failed to generate code for most Bangla problems. Manual inference on some selected problems revealed that these models could not solve even trivial problems.
However, Qwen2.5-7B-Instruct understood and solved Bangla math problems well. After extensive evaluations, the model's versatility and adaptability to Bangla made it the best balance of mathematical reasoning and language handling for our purposes. Next, we tested Qwen2.5-32B-Instruct-AWQ without fine-tuning or RAG, which achieved an improved accuracy of 77 out of 100 on the test set, the best of all configurations. This implies that a model trained with more parameters delivers better performance.
2.2 Datasets
– BDMO Dataset
This dataset is provided by the Bangladesh Math Olympiad (BDMO). It originally contains 209 Bangla Math Olympiad problems with their numerical solutions. We extended the dataset by adding step-by-step TIR solutions, and we also translated the problem statements into English for further testing. There are also 100 test problems available in the associated Kaggle contest, on which the accuracy score is calculated.
– Translated Numina CoT Dataset:
This dataset comprises 0.8M problems formatted using the Chain of Thought (CoT) reasoning technique, focusing on the step-by-step solutions used to train NuminaMath-7B-CoT in the AI-MO competition. For our use case, these were translated into Bangla with the gemini1.5-flash model for fine-tuning. It supports the training of models capable of multi-step reasoning, and contains both problem statements and solution-step descriptions in Bangla.
– Translated Numina TIR Dataset:
A foundational dataset of around 72K math competition problems, with each solution generated by GPT-4 with tool-integrated reasoning (TIR) [15]. These were likewise translated for our convenience.
– Synthetic Problems Dataset:
This dataset includes a variety of synthetic mathematical problems generated with the GPT-4o API based on the BDMO dataset. For each category, it provides multiple paraphrased variants of similar problems, ensuring diversity during training.
These datasets can be found in the Dataset Availability section.
2.3 Preprocessing
We applied a comprehensive data preprocessing method that integrated Retrieval-Augmented Generation (RAG) with keyword search to address the challenges of the Bangla Math Olympiad. We first manually identified significant keywords in each mathematical problem, focusing on terms that capture the essential concepts and categories of the problems. Using these keywords, we conducted similarity searches within our problem dataset to find problems with analogous structures or themes. The retrieved similar problems and their Tool-Integrated Reasoning solutions were then incorporated as few-shot examples in the model's prompts, providing contextual assistance and enhancing the model's problem-solving capabilities. By integrating RAG with keyword-based retrieval, we improved the model's ability to comprehend and solve Bangla Math Olympiad problems effectively; associating contextual examples with each problem aids the model in solving math problems.
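As a minimal illustration, the keyword-based similarity search described above can be sketched as follows. The keyword lists and mini-dataset are hypothetical placeholders, not the actual BDMO data.

```python
def retrieve_similar(problem_keywords, dataset, top_k=2):
    """Rank dataset entries by keyword overlap and return the top-k
    matches, to be used as few-shot examples."""
    scored = []
    for entry in dataset:
        overlap = len(set(problem_keywords) & set(entry["keywords"]))
        if overlap > 0:
            scored.append((overlap, entry))
    # Stable sort keeps input order on ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]

# Hypothetical mini-dataset with manually curated keywords.
dataset = [
    {"problem": "Find the last digit of 7^2024.",
     "keywords": ["number theory", "modular"]},
    {"problem": "How many diagonals does a 12-gon have?",
     "keywords": ["combinatorics", "counting"]},
    {"problem": "Solve n^2 = 1 (mod 8).",
     "keywords": ["number theory", "congruence", "modular"]},
]

examples = retrieve_similar(["modular", "number theory"], dataset)
```

In the actual pipeline, the retrieved problems together with their TIR solutions are prepended to the prompt as few-shot context.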
2.4 Augmentation
To enhance our model's capabilities on the challenges presented by the Bangla Math Olympiad, we expanded and diversified our training dataset using data augmentation techniques. With OpenAI's GPT-4o, we generated five paraphrased, similar versions of each original problem, ensuring that the complexity and nuances of the originals were maintained. This modification introduced heterogeneity into the dataset, which also involved manual categorization of the problem sets. By exposing the model to a greater variety of problem statements, we improved its flexibility and robustness in addressing a range of mathematical challenges in Bangla.
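A sketch of the augmentation request, assuming a simple instruction template; the exact prompt sent to GPT-4o is not specified here, so both the wording and the sample problem are illustrative.

```python
def build_paraphrase_prompt(problem, n_variants=5):
    """Build an instruction asking the API to paraphrase one problem
    n_variants times while preserving its mathematical content."""
    return (
        f"Rewrite the following Bangla math problem in {n_variants} different ways. "
        "Keep the mathematical content, numbers, and final answer unchanged; "
        "only vary the wording.\n\n"
        f"Problem: {problem}"
    )

prompt = build_paraphrase_prompt("একটি সংখ্যার বর্গ ১৪৪ হলে সংখ্যাটি কত?")
```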
2.5 Fine-tuning
Fine-tuning of the Qwen2.5-7B-Instruct model was done using 1x NVIDIA H100 NVL GPU for efficient training. The model underwent three phases to improve its reasoning on Bangla AI mathematical problems and solutions. The first phase involved fine-tuning the model on the translated Numina TIR dataset, which has around 72,000 problems with step-by-step Python solutions. The second stage used the Numina CoT dataset, which includes around 800,000 CoT problem-solution pairs; for our use case, we used 250,000 of them. We then used the GPT-4o API to create synthetic datasets that increase the model's adaptability and versatility. The final stage focused on paraphrasing and augmentation, rephrasing problem sets to simulate different mathematical presentations. The model was evaluated on the BDMO dataset and the test dataset at each level to examine accuracy improvements and the effect of each fine-tuning stage. Several rounds of model training improved the model's ability to understand complex reasoning processes and to comprehend and respond to Bangla AI mathematics. All the models fine-tuned in each step can be found in the Model Availability section.
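The pairs from the translated datasets can be converted into chat-style supervised fine-tuning records along these lines. The field names and system message are assumptions of this sketch, not the exact training schema used.

```python
def to_sft_record(problem_bn, solution_tir):
    """Wrap a Bangla problem and its TIR solution in a chat-format record."""
    return {
        "messages": [
            {"role": "system",
             "content": "Solve the Bangla math problem step by step, "
                        "verifying your reasoning with Python code."},
            {"role": "user", "content": problem_bn},
            {"role": "assistant", "content": solution_tir},
        ]
    }

# Hypothetical example pair.
record = to_sft_record(
    "তিনটি ক্রমিক সংখ্যার যোগফল ৪৮ হলে মধ্যম সংখ্যাটি কত?",
    "48 // 3 = 16, so the middle number is \\boxed{16}.",
)
```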
2.6 Model Architecture and Flow
Figure 1 illustrates the architecture of our system developed to analyze Bangla mathematical problems. The system employs a Large Language Model (LLM) in conjunction with TIR agents and self-consistency [14]. The TIR agents generate Python code iteratively until a correct solution is found, executing it in a REPL (Read-Eval-Print Loop). Following self-consistency prompting, the final answer is chosen by majority voting among the independently generated solutions.
The procedure starts by accepting a Bangla mathematics question as input. Keywords are derived by analyzing the training data, and a Retrieval-Augmented Generation (RAG) technique is employed to identify analogous instances. These are categorized as similar problems matching the question's keywords and then used as few-shot prompts. These examples assist the LLM in producing methodical solutions. Multiple TIR agents develop distinct solution paths to investigate several methods for addressing the problem, subsequently testing them in a Python code-execution environment.
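One TIR agent's generate-execute-retry loop can be sketched as below; `generate_code` stands in for the LLM call, and sandboxing of the executed code is omitted in this sketch.

```python
import io
import contextlib

def run_tir_agent(generate_code, max_iters=3):
    """Execute candidate code in a REPL-style loop until one run
    succeeds, returning its printed output (or None on failure)."""
    for attempt in range(max_iters):
        code = generate_code(attempt)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, {})  # real pipeline: isolated interpreter
            return buffer.getvalue().strip()
        except Exception:
            continue  # on error, ask the model for a revised attempt
    return None

# Stand-in generator: the first attempt is buggy, the second succeeds.
candidates = ["print(1/0)", "print(sum(range(1, 11)))"]
answer = run_tir_agent(lambda i: candidates[i])
```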

Fig. 1. Model Architecture
Upon the failure of a solution path, the system re-evaluates the problem to rectify any inaccuracies. Ultimately, the system aggregates the agents’ results, employing a voting mechanism to choose the optimal option. This methodology guarantees precision and dependability through the integration of reasoning, verification, and consensus techniques.
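The self-consistency vote over the agents' final answers reduces to taking the mode of the non-null results, e.g.:

```python
from collections import Counter

def majority_vote(answers):
    """Choose the most common non-null answer among the TIR agents."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

# Three agents agree on 42, one fails outright, one disagrees.
final = majority_vote([42, 42, 17, None, 42])
```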
2.7 Model Inference and Prompting
The TIR agents were tested using system prompts containing no instructions and prompts containing tailored instructions. These instructions were based on problem categories that were manually curated by investigating the training dataset; some of these can be found here. Moreover, we tested both zero-shot and few-shot prompting techniques. For few-shot examples, we used keyword-search RAG, where problems are categorized by matching relevant keywords.
Prompts Used
Base Prompt
Here is a math problem in Bengali: {problem}
The answer is a non-negative integer. Please reason step by step to solve the problem above. Provide Python code to verify your reasoning.
Put your final integer answer within \boxed{}.
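Since the base prompt asks for the final integer inside \boxed{}, the answer can be parsed from the model's response with a small regular expression. This extraction helper is a sketch, not the exact evaluation code.

```python
import re

def extract_boxed_answer(response):
    """Return the last integer wrapped in \\boxed{...}, or None."""
    matches = re.findall(r"\\boxed\{(\d+)\}", response)
    return int(matches[-1]) if matches else None

ans = extract_boxed_answer("The sum telescopes, so the answer is \\boxed{36}.")
```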
Tailored Instruction for Number Theory Problems
Please write brute force solution codes to answer.
Translation Prompt
Please translate the problem to English first. Then solve the problem.
Advanced Prompt
Step-by-Step Guidance Prompt
Table 1. Comparison of GPT-4o configurations for BDMO dataset problems. Here "basic" means direct prompting without using any tool, and "TIR" denotes that the model is coupled with tool-integrated reasoning.
| Model Configuration | Problem Language | Reasoning Language | Result |
|---|---|---|---|
| GPT4o-Basic | Bangla | Bangla | 50 / 100 |
| GPT4o-Basic | English | Bangla | 62 / 100 |
| GPT4o-Basic | Bangla | English | 64* / 100 |
| GPT4o-Basic | English | English | 60 / 100 |
| GPT4o - TIR | Bangla | Bangla | 63 / 100 |
| GPT4o - TIR | English | Bangla | 61 / 100 |
| GPT4o - TIR | Bangla | English | 67 / 100 |
| GPT4o - TIR | English | English | 71** / 100 |
Table 2. Comparison of GPT-4o configurations for test problems. Here "basic" means direct prompting without using any tool, and "TIR" denotes that the model is coupled with tool-integrated reasoning.
| Model Configuration | Problem Language | Reasoning Language | Result |
|---|---|---|---|
| Deepseek-Math-7B-Instruct-Basic | Bangla | Bangla | 28/100 |
Table 3. Comparison of the DeepSeek-Math model for test problems. Here "basic" means direct prompting without using any tool.

Table 4. Benchmark results for the Numina-7B-TIR model. Here, "no instructions" implies no system prompt was used during inference, while "instructions given" denotes tailored system prompts.
| Number of TIR Agents | Instructions Given | Language | Result |
|---|---|---|---|
| 42 | No | Bangla | 59 / 100 |
| 10 | No | English | 62 / 100 |
| 13 | No | English | 53 / 100 |
| 10 | Yes | English | 64* / 100 |
| 5 | Yes | English | 62 / 100 |
| 5 | Yes | Bangla | 50 / 100 |
| Model Configuration | Problem Language | Samples | Depth | Score |
|---|---|---|---|---|
| Qwen2.5-7B-Math-Instruct | English | 10 | 10 | 59 |
| Qwen2.5-7B-Math-Instruct | English | 5 | 10 | 57 |
| Qwen2.5-7B-Coder-Instruct | English | 5 | | 65 |
| Qwen2.5-7B-Instruct | English | 5 | 5 | 70 |
| Qwen2.5-7B-Instruct | English | 50 | 6 | 70 |
| Qwen2.5-7B-Instruct | Bangla | 50 | 6 | 68 |
| Qwen2.5-7B-Instruct (prompted to translate to English first) | Bangla | 50 | 6 | 64 |
| Qwen2.5-32B-Instruct-AWQ | Bangla | 10 | 4 | *44 |
Table 5. Performance comparison of Qwen2.5 family models for solving Bangla AI math problems.

Table 6. Performance comparison of fine-tuned models of Qwen2.5-7B-Instruct with different datasets for solving Bangla AI math problems.
| Dataset | Data Size | Epochs | Augmentation | Score |
|---|---|---|---|---|
| TIR | 72K | 1 | No | 70 |
| CoT | 0.25M | 1 | No | 40 |
| TIR | 72K TIR | 3 | No | 70 |
| CoT + TIR | 72K TIR + 0.25M CoT | 5 | s | 68 |
| CoT + TIR + RAG | 72K TIR + 0.25M CoT | 5 | Yes | 71* |
| TIR + RAG | 72K TIR | 5 | Yes | 71* |
3 Results
The study aimed to improve Bangla AI mathematics performance by finding the optimal model configuration, fine-tuning it, and applying RAG. The initial stage tested GPT-4o for reasoning and linguistic proficiency; GPT-4o with TIR had the highest accuracy, 130 out of 209, on the BDMO dataset. When Bangla was the main language, the model performed worse, suggesting the need for a more advanced and versatile model. The second stage examined how the number of TIR agents and the instructional setup affected the Numina-7B-TIR model's performance. The maximum score was 64 out of 100 on the test dataset, with 10 agents per problem and clear English directions as the prompt. Next, we tested Qwen2.5-7B-Instruct, a more versatile and effective arrangement, earning 70 with 5 agents and depth 5. In the final testing phase, data augmentation with the TIR, CoT, and RAG datasets enhanced model accuracy slightly, raising it to 71. Finally, a larger variant of Qwen, Qwen2.5-32B-Instruct-AWQ, outperformed all remaining models, signifying that performance improves with parameter scale; it too might be improved with our additional fine-tuning, RAG, and prompting techniques. The study concluded that retrieval-augmented methods and well-curated datasets improve model performance in multilingual, specialized domains.
4 Discussion

In this section, we discuss the various experiments conducted to evaluate the model's performance in solving Bangla AI math problems. Each experiment focuses on specific aspects of problem solving, including problem categorization, prompt structuring, multilingual reasoning, and response nuances. These insights provide valuable guidelines for optimizing the model's performance on different types of mathematical problems in Bangla. From Table 1 and Table 2, it is further observed that accuracy over the datasets improves when the problem statement is in Bangla and the solution steps are described in English, using direct prompting without any tool integration.
4.1 Problem Categorization for Improved Performance
We observed that categorizing problems into specific types improves the model’s ability to generate accurate responses. When problems are grouped by category—such as Number Theory, Geometry, and Combinatorics—the model benefits from prompts tailored to each category’s unique solving techniques.
4.2 Tailored Prompting for Problem Types
Specific types of problems showed better performance when prompted with targeted solving techniques, as summarized in Table 7.
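A simple lookup can attach these category-specific hints to the base prompt. The number theory line is the tailored instruction quoted earlier; the geometry and combinatorics wordings paraphrase the recommendations in Table 7, with the exact phrasing assumed.

```python
# Hypothetical category-to-hint mapping.
CATEGORY_HINTS = {
    "number theory": "Please write brute force solution codes to answer.",
    "geometry": "Please convert the configuration to coordinate geometry "
                "where possible.",
    "combinatorics": "Please consider a dynamic programming formulation.",
}

def tailor_prompt(base_prompt, category):
    """Prepend the category-specific hint, if one exists."""
    hint = CATEGORY_HINTS.get(category.lower())
    return f"{hint}\n\n{base_prompt}" if hint else base_prompt

p = tailor_prompt("Here is a math problem in Bengali: ...", "Number Theory")
```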
4.3 Multilingual Reasoning and Repetitive Querying
In general, the model achieves higher accuracy in English than in Bangla. However, during experimentation, we found that if multiple queries are submitted—one translated into English and the other kept in its original Bangla form—the model can enhance its reasoning capabilities through repetitive reasoning. This approach allows the model to analyze the problem in different linguistic contexts, resulting in more accurate solutions.
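This bilingual strategy can be sketched as running the solver on both versions and keeping the answer the two runs agree on; the fallback to the English run on disagreement is an assumption of this sketch, motivated by English's higher standalone accuracy.

```python
def bilingual_answer(solve, problem_bn, problem_en):
    """Solve the Bangla original and its English translation, then
    reconcile the two answers."""
    answer_bn = solve(problem_bn)
    answer_en = solve(problem_en)
    if answer_bn == answer_en:
        return answer_bn   # agreement across languages: high confidence
    return answer_en       # otherwise fall back to the English run

# Stand-in solver that already knows both phrasings of one problem.
known = {"১২ এর বর্গ কত?": 144, "What is 12 squared?": 144}
result = bilingual_answer(lambda q: known.get(q),
                          "১২ এর বর্গ কত?", "What is 12 squared?")
```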
4.4 Prompt Phrasing and Politeness
Interestingly, a few cases revealed that the model follows prompts more strictly if the word “please” is included at the beginning of the prompt. Although unconventional, this observation suggests that polite phrasing can influence the model’s adherence to prompt instructions in certain scenarios.
5 Conclusion and Future Work
Our overall integrated approach suggests that proper fine-tuning with augmented datasets provides a comparatively slightly better result, and the overall evaluation score would improve further if the model were fine-tuned with a Bangla mathematics dataset. In our experiment, incorporating RAG based on keyword-search similarity did not improve results to a satisfactory level; model performance may be better with stronger retrieval models for RAG. The same applies to the language model itself: the more parameters a model has, the more accurate its results. A more curated and well-organized fine-tuning dataset may lead to further improved results in the future.
Table 7. Observations and Recommendations for Enhancing AI Math Problem Solving
| Aspect | Observation | Recommendation |
|---|---|---|
| Problem categorization | Categorizing problems by type (e.g., number theory, geometry) improves accuracy | Group problems by type to enable tailored prompts |
| Number theory prompts | The model performs better on number theory problems with brute-force hints | Include brute-force hints in number theory prompts |
| Geometry prompts | Coordinate-geometry approaches improved performance in some cases | Include coordinate-geometry conversion hints in geometry prompts |
| Combinatorics and functional equations | Dynamic programming hints enhanced responses to combinatorial, functional, and recursive problems | Use dynamic programming hints for these problem types |
| Multilingual reasoning | The model performs better in English than Bangla; repeated reasoning improves results | Query repeatedly with both the Bangla original and a translated English version |
| Politeness in prompts | Adding "please" to prompts sometimes leads to stricter adherence | Consider polite phrasing such as "please" to increase adherence |
6 Dataset Availability
All the mentioned datasets can be downloaded from here.
7 Model Availability
All the fine-tuned models can be downloaded from here.
