[Paper Translation] SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments


Original paper: https://arxiv.org/pdf/2410.11331v1


SHAKTI: A 2.5 BILLION PARAMETER SMALL LANGUAGE MODEL OPTIMIZED FOR EDGE AI AND LOW-RESOURCE ENVIRONMENTS

Syed Abdul Gaffar Shakhadri Lead AI Developer SandLogic Technologies Pvt Ltd. syed.abdul@sandlogic.com

Rakshit Aralimatti AI Developer SandLogic Technologies Pvt Ltd rakshit.aralimatti@sandlogic.com

ABSTRACT

We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.

1 Introduction

Large Language Models (LLMs), such as GPT-3 [1] and LLaMA [2], have made substantial strides in the field of Natural Language Processing (NLP), delivering state-of-the-art performance across tasks like text summarization, machine translation, and question answering. However, their considerable computational and memory requirements render them impractical for deployment on edge devices such as smartphones, wearables, and Internet of Things (IoT) devices, where low-latency and energy efficiency are critical for real-time applications.

The scaling laws that govern LLM performance suggest that increasing model size and dataset volume leads to better results [3]. However, the computational complexity and resource demands associated with larger models present a significant challenge for real-time deployment on devices with limited hardware. Industries such as healthcare, finance, and customer service require domain-specific insights with minimal latency, which current LLM architectures struggle to provide due to their reliance on cloud infrastructure and specialized hardware.

To address these challenges, Shakti was developed as a solution that balances high performance, efficiency, and scalability, making it well-suited for resource-constrained environments. Shakti combines several technical innovations to enhance its efficiency and performance on edge devices.

One of Shakti’s core innovations is the introduction of Variable Grouped Query Attention (VGQA). VGQA groups multiple queries per key during attention computations, significantly reducing the memory footprint and accelerating inference times. Inspired by models like Mistral and Phi-3 [4, 5], this mechanism ensures that Shakti operates efficiently in low-latency, real-time environments, making it ideal for tasks where speed and resource efficiency are paramount.
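
The paper does not include reference code for VGQA; the sketch below illustrates the underlying grouped-query attention idea, in which a group of query heads shares a single key/value head so the key/value cache shrinks accordingly. The head counts mirror Table 1 (32 query heads, 8 key/value heads); all tensor shapes and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal grouped-query attention sketch.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Each group of n_q_heads // n_kv_heads query heads attends to one shared
    key/value head, shrinking the KV cache relative to full multi-head attention.
    """
    _, n_q_heads, s, d = q.shape
    group_size = n_q_heads // n_kv_heads           # queries sharing one KV head
    k = k.repeat_interleave(group_size, dim=1)     # reuse each KV head for its group
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # scaled dot-product attention
    causal = torch.tril(torch.ones(s, s, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy shapes mirroring Table 1's 32 query heads and 8 key/value heads.
q = torch.randn(1, 32, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=8).shape)  # (1, 32, 16, 64)
```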

In addition to VGQA, Shakti incorporates pre-normalization and SwiGLU activations, which improve the training process by stabilizing gradient flows and preventing issues like vanishing or exploding gradients. Compared to traditional activation functions like ReLU, SwiGLU provides more consistent training results and ensures efficient gradient flow, particularly in resource-constrained environments [6].
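
As a point of reference, the following is a minimal sketch of a pre-normalized feed-forward block with a SwiGLU activation (a SiLU-gated linear unit). The use of LayerNorm, the hidden size, and the residual wiring are assumptions made for this example rather than details taken from the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Pre-norm feed-forward block: normalize the input first, then apply
    SwiGLU(x) = SiLU(x W_gate) * (x W_up), followed by a down-projection."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)        # pre-normalization before the block
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        h = self.norm(x)                     # stabilizes gradients during training
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

block = SwiGLUFeedForward(dim=256, hidden_dim=512)
print(block(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```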

To handle long text sequences without increasing computational overhead, Shakti integrates Rotary Positional Embeddings (RoPE) [7]. RoPE enhances the model’s ability to process longer sequences efficiently, making it suitable for tasks such as document summarization and complex queries, all while maintaining low memory usage.
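
For illustration, RoPE encodes position by rotating pairs of query/key channels by position-dependent angles, so relative offsets are captured without a separate positional-embedding table. The sketch below assumes the common rotate-by-halves formulation; the base of 500,000 echoes Table 1, while the other shapes are placeholders.

```python
import torch

def rotary_embedding(x, base=500_000.0):
    """Apply RoPE to x of shape (batch, seq, heads, head_dim).

    Channel pair i is rotated by angle pos * base**(-i / (head_dim / 2)),
    so relative positions are encoded with no extra memory for long contexts.
    """
    _, s, _, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (s, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 4096, 32, 64)    # context length and head count from Table 1
print(rotary_embedding(q).shape)    # torch.Size([1, 4096, 32, 64])
```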

Moreover, Shakti’s versatility extends to its ability to handle domain-specific tasks through fine-tuning on datasets enriched with vernacular languages. This fine-tuning enables the model to perform exceptionally well in multilingual environments, particularly in regions where low-resource languages such as Hindi, Kannada, and Telugu dominate. In contrast to global models that often struggle in these markets, Shakti offers robust performance across both multilingual and domain-specific contexts.

These innovations position Shakti as a highly efficient and scalable solution for on-device AI. By delivering high performance while optimizing for energy efficiency and low-latency applications, Shakti addresses the growing demand for real-time AI across industries that require localized AI solutions and low-resource deployments. Its ability to balance these demands ensures that it remains competitive against larger models in real-world AI applications, particularly in resource-constrained environments.

2 Related Work: Transformer Architectures, Small Language Models, and On-Device AI

The Transformer architecture, introduced by Vaswani et al. [8], revolutionized Natural Language Processing (NLP) by leveraging the self-attention mechanism, which allowed for parallel computation and greater scalability. This innovation paved the way for Large Language Models (LLMs) to achieve state-of-the-art performance across tasks such as text generation, translation, and question answering. However, the computational and memory requirements of these models pose challenges for deployment on resource-constrained devices such as smartphones and IoT devices.

2.1 Evolution of Transformer Architectures

The original Transformer model [8] replaced traditional sequence models like LSTMs and GRUs with a multi-head self-attention mechanism, enabling faster and more accurate training on large datasets. Since then, models like BERT [9], GPT-3 [1], and T5 [10] have expanded the size and scale of these architectures by leveraging massive datasets and computational resources to achieve breakthroughs in language understanding and text generation.

Scaling models to billions of parameters, however, makes them impractical for use in low-resource environments. LLaMA [2] addressed part of this problem by introducing optimizations such as pre-normalization and Rotary Positional Embeddings (RoPE) [7], significantly reducing memory usage while maintaining competitive performance. These innovations set a new standard for balancing performance and efficiency in LLMs.

Further advancements came with models like Mistral 7B [4] and Phi-3 Mini [5], which introduced techniques such as Grouped Query Attention (GQA) and sliding window attention, enhancing inference efficiency by reducing redundant computations. These models illustrate efforts to optimize Transformer architectures for real-time applications on devices with limited memory and processing power.

2.2 Small Language Models (SLMs)

The rise of Small Language Models (SLMs) has made it possible to deploy AI efficiently on resource-constrained devices. DistilBERT [11], for example, uses knowledge distillation, which transfers knowledge from a larger "teacher" model to a smaller "student" model, retaining much of the performance while significantly reducing the number of parameters. Other models like TinyBERT [12] and MobileBERT [13] have adopted similar techniques, cutting computational costs by over 40%.
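
Knowledge distillation of this kind typically blends a soft loss against the teacher's output distribution with the usual hard-label loss; the sketch below shows that generic form. The temperature and weighting values are illustrative and not the exact DistilBERT recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft loss against the teacher's softened distribution with the
    standard hard-label cross-entropy. T softens both distributions; alpha
    balances the two terms."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 30522)             # student logits over a vocabulary
teacher = torch.randn(8, 30522)             # frozen teacher logits
labels = torch.randint(0, 30522, (8,))
print(distillation_loss(student, teacher, labels).item())
```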

In addition to knowledge distillation, techniques like model pruning and quantization have also been applied to optimize models for on-device deployment. Model pruning removes parameters that minimally impact performance, while quantization reduces the precision of weights and activations, thereby lowering memory usage and computation time [14]. These techniques have been used in models like MobileBERT [13] and EdgeBERT [15], making them more suitable for mobile and IoT devices.
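
As a conceptual illustration of quantization, the sketch below maps float32 weights to int8 with a single per-tensor scale, roughly quartering storage. Production toolchains (and the 9 GB to 4 GB reduction reported in Table 1) use more sophisticated schemes; the matrix size here is hypothetical.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store int8 values plus one
    float scale, roughly a quarter of the memory of a float32 tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                       # a hypothetical weight matrix
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().max().item()
print(q.dtype, f"max reconstruction error {err:.4f}")
```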

2.3 Advances in On-Device AI

As the need for on-device AI grows, deploying models on edge devices such as smartphones and IoT systems has gained importance due to the demand for real-time inference and the need for improved data privacy. Models designed for edge devices must balance accuracy, speed, memory efficiency, and energy consumption to be effective in resource-constrained environments.

Models such as EdgeBERT [15] and Edge Transformers [16] have introduced lightweight attention mechanisms that reduce memory requirements while maintaining high performance. Techniques like block-wise memory management [14] and sliding window attention allow these models to process sequences efficiently without sacrificing accuracy. Additionally, quantization enables these models to run with 8-bit precision or lower, significantly reducing computational load and making them well-suited for low-power devices.
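
A sliding-window mask of this kind limits each position to the most recent tokens, so attention cost grows linearly with sequence length; a small sketch follows, with the window size chosen arbitrarily for the example.

```python
import torch

def sliding_window_mask(seq_len, window):
    """Boolean mask where position i may attend only to positions
    j in [i - window + 1, i]; True marks allowed attention."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]           # no attending to future tokens
    recent = idx[:, None] - idx[None, :] < window   # only the last `window` tokens
    return causal & recent

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Each row has at most 3 ones, so per-token attention work is bounded by the window.
```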

In this context, Shakti’s architecture builds on these advances by integrating key techniques from LLaMA and Mistral, while introducing innovations such as Variable Grouped Query Attention (VGQA) and SwiGLU activations, allowing it to deliver real-time performance on edge devices without extensive hardware. This makes Shakti an ideal solution for on-device AI applications, where low-latency and efficiency are critical.

3 Architecture of Shakti-LLM

The architecture of Shakti-LLM (see Table 1) is optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. With 2.5 billion parameters and a context length of 4096 tokens, Shakti is designed for high-performance NLP with a focus on real-time applications.

One of the core innovations in Shakti-LLM is the use of Variable Grouped Query Attention (VGQA), inspired by models such as Mistral 7B [4] and Phi-3 Mini [5]. VGQA allows multiple queries to share a single key during the attention process, significantly reducing the memory footprint while improving inference times. This makes Shakti highly suitable for low-latency applications, such as virtual assistants and smart home devices.

Shakti also employs Pre-Normalization and SwiGLU activations to stabilize the training process. By normalizing the input before it is passed to the attention mechanism, Pre-Normalization prevents vanishing or exploding gradients, while SwiGLU activation functions enhance gradient flow, resulting in more efficient training [17]. These methods provide significant improvements over traditional activation functions like ReLU.

To handle long sequences efficiently, Shakti incorporates Rotary Positional Embeddings (RoPE) [7]. RoPE enables Shakti to process long text contexts—such as document summarization and multi-turn dialogue processing—without significantly increasing memory usage [7]. This makes Shakti particularly effective in tasks that require the handling of long sequences while maintaining a small computational footprint.

Lastly, Direct Preference Optimization (DPO) is used to fine-tune Shakti based on ranked human feedback [17]. Unlike Reinforcement Learning from Human Feedback (RLHF), which requires a reward model, DPO simplifies the optimization process by directly learning from human preferences through a log-sigmoid loss function [17, 18]. This ensures that Shakti generates user-aligned outputs efficiently, making it ideal for real-time AI applications like customer support and healthcare [19].

Shakti-LLM is designed to be scalable, efficient, and highly adaptable for real-world use cases in industries such as healthcare, finance, and customer service, where low-latency and real-time performance are essential.

Shakti supports sliding window attention and Key-Value Caching, ensuring efficient processing of long sequences during inference. These optimizations make Shakti suitable for edge computing environments, where memory efficiency and real-time processing are essential.
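
Key-Value caching stores the keys and values of already-processed tokens so that each decoding step only computes projections for the newest token. The sketch below shows the bookkeeping only; the shapes follow Table 1's eight key/value heads, and everything else is illustrative.

```python
import torch

class KVCache:
    """Append-only cache of past keys/values for autoregressive decoding."""

    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, kv_heads, 1, head_dim) for the newest token only.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

cache = KVCache()
for step in range(3):                       # three decoding steps
    k_new = torch.randn(1, 8, 1, 64)        # 8 key/value heads, as in Table 1
    v_new = torch.randn(1, 8, 1, 64)
    k, v = cache.update(k_new, v_new)
print(k.shape)                              # torch.Size([1, 8, 3, 64])
```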

4 Training and Fine-Tuning Methodologies

Shakti-LLM’s training and fine-tuning processes are designed to optimize its performance for both general NLP tasks and domain-specific applications. This section provides an in-depth look at the methodologies employed to enhance the model’s capabilities.

Table 1: Specifications of Shakti-LLM

Feature | Shakti-LLM Specification
Model Parameters | 2.5 billion
Layers | 16
Model Dimension | 4096
FFN Dimension | 4096
Attention Heads | 32
Key/Value Heads | 8
Peak Learning Rate | 3.6e-5
Activation Function | SwiGLU
Vocabulary Size | 128,256
Positional Encoding | RoPE (θ = 500,000)
GPU Memory (original) | 9 GB
GPU Memory (quantized) | 4 GB

4.1 Continued Pretraining (CPT)

The first stage of Shakti-LLM’s training involves Continued Pretraining (CPT), where the model is exposed to large-scale datasets to capture general language structures and patterns. The model is initialized with random weights and tasked with predicting the next token in a sequence, which is a standard approach for language model pretraining.
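
The next-token objective referred to here is the standard causal language-modeling loss; a minimal sketch is shown below, with the vocabulary size taken from Table 1 and the batch shapes chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Standard causal LM objective: predict token t+1 from tokens up to t.
    logits: (batch, seq, vocab); token_ids: (batch, seq)."""
    shift_logits = logits[:, :-1, :]        # predictions for positions 0..seq-2
    shift_labels = token_ids[:, 1:]         # the tokens that actually follow
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

logits = torch.randn(2, 16, 128256)         # vocabulary size from Table 1
tokens = torch.randint(0, 128256, (2, 16))
print(next_token_loss(logits, tokens).item())
```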

Shakti-LLM’s training corpus includes approximately 2.8 trillion tokens, sourced from high-quality datasets, including:

During this phase, Shakti-LLM is trained with a learning rate of 2.0×10⁻⁴ and a maximum sequence length of 4096 tokens. The gradient accumulation steps are set to 1, with a warmup ratio of 0.1 to ensure smooth convergence. These hyperparameters are carefully selected to balance training speed with the model’s ability to capture complex linguistic patterns across the large-scale datasets [20].
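
Collected in one place, the CPT hyperparameters stated above might look as follows; the dataclass and its field names are illustrative and not tied to any particular training framework.

```python
from dataclasses import dataclass

@dataclass
class CPTConfig:
    """Continued-pretraining hyperparameters as reported in the text."""
    learning_rate: float = 2.0e-4         # peak learning rate for the CPT stage
    max_seq_length: int = 4096            # maximum sequence length in tokens
    gradient_accumulation_steps: int = 1
    warmup_ratio: float = 0.1             # fraction of steps used for LR warmup

print(CPTConfig())
```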

4.2 Supervised Fine-Tuning (SFT)

After pretraining, Shakti-LLM undergoes Supervised Fine-Tuning (SFT) to adapt to specific, task-oriented datasets. This phase exposes the model to labeled examples from a wide range of applications, improving its ability to handle domain-specific tasks and provide contextually relevant responses.

The fine-tuning process employs a learning rate of 2.0×10⁻⁵ with a cosine decay learning rate scheduler, which adaptively reduces the learning rate as training progresses. The maximum sequence length remains at 4096 tokens, and the gradient accumulation steps are set to 1. These hyperparameters are optimized to allow for fine-grained adjustments, ensuring that Shakti-LLM excels in tasks requiring domain-specific knowledge while maintaining generalization capabilities.
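
For illustration, a cosine-decay schedule at the stated 2.0×10⁻⁵ peak learning rate can be set up with PyTorch's built-in scheduler as sketched below; the stand-in model and the total step count are placeholders.

```python
import torch

model = torch.nn.Linear(16, 16)                            # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=2.0e-5)     # SFT peak learning rate
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

lrs = []
for _ in range(1000):                                       # placeholder step count
    opt.step()                                              # one (dummy) optimizer step
    sched.step()                                            # cosine decay of the LR
    lrs.append(sched.get_last_lr()[0])
print(f"lr decays from {lrs[0]:.2e} to {lrs[-1]:.2e}")
```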

Key datasets such as Ultrachat 200K and Cosmedia V2 are leveraged during SFT to enhance Shakti-LLM’s conversational abilities and its capacity to understand complex domains like healthcare and finance. The fine-tuning stage enables the model to follow specific instructions more effectively and handle real-world user prompts, similar to approaches seen in InstructGPT.

4.3 Direct Preference Optimization (DPO)

In the final stage of training, Direct Preference Optimization (DPO) is employed to align Shakti-LLM’s outputs with human preferences, ensuring that its responses are contextually and ethically aligned. Unlike Reinforcement Learning from Human Feedback (RLHF), which relies on a reward model, DPO simplifies the optimization process by directly learning from ranked human feedback through a log-sigmoid loss function [17].

The DPO process involves presenting human annotators with multiple model outputs for the same input. These outputs are ranked based on relevance, clarity, and appropriateness, and a preference-based loss function is used to adjust the model’s weights accordingly. This ensures that Shakti-LLM generates outputs that are not only accurate but also aligned with ethical considerations and user expectations [19].
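
The preference-based log-sigmoid loss described here corresponds to the standard DPO objective; a generic sketch over per-sequence log-probabilities follows. The β of 0.01 is taken from the hyperparameters in the next paragraph, while the reference-model terms and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """Standard DPO objective: push the policy to prefer the chosen answer
    over the rejected one, relative to a frozen reference model, via a
    log-sigmoid of the scaled log-ratio difference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy per-sequence log-probabilities for a batch of 4 ranked pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```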

For DPO, the model is fine-tuned using a learning rate of 5.0×10⁻⁷, with a beta coefficient of 0.01 to manage optimization momentum. The maximum prompt length is set to 1024 tokens, and the model uses AdamW as the optimizer, with gradient accumulation steps set to 2. This combination of hyperparameters ensures the model is fine-tuned efficiently while retaining high-quality responses [17].
