[论文翻译]SHAKTI:专为边缘AI和低资源环境优化的25亿参数小语言模型


原文地址:https://arxiv.org/pdf/2410.11331v1


SHAKTI: A 2.5 BILLION PARAMETER SMALL LANGUAGE MODEL OPTIMIZED FOR EDGE AI AND LOW-RESOURCE ENVIRONMENTS

SHAKTI:专为边缘AI和低资源环境优化的25亿参数小语言模型

Syed Abdul Gaffar Shakhadri Lead AI Developer SandLogic Technologies Pvt Ltd. syed.abdul@sandlogic.com

Syed Abdul Gaffar Shakhadri 首席AI开发工程师 SandLogic科技有限公司 syed.abdul@sandlogic.com

Rakshit Aralimatti AI Developer SandLogic Technologies Pvt Ltd rakshit.aralimatti@sandlogic.com

Rakshit Aralimatti AI开发工程师 SandLogic科技有限公司 rakshit.aralimatti@sandlogic.com

ABSTRACT

摘要

We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.

我们推出Shakti,这是一款专为智能手机、可穿戴设备和物联网系统等资源受限的边缘设备优化的25亿参数小语言模型。Shakti将高性能自然语言处理(NLP)与优化效率及精度相结合,成为计算资源和内存有限的实时AI应用的理想选择。该模型支持本地语言和领域特定任务,在医疗、金融和客服等行业表现卓越。基准评估表明,Shakti在保持低延迟和设备端高效的同时,性能可媲美更大规模的模型,使其成为边缘AI领域的领先解决方案。

1 Introduction

1 引言

Large Language Models (LLMs), such as GPT-3 [1] and LLaMA [2], have made substantial strides in the field of Natural Language Processing (NLP), delivering state-of-the-art performance across tasks like text summarization, machine translation, and question answering. However, their considerable computational and memory requirements render them impractical for deployment on edge devices such as smartphones, wearables, and Internet of Things (IoT) devices, where low-latency and energy efficiency are critical for real-time applications.

大语言模型 (LLM) ,例如 GPT-3 [1] 和 LLaMA [2] ,在自然语言处理 (NLP) 领域取得了重大进展,在文本摘要、机器翻译和问答等任务中实现了最先进的性能。然而,其庞大的计算和内存需求使得它们难以部署在智能手机、可穿戴设备和物联网 (IoT) 设备等边缘设备上,而这些设备对实时应用的低延迟和能效有着严格要求。

The scaling laws that govern LLM performance suggest that increasing model size and dataset volume leads to better results [3]. However, the computational complexity and resource demands associated with larger models present a significant challenge for real-time deployment on devices with limited hardware. Industries such as healthcare, finance, and customer service require domain-specific insights with minimal latency, which current LLM architectures struggle to provide due to their reliance on cloud infrastructure and specialized hardware.

支配大语言模型性能的扩展规律表明,增大模型规模和数据量能带来更好的效果 [3]。然而,更大模型伴随的计算复杂度和资源需求,对硬件受限设备的实时部署构成了重大挑战。医疗、金融和客服等行业需要低延迟的领域特定洞察,而当前大语言模型架构因依赖云基础设施和专用硬件,难以满足这一需求。

To address these challenges, Shakti was developed as a solution that balances high performance, efficiency, and scalability, making it well-suited for resource-constrained environments. Shakti combines several technical innovations to enhance its efficiency and performance on edge devices.

为解决这些挑战,Shakti 应运而生,它平衡了高性能、高效性和可扩展性,非常适合资源受限的环境。Shakti 结合了多项技术创新,以提升其在边缘设备上的效率和性能。

One of Shakti’s core innovations is the introduction of Variable Grouped Query Attention (VGQA). VGQA groups multiple queries per key during attention computations, significantly reducing the memory footprint and accelerating inference times. Inspired by models like Mistral and Phi-3 [4, 5], this mechanism ensures that Shakti operates efficiently in low-latency, real-time environments, making it ideal for tasks where speed and resource efficiency are paramount.

Shakti的核心创新之一是引入了可变分组查询注意力机制 (Variable Grouped Query Attention,VGQA)。VGQA在注意力计算过程中将多个查询分组到每个键上,显著降低了内存占用并加速了推理时间。受Mistral和Phi-3等模型的启发 [4, 5],该机制确保Shakti在低延迟、实时环境中高效运行,使其成为速度和资源效率至关重要的任务的理想选择。
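
The paper gives no implementation details for VGQA. As a rough, hypothetical sketch of the grouped-query attention idea it builds on (Table 1 lists 32 query heads and 8 key/value heads, i.e. 4 query heads per KV head), in NumPy; shapes and values here are illustrative only, not the actual implementation:

论文未给出VGQA的实现细节。下面用NumPy粗略示意其所基于的分组查询注意力思路(表1为32个查询头、8个键/值头,即每4个查询头共享1个KV头);形状与数值仅作演示,并非原始实现:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    # Each group of n_heads // n_kv_heads query heads shares one KV head.
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                      # index of the shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 8, 16))   # 32 query heads (per Table 1)
k = rng.standard_normal((8, 8, 16))    # 8 KV heads -> 4x smaller KV cache
v = rng.standard_normal((8, 8, 16))
print(grouped_query_attention(q, k, v, n_kv_heads=8).shape)
```

Because only 8 of the 32 heads store keys and values, the KV cache shrinks by 4x, which is where the inference-time memory savings come from. 由于32个头中只有8个需要存储键/值,KV缓存缩小为原来的1/4,这正是推理时内存节省的来源。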

In addition to VGQA, Shakti incorporates pre-normalization and SwiGLU activations, which improve the training process by stabilizing gradient flows and preventing issues like vanishing or exploding gradients. Compared to traditional activation functions like ReLU, SwiGLU provides more consistent training results and ensures efficient gradient flow, particularly in resource-constrained environments [6].

除了VGQA,Shakti还采用了预归一化(pre-normalization)和SwiGLU激活函数,通过稳定梯度流并防止梯度消失或爆炸等问题来优化训练过程。与传统激活函数(如ReLU)相比,SwiGLU能提供更稳定的训练结果,并确保高效的梯度流动,尤其在资源受限的环境中表现优异 [6]。
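
A minimal sketch of the SwiGLU element, assuming the standard formulation (SiLU-gated linear unit); the toy weights below are illustrative, not Shakti's:

SwiGLU单元的最小示意,按标准定义(SiLU门控线性单元)实现;以下玩具权重仅为演示,并非Shakti的真实参数:

```python
import math

def silu(x):
    # Swish/SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    # Elementwise SwiGLU unit: silu(w_gate * x) gates the linear path w_up * x.
    return silu(w_gate * x) * (w_up * x)

# Unlike ReLU's hard zero cutoff, the gate is smooth and stays slightly
# nonzero for small negative inputs, which helps gradients keep flowing.
print(silu(-0.5), swiglu(1.0, 0.7, 1.3))
```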

To handle long text sequences without increasing computational overhead, Shakti integrates Rotary Positional Embeddings (RoPE) [7]. RoPE enhances the model’s ability to process longer sequences efficiently, making it suitable for tasks such as document summarization and complex queries, all while maintaining low memory usage.

为在不增加计算开销的情况下处理长文本序列,Shakti集成了旋转位置编码 (RoPE) [7]。RoPE增强了模型高效处理长序列的能力,使其适用于文档摘要和复杂查询等任务,同时保持较低的内存使用量。
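
A compact sketch of RoPE: each feature pair is rotated by a position-dependent angle, so dot products depend only on relative offsets. We assume the "500,000" in Table 1 is the RoPE base θ; this is a NumPy illustration, not the paper's code:

RoPE的简要示意:每个特征对按与位置相关的角度旋转,使点积只依赖相对位移。此处假设表1中的"500,000"即RoPE的基数θ;以下NumPy代码仅作演示,并非论文原实现:

```python
import numpy as np

def rope(x, base=500_000.0):
    # x: (seq_len, dim), dim even. Rotate feature pair (i, i + dim/2)
    # at position p by angle p * base**(-2i/dim).
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)
    ang = np.outer(np.arange(seq), inv_freq)       # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).standard_normal((6, 8))
y = rope(x)
```

Since each pair is a pure rotation, vector norms are unchanged; only the phase encodes position, so no extra parameters or memory are needed for longer contexts. 由于每对特征只做纯旋转,向量范数不变;位置信息仅编码在相位中,处理更长上下文无需额外参数或内存。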

Moreover, Shakti’s versatility extends to its ability to handle domain-specific tasks through fine-tuning on datasets enriched with vernacular languages. This fine-tuning enables the model to perform exceptionally well in multilingual environments, particularly in regions where low-resource languages such as Hindi, Kannada, and Telugu dominate. In contrast to global models that often struggle in these markets, Shakti offers robust performance across both multilingual and domain-specific contexts.

此外,Shakti的多功能性还体现在其能够通过在富含本地语言的数据集上进行微调来处理特定领域的任务。这种微调使该模型在多语言环境中表现尤为出色,尤其是在印地语、卡纳达语和泰卢固语等低资源语言占主导地位的地区。与在这些市场中常常表现不佳的全球模型相比,Shakti在多语言和特定领域场景下均展现出强大的性能。

These innovations position Shakti as a highly efficient and scalable solution for on-device AI. By delivering high performance while optimizing for energy efficiency and low-latency applications, Shakti addresses the growing demand for real-time AI across industries that require localized AI solutions and low-resource deployments. Its ability to balance these demands ensures that it remains competitive against larger models in real-world AI applications, particularly in resource-constrained environments.

这些创新使Shakti成为设备端AI(Artificial Intelligence)领域高效且可扩展的解决方案。通过提供高性能同时优化能效和低延迟应用,Shakti满足了各行业对实时AI日益增长的需求,这些行业需要本地化AI解决方案和低资源部署。其平衡这些需求的能力确保了Shakti在实际AI应用中(特别是在资源受限环境中)仍能保持与大型模型的竞争力。

2 Related Work: Transformer Architectures, Small Language Models, and On-Device AI

2 相关工作:Transformer架构、小语言模型与端侧AI

The Transformer architecture, introduced by Vaswani et al. [8], revolutionized Natural Language Processing (NLP) by leveraging the self-attention mechanism, which allowed for parallel computation and greater scalability. This innovation paved the way for Large Language Models (LLMs) to achieve state-of-the-art performance across tasks such as text generation, translation, and question answering. However, the computational and memory requirements of these models pose challenges for deployment on resource-constrained devices such as smartphones and IoT devices.

Transformer 架构由 Vaswani 等人 [8] 提出,它通过自注意力机制实现了并行计算和更强的可扩展性,彻底改变了自然语言处理 (NLP) 领域。这一创新为大语言模型 (LLM) 在文本生成、翻译和问答等任务上实现最先进性能铺平了道路。然而,这些模型对计算和内存的高需求,给智能手机和物联网设备等资源受限设备的部署带来了挑战。

2.1 Evolution of Transformer Architectures

2.1 Transformer架构的演进

The original Transformer model [8] replaced traditional sequence models like LSTMs and GRUs with a multi-head self-attention mechanism, enabling faster and more accurate training on large datasets. Since then, models like BERT [9], GPT-3 [1], and T5 [10] have expanded the size and scale of these architectures by leveraging massive datasets and computational resources to achieve breakthroughs in language understanding and text generation.

最初的Transformer模型[8]用多头自注意力机制取代了LSTM和GRU等传统序列模型,从而能够在大规模数据集上实现更快更准确的训练。此后,BERT[9]、GPT-3[1]和T5[10]等模型通过利用海量数据和计算资源,扩展了这些架构的规模和体量,在语言理解和文本生成领域取得了突破性进展。

Scaling models to billions of parameters, however, makes them impractical for use in low-resource environments. For instance, LLaMA [2] introduced several optimizations such as pre-normalization and Rotary Positional Embeddings (RoPE) [7], significantly reducing memory usage while maintaining competitive performance. These innovations set a new standard for balancing performance and efficiency in LLMs.

然而,将模型扩展到数十亿参数规模使其难以在资源受限的环境中部署。例如,LLaMA [2] 引入了预归一化 (pre-normalization) 和旋转位置嵌入 (Rotary Positional Embeddings, RoPE) [7] 等优化技术,在保持竞争力的同时显著降低了内存占用。这些创新为大语言模型的性能与效率平衡设立了新标准。

Further advancements came with models like Mistral 7B [4] and Phi-3 Mini [5], which introduced techniques such as Grouped Query Attention (GQA) and sliding window attention, enhancing inference efficiency by reducing redundant computations. These models illustrate efforts to optimize Transformer architectures for real-time applications on devices with limited memory and processing power.

随后出现了如 Mistral 7B [4] 和 Phi-3 Mini [5] 等模型,它们引入了分组查询注意力 (GQA) 和滑动窗口注意力等技术,通过减少冗余计算提升了推理效率。这些模型展现了为内存和处理能力有限的设备优化 Transformer 架构以支持实时应用的努力。

2.2 Small Language Models (SLMs)

2.2 小语言模型 (SLMs)

The rise of Small Language Models (SLMs) has made it possible to deploy AI efficiently on resource-constrained devices. DistilBERT [11], for example, uses knowledge distillation, which transfers knowledge from a larger "teacher" model to a smaller "student" model, retaining much of the performance while significantly reducing the number of parameters. Other models like TinyBERT [12] and MobileBERT [13] have adopted similar techniques, cutting computational costs by over 40%.

小语言模型 (Small Language Models, SLMs) 的兴起使得在资源受限设备上高效部署AI成为可能。例如 DistilBERT [11] 采用知识蒸馏技术,将大型"教师"模型的知识迁移到小型"学生"模型中,在显著减少参数量的同时保留了大部分性能。TinyBERT [12] 和 MobileBERT [13] 等其他模型也采用了类似技术,将计算成本降低了40%以上
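
The distillation objective mentioned above is typically the Hinton-style KL divergence between temperature-softened teacher and student distributions; a minimal sketch (toy logits, not from any of the cited models):

上文提到的蒸馏目标通常是Hinton式的KL散度,在温度软化后的教师/学生分布之间计算;以下为最小示意(logits为玩具数值,并非引用模型的真实输出):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T flattens the distribution.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradients keep the same magnitude as hard-label training.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(distillation_loss([0.2, 1.9, 2.8], [1.0, 2.0, 3.0]))
```

The loss is zero when the student matches the teacher exactly and grows as the distributions diverge. 当学生与教师输出完全一致时损失为零,分布偏离越大损失越大。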

In addition to knowledge distillation, techniques like model pruning and quantization have also been applied to optimize models for on-device deployment. Model pruning removes parameters that minimally impact performance, while quantization reduces the precision of weights and activations, thereby lowering memory usage and computation time [14]. These techniques have been used in models like MobileBERT [13] and EdgeBERT [15], making them more suitable for mobile and IoT devices.

除了知识蒸馏 (knowledge distillation) 外,模型剪枝 (model pruning) 和量化 (quantization) 等技术也被应用于优化设备端部署模型。模型剪枝通过移除对性能影响微小的参数来精简模型,而量化则通过降低权重和激活值的精度来减少内存占用和计算时间 [14]。这些技术已应用于 MobileBERT [13] 和 EdgeBERT [15] 等模型,使其更适用于移动设备和物联网设备。
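
As a concrete illustration of quantization's precision/memory trade-off, here is symmetric per-tensor int8 quantization in its simplest form (a conceptual sketch; production schemes like the Q4_K_M/Q5_K_M formats used later in this paper are more elaborate):

作为量化在精度与内存之间权衡的具体示例,下面是最简单形式的对称逐张量int8量化(仅为概念示意;本文后面用到的Q4_K_M/Q5_K_M等生产级方案要复杂得多):

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8: w ≈ scale * q with q in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * qi for qi in q]

w = [0.31, -1.27, 0.005, 0.89]
q, s = quantize_int8(w)
print(q, dequantize(q, s))
```

Each weight now occupies 1 byte instead of 4 (fp32), and the round-trip error is bounded by half the scale. 每个权重从4字节(fp32)降为1字节,往返误差不超过scale的一半。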

2.3 Advances in On-Device AI

2.3 设备端人工智能 (On-Device AI) 进展

As the need for on-device AI grows, deploying models on edge devices such as smartphones and IoT systems has gained importance due to the demand for real-time inference and the need for improved data privacy. Models designed for edge devices must balance accuracy, speed, memory efficiency, and energy consumption to be effective in resource-constrained environments.

随着设备端AI需求的增长,由于实时推理需求和提升数据隐私的要求,在智能手机及物联网系统等边缘设备上部署模型变得愈发重要。为边缘设备设计的模型必须在精度、速度、内存效率和能耗之间取得平衡,才能在资源受限的环境中高效运行。

Models such as EdgeBERT [15] and Edge Transformers [16] have introduced lightweight attention mechanisms that reduce memory requirements while maintaining high performance. Techniques like block-wise memory management [14] and sliding window attention allow these models to process sequences efficiently without sacrificing accuracy. Additionally, quantization enables these models to run with 8-bit precision or lower, significantly reducing computational load and making them well-suited for low-power devices.

EdgeBERT [15] 和 Edge Transformers [16] 等模型引入了轻量级注意力机制,在保持高性能的同时降低了内存需求。块式内存管理 [14] 和滑动窗口注意力等技术使这些模型能够高效处理序列而不牺牲准确性。此外,量化技术使模型能以 8 位或更低精度运行,显著减少计算负载,使其非常适合低功耗设备。

In this context, Shakti’s architecture builds on these advances by integrating key techniques from LLaMA and Mistral, while introducing innovations such as Variable Grouped Query Attention (VGQA) and SwiGLU activations, allowing it to deliver real-time performance on edge devices without extensive hardware. This makes Shakti an ideal solution for on-device AI applications, where low-latency and efficiency are critical.

在此背景下,Shakti的架构通过整合LLaMA和Mistral的关键技术,同时引入可变分组查询注意力 (Variable Grouped Query Attention, VGQA) 和SwiGLU激活函数等创新,使其能够在无需大量硬件支持的情况下,在边缘设备上实现实时性能。这使得Shakti成为设备端AI应用的理想解决方案,其中低延迟和高效率至关重要。

3 Architecture of Shakti-LLM

3 Shakti-LLM 架构

The architecture of Shakti-LLM (see Table 1) is optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. With 2.5 billion parameters and a context length of 4096 tokens, Shakti is designed for high-performance NLP with a focus on real-time applications.

Shakti-LLM的架构(参见表1)针对智能手机、可穿戴设备和物联网系统等资源受限的边缘设备环境进行了优化。Shakti拥有25亿参数和4096 token的上下文长度,专为高性能自然语言处理设计,尤其注重实时应用。

One of the core innovations in Shakti-LLM is the use of Variable Grouped Query Attention (VGQA), inspired by models such as Mistral 7B [4] and Phi-3 Mini [5]. VGQA allows multiple queries to share a single key during the attention process, significantly reducing the memory footprint while improving inference times. This makes Shakti highly suitable for low-latency applications, such as virtual assistants and smart home devices.

Shakti-LLM的核心创新之一是采用了可变分组查询注意力 (Variable Grouped Query Attention, VGQA),其灵感来源于Mistral 7B [4]和Phi-3 Mini [5]等模型。VGQA允许在注意力过程中多个查询共享一个键,显著降低了内存占用,同时提升了推理速度。这使得Shakti特别适合低延迟应用场景,例如虚拟助手和智能家居设备。

Shakti also employs Pre-Normalization and SwiGLU activations to stabilize the training process. By normalizing the input before it is passed to the attention mechanism, Pre-Normalization prevents vanishing or exploding gradients, while SwiGLU activation functions enhance gradient flow, resulting in more efficient training [17]. These methods provide significant improvements over traditional activation functions like ReLU.

Shakti还采用了预归一化(Pre-Normalization)和SwiGLU激活函数来稳定训练过程。通过在输入传递到注意力机制之前进行归一化处理,预归一化能够防止梯度消失或爆炸,而SwiGLU激活函数则增强了梯度流动,从而实现更高效的训练[17]。这些方法相比ReLU等传统激活函数带来了显著改进。

To handle long sequences efficiently, Shakti incorporates Rotary Positional Embeddings (RoPE) [7]. RoPE enables Shakti to process long text contexts—such as document summarization and multi-turn dialogue processing—without significantly increasing memory usage [7]. This makes Shakti particularly effective in tasks that require the handling of long sequences while maintaining a small computational footprint.

为高效处理长序列,Shakti采用了旋转位置编码(Rotary Positional Embeddings,RoPE)[7]。RoPE使Shakti能够处理长文本上下文(例如文档摘要和多轮对话处理)而不会显著增加内存占用[7]。这使得Shakti在需要处理长序列且保持较小计算量的任务中表现尤为出色。

Lastly, Direct Preference Optimization (DPO) is used to fine-tune Shakti based on ranked human feedback [17]. Unlike Reinforcement Learning from Human Feedback (RLHF), which requires a reward model, DPO simplifies the optimization process by directly learning from human preferences through a log-sigmoid loss function [17, 18]. This ensures that Shakti generates user-aligned outputs efficiently, making it ideal for real-time AI applications like customer support and healthcare [19].

最后,采用直接偏好优化 (Direct Preference Optimization, DPO) 基于排序的人类反馈对 Shakti 进行微调 [17]。与需要奖励模型的基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 不同,DPO 通过 log-sigmoid 损失函数直接从人类偏好中学习,从而简化了优化流程 [17, 18]。这确保 Shakti 能高效生成符合用户需求的输出,使其非常适合客户支持和医疗保健等实时 AI 应用 [19]。

Shakti-LLM is designed to be scalable, efficient, and highly adaptable for real-world use cases in industries such as healthcare, finance, and customer service, where low-latency and real-time performance are essential.

Shakti-LLM 旨在实现可扩展性、高效性和高度适应性,适用于医疗保健、金融和客户服务等行业中对低延迟和实时性能要求极高的实际应用场景。

Shakti supports sliding window attention and Key-Value Caching, ensuring efficient processing of long sequences during inference. These optimizations make Shakti suitable for edge computing environments, where memory efficiency and real-time processing are essential.

Shakti支持滑动窗口注意力机制和键值缓存(Key-Value Caching),确保在推理过程中高效处理长序列。这些优化使Shakti非常适合内存效率和实时处理至关重要的边缘计算环境。
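
The paper does not show how these two pieces interact; conceptually, a sliding-window KV cache keeps only the most recent `window` key/value pairs, so memory stays constant no matter how long generation runs. A hypothetical pure-Python sketch (class name and shapes are ours, not Shakti's):

论文未展示两者如何配合;概念上,滑动窗口KV缓存只保留最近`window`个键/值对,因此无论生成多长,内存都保持恒定。以下为假设性的纯Python示意(类名与形状为本文虚构,并非Shakti实现):

```python
import math
from collections import deque

class SlidingWindowKVCache:
    # deque(maxlen=window) silently evicts the oldest entry on append,
    # bounding memory at O(window) during autoregressive decoding.
    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Softmax attention of query q over the cached window only.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in self.keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        return [sum(wi * v[j] for wi, v in zip(w, self.values)) / z
                for j in range(len(q))]

cache = SlidingWindowKVCache(window=4)
for t in range(10):
    cache.append([float(t), 0.0], [float(t), 1.0])
print(len(cache.keys), cache.attend([1.0, 0.0]))
```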

4 Training and Fine-Tuning Methodologies

4 训练与微调方法

Shakti-LLM’s training and fine-tuning processes are designed to optimize its performance for both general NLP tasks and domain-specific applications. This section provides an in-depth look at the methodologies employed to enhance the model’s capabilities.

Shakti-LLM的训练和微调流程旨在优化其在通用NLP任务和领域特定应用中的性能。本节深入探讨了用于增强模型能力的方法论。

Table 1: Specifications of Shakti-LLM

表 1: Shakti-LLM 规格参数

特性 Shakti-LLM 规格
模型参数 25亿
层数 16
模型维度 4096
FFN维度 4096
注意力头数 32
键/值头数 8
峰值学习率 3.6e-5
激活函数 SwiGLU
词表大小 128256
位置编码 RoPE (θ = 500,000)
GPU占用(原始) 9 GB
GPU占用(量化后) 4 GB

4.1 Continued Pretraining (CPT)

4.1 持续预训练 (CPT)

The first stage of Shakti-LLM’s training involves Continued Pretraining (CPT), where the model is exposed to large-scale datasets to capture general language structures and patterns. The model is initialized with random weights and tasked with predicting the next token in a sequence, which is a standard approach for language model pretraining.

Shakti-LLM 训练的第一阶段是持续预训练 (CPT) ,模型通过接触大规模数据集来捕捉通用语言结构和模式。模型以随机权重初始化,并负责预测序列中的下一个 Token ,这是语言模型预训练的标准方法。

Shakti-LLM’s training corpus includes approximately 2.8 trillion tokens, sourced from high-quality datasets, including:

Shakti-LLM的训练语料库包含约2.8万亿token,数据源自高质量数据集,包括:

During this phase, Shakti-LLM is trained with a learning rate of $2.0\times10^{-4}$ and a maximum sequence length of 4096 tokens. The gradient accumulation steps are set to 1, with a warmup ratio of 0.1 to ensure smooth convergence. These hyperparameters are carefully selected to balance training speed with the model’s ability to capture complex linguistic patterns across the large-scale datasets [20].

在此阶段,Shakti-LLM以 $2.0\times10^{-4}$ 的学习率和4096个token的最大序列长度进行训练。梯度累积步数设为1,预热比例为0.1以确保平稳收敛。这些超参数经过精心选择,以平衡训练速度与模型在大规模数据集[20]中捕捉复杂语言模式的能力。
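
The next-token objective described above is plain cross-entropy over the vocabulary; a minimal sketch with toy logits (vocabulary size and values are illustrative):

上述下一个token预测目标即词表上的标准交叉熵;以下用玩具logits做最小示意(词表大小与数值仅为演示):

```python
import math

def next_token_loss(logits_per_pos, target_ids):
    # Average cross-entropy of predicting token t+1 from position t --
    # the standard pretraining objective described above.
    total = 0.0
    for logits, tgt in zip(logits_per_pos, target_ids):
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[tgt]
    return total / len(target_ids)

print(next_token_loss([[2.0, 0.5, -1.0, 0.0]], [0]))
```

With uniform logits over a vocabulary of size V the loss is log V, the usual sanity check at initialization. 当logits在大小为V的词表上均匀时,损失恰为log V,这是初始化时常用的健全性检查。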

4.2 Supervised Fine-Tuning (SFT)

4.2 监督式微调 (Supervised Fine-Tuning, SFT)

After pretraining, Shakti-LLM undergoes Supervised Fine-Tuning (SFT) to adapt to specific, task-oriented datasets. This phase exposes the model to labeled examples from a wide range of applications, improving its ability to handle domain-specific tasks and provide contextually relevant responses.

在预训练之后,Shakti-LLM会进行监督微调 (Supervised Fine-Tuning, SFT) 以适应特定任务导向的数据集。这一阶段让模型接触来自广泛应用的标注示例,提升其处理领域特定任务和提供上下文相关响应的能力。

The fine-tuning process employs a learning rate of $2.0\times10^{-5}$ with a cosine decay learning rate scheduler, which adaptively reduces the learning rate as training progresses. The maximum sequence length remains at 4096 tokens, and the gradient accumulation steps are set to 1. These hyperparameters are optimized to allow for fine-grained adjustments, ensuring that Shakti-LLM excels in tasks requiring domain-specific knowledge while maintaining generalization capabilities.

微调过程采用 $2.0\times10^{-5}$ 的学习率,并搭配余弦衰减学习率调度器,该调度器会随着训练进程自适应降低学习率。最大序列长度保持为4096个token,梯度累积步数设为1。这些超参数经过优化以实现细粒度调整,确保Shakti-LLM在需要领域特定知识的任务中表现出色,同时保持泛化能力。
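
The cosine decay schedule (here combined with the warmup ratio of 0.1 mentioned for pretraining) can be written in a few lines; the total-step count is a toy value, and pairing warmup with the SFT schedule is our assumption:

余弦衰减调度(此处结合预训练中提到的0.1预热比例)只需几行即可实现;总步数为玩具数值,将预热与SFT调度组合是本文的假设:

```python
import math

def cosine_lr(step, total_steps, peak_lr=2.0e-5, warmup_ratio=0.1, min_lr=0.0):
    # Linear warmup to peak_lr, then cosine decay down to min_lr.
    warmup = max(1, int(total_steps * warmup_ratio))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))

for s in (0, 9, 50, 99):
    print(s, cosine_lr(s, 100))
```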

Key datasets such as Ultrachat 200K and Cosmedia V2 are leveraged during SFT to enhance Shakti-LLM’s conversational abilities and its capacity to understand complex domains like healthcare and finance. The fine-tuning stage enables the model to follow specific instructions more effectively and handle real-world user prompts, similar to approaches seen in InstructGPT.

在SFT(监督微调)阶段,关键数据集如Ultrachat 200K和Cosmedia V2被用于增强Shakti-LLM的对话能力及其理解医疗、金融等复杂领域的能力。微调阶段使模型能更有效地遵循特定指令并处理真实场景中的用户提示,其方法类似于InstructGPT所采用的策略。

4.3 Direct Preference Optimization (DPO)

4.3 直接偏好优化 (Direct Preference Optimization, DPO)

In the final stage of training, Direct Preference Optimization (DPO) is employed to align Shakti-LLM’s outputs with human preferences, ensuring that its responses are contextually and ethically aligned. Unlike Reinforcement Learning from Human Feedback (RLHF), which relies on a reward model, DPO simplifies the optimization process by directly learning from ranked human feedback through a log-sigmoid loss function [17].

在训练的最后阶段,采用直接偏好优化 (Direct Preference Optimization, DPO) 使Shakti-LLM的输出与人类偏好对齐,确保其响应在上下文和伦理层面保持一致。与依赖奖励模型的基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 不同,DPO通过log-sigmoid损失函数直接从排序的人类反馈中学习[17],从而简化了优化过程。

The DPO process involves presenting human annotators with multiple model outputs for the same input. These outputs are ranked based on relevance, clarity, and appropriateness, and a preference-based loss function is used to adjust the model’s weights accordingly. This ensures that Shakti-LLM generates outputs that are not only accurate but also aligned with ethical considerations and user expectations [19].

DPO(直接偏好优化)流程需要人类标注员对同一输入下的多个模型输出进行排序,这些输出根据相关性、清晰度和适用性进行评级,随后采用基于偏好的损失函数来相应调整模型权重。这一机制确保Shakti-LLM生成的输出不仅准确,同时符合伦理考量和用户预期 [19]。

For DPO, the model is fine-tuned using a learning rate of $5.0\times10^{-7}$, with a beta coefficient of 0.01 to manage optimization momentum. The maximum prompt length is set to 1024 tokens, and the model uses AdamW as the optimizer, with gradient accumulation steps set to 2. This combination of hyperparameters ensures the model is fine-tuned efficiently while retaining high-quality responses [17].

对于 DPO (Direct Preference Optimization) 方法,模型微调采用 $5.0\times10^{-7}$ 的学习率,并通过 0.01 的 beta 系数控制优化动量。最大提示长度设为 1024 token,优化器选用 AdamW,梯度累积步数设置为 2。该超参数组合在保证响应质量的同时实现了高效微调 [17]。
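
The log-sigmoid loss referenced above is the standard DPO objective (Rafailov et al.): the margin between the policy's and reference model's log-probabilities on the chosen vs. rejected completion, scaled by beta (0.01 per the paper). A per-pair sketch with toy log-probabilities:

上文所说的log-sigmoid损失即标准DPO目标(Rafailov等人):策略模型与参考模型在"被选中"与"被拒绝"回复上对数概率之差的边际,再乘以beta(按论文为0.01)。以下为单个偏好对的示意(对数概率为玩具数值):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.01):
    # -log sigmoid( beta * [(pi_c - ref_c) - (pi_r - ref_r)] )
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference model, the margin is 0
# and the loss sits at log(2); it falls as the chosen completion
# gains probability relative to the rejected one.
print(dpo_loss(-10.0, -12.0, -11.0, -11.5))
```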

4.4 Data Quality and Augmentation

4.4 数据质量与增强

Throughout its training process, Shakti-LLM emphasizes the importance of data quality over sheer volume. While some models rely on synthetic data augmentation to expand training datasets, Shakti-LLM primarily focuses on human-labeled datasets and high-quality real-world data, reducing the introduction of noise into the training process [17]. This approach ensures that the model learns meaningful patterns while maintaining computational efficiency.

在训练过程中,Shakti-LLM强调数据质量的重要性胜过单纯的数据量。虽然某些模型依赖合成数据增强来扩展训练数据集,但Shakti-LLM主要专注于人工标注数据集和高质量的真实世界数据,从而减少训练过程中噪声的引入 [17]。这种方法确保了模型在学习有意义模式的同时保持计算效率。

Shakti-LLM’s reliance on high-quality data, combined with its carefully designed training methodologies, allows the model to excel in real-world use cases, providing contextual relevance and high performance across general and domain-specific applications. By adhering to these rigorous principles in training and fine-tuning, Shakti-LLM achieves a balance between performance and efficiency, making it ideal for deployment on edge devices and in low-resource environments.

Shakti-LLM 对高质量数据的依赖,加上其精心设计的训练方法,使该模型在现实应用场景中表现出色,在通用和特定领域应用中均能提供情境相关的高性能。通过遵循这些严格的训练和微调原则,Shakti-LLM 在性能与效率之间取得了平衡,非常适合在边缘设备和资源有限的环境中部署。

5 Benchmark Comparisons

5 基准对比

To evaluate the performance of Shakti-LLM, we compared it against larger models, such as Mistral 7B [4], Phi-3 Mini-4k [5], and Llama 3 8B [2], using widely recognized NLP benchmarks. These benchmarks assess various tasks, including massive multitask language understanding, commonsense reasoning, and factual knowledge retrieval. Despite Shakti-LLM’s smaller parameter size of 2.5 billion, it achieves competitive results, even outperforming larger models in specific categories.

为了评估Shakti-LLM的性能,我们将其与更大的模型(如Mistral 7B [4]、Phi-3 Mini-4k [5]和Llama 3 8B [2])进行了比较,使用了广泛认可的NLP基准测试。这些基准测试评估了多种任务,包括大规模多任务语言理解、常识推理和事实知识检索。尽管Shakti-LLM的参数规模较小(25亿),但它取得了具有竞争力的结果,甚至在特定类别中表现优于更大的模型。

5.1 Popular Benchmarks and Results

5.1 主流基准测试与结果

Table 2 summarizes the performance of Shakti-LLM compared to other models across key NLP benchmarks:

表2总结了Shakti-LLM与其他模型在关键NLP基准测试中的性能表现:

5.2 Key Observations

5.2 关键观察

Table 2: Benchmark Comparison of Various Models. Bolded values indicate the highest scores, and underlined values indicate the second highest.

类别 基准测试 Shakti-LLM (2.5B) Phi-3 Mini-4k [5] Gemma 7B [24] Mistral 7B [4] Mistral 8x7B [4] Llama 3 8B [2]
大规模多任务语言理解 (MMLU) MMLU (5-shot) 71.7% 68.8% 63.6% 61.7% 70.5% 66.5%
常识推理 BigBenchHard (0-shot) 58.2% 71.7% 59.6% 57.3% 69.7% 51.5%
语言理解 Hellaswag (5-shot) 52.4% 76.7% 49.8% 58.5% 70.4% 71.1%
推理 PIQA (5-shot) 86.2% 84.2% 78.1% 77.7% 86.0% 75.7%
医学知识 MedQA (2-shot) 60.3% 53.8% 49.6% 50.0% 62.2% 60.5%
社会理解 Social QA (5-shot) 79.2% 76.6% 65.5% 74.6% 75.9% 73.9%
真实问答 Truthful QA (10-shot) 68.4% 65.0% 52.1% 53.0% 60.1% 63.1%
事实知识 BoolQ (0-shot) 61.1% 77.6% 66.0% 72.2% 76.6% 80.9%
琐事问答 TriviaQA (5-shot) 58.2% 64.0% 72.3% 75.2% 82.2% 67.7%

表 2: 各模型基准测试对比。加粗数值表示最高分,下划线数值表示次高分。

5.3 Insights and Interpretations

5.3 洞察与解读

The benchmark results show that Shakti-LLM provides competitive performance across a broad range of tasks, particularly in commonsense reasoning and multitask language understanding. Shakti-LLM’s efficient architecture, particularly innovations like Variable Grouped Query Attention (VGQA) and Sliding Window Attention, allows it to handle diverse tasks without the large memory footprint of models like Mistral 7B[4] or Llama 3 8B [2].

基准测试结果表明,Shakti-LLM在广泛任务中展现出竞争力,尤其在常识推理和多任务语言理解方面。其高效架构(特别是可变分组查询注意力(VGQA)和滑动窗口注意力等创新技术)使其能处理多样化任务,且无需Mistral 7B[4]或Llama 3 8B[2]等模型的大内存占用。

However, the model’s relatively modest performance in factual knowledge retrieval (as shown in BoolQ and TriviaQA) highlights a potential area for improvement. Future work could involve further fine-tuning on knowledge-heavy datasets or incorporating a broader range of factual data during pretraining to enhance the model’s ability to recall and generate fact-based knowledge [17].

然而,该模型在事实性知识检索方面的表现相对一般(如BoolQ和TriviaQA所示),凸显了一个潜在的改进方向。未来的工作可能包括在知识密集型数据集上进行进一步微调,或在预训练阶段融入更广泛的事实性数据,以增强模型回忆和生成基于事实的知识的能力 [17]。

5.4 Prompt-Based Comparative Evaluation

5.4 基于提示词 (Prompt) 的对比评估

In this section, we compare Shakti’s performance to Phi-3[5] across several prompts. Table 3 showcases their respective responses to a variety of tasks, such as question answering, creative writing, and travel suggestions.

在本节中,我们将Shakti的性能与Phi-3[5]在多个提示下的表现进行对比。表3展示了它们在不同任务中的响应情况,例如问答、创意写作和旅行建议。

The comparative analysis of Shakti and Phi-3[5] demonstrates Shakti’s ability to maintain contextual accuracy and provide detailed responses across various real-world scenarios, including travel itineraries and multilingual tasks. Shakti’s creative outputs, such as poetry, exhibit linguistic richness and a deeper engagement with the language. Furthermore, Shakti’s fine-tuning for low-resource languages, such as Hindi, enhances its capability to generate culturally relevant and accurate content. These attributes reflect its ability to handle both creative and factual tasks effectively.

Shakti与Phi-3[5]的对比分析表明,Shakti能够在包括旅行行程和多语言任务在内的各种现实场景中保持上下文准确性并提供详细回答。Shakti的创意输出(如诗歌)展现了语言丰富性及更深层次的语言驾驭能力。此外,Shakti针对印地语等低资源语言的微调强化了其生成文化相关且准确内容的能力。这些特性反映出其能有效处理创意性任务与事实性任务。

Table 3: Comparative Responses of Shakti and Phi-3

提示 | Shakti 响应 | Phi-3 响应
1. 用于分类一组或一族相关生物体的词是什么?两个例子分别是来自热带美洲的Clytostoma和来自东亚的Syneilesis。 | 用于分类一组或一族相关生物体的词是"genus"。提到的两个例子是来自热带美洲的Clytostoma和来自东亚的Syneilesis。 | 在给定的例子中,Clytostoma和Syneilesis都是植物的属。Clytostoma是来自热带美洲的木本藤蔓植物的一个属,而Syneilesis是东亚植物的一个属。
2. 写一首难以演唱的短诗。 | "在暮色的丝线中 / 色调 / 千个午夜的汇合重新 / 被遗忘旋律的回声 / 向风的叹息低语秘密 / 月光下的影子扭动和扭曲 / 将沉默占为己有 / 当星星密谋赎罪 / 为了未诉说的时刻" | "在暮色的领域,阴影居住的地方 / 一个无人能讲述的迷宫般的低语 / 思绪纠缠,节奏燃烧 / 扭曲和转折的旋律 / 难以演唱却渴望的诗句。"
3. 在德克萨斯州休斯顿地区,有哪些必须与11岁孩子一起做的活动? | 亮点包括休斯顿太空中心、休斯顿动物园、自然科学博物馆、Kemah木板路和Moody花园。其他选择是Buffalo Bayou公园和休斯顿植物园进行户外活动。 | 推荐包括休斯顿太空中心、休斯顿动物园、Kemah木板路、Buffalo Bayou公园和Discovery Green。还建议参观休斯顿儿童博物馆和休斯顿植物园。
4. 为一篇关于比更大模型更强大的非常小的语言模型的论文写一个标题,每个单词以"P"开头。 | 开创精确预测:证明在微小却强大的处理中的非凡表现 | 开创性表现:证明小巧卓越语言模型的力量
5. 为2026年冬季奥运会期间的3天旅行建议一个行程。 | 第1天:探索米兰,参观米兰大教堂、维托里奥·埃马努埃莱二世拱廊和斯福尔扎城堡。第2天:前往科尔蒂纳丹佩佐,参观奥林匹克博物馆和Mount Faloria。第3天:参加奥运会活动并探索米兰的奥运村。 | 第1天:参观奥林匹克公园和米兰并观看冰球比赛。第2天:前往科尔蒂纳丹佩佐,观看高山滑雪比赛,并参观奥林匹克博物馆。第3天:加尔达湖一日游,欣赏风景,返回科尔蒂纳。
6. 写一首关于水的诗(印地语)。 | (印地语原文在文本提取中损坏,无法还原) | (同左)
7. 分享一个印地语的谜语和答案。 | (印地语原文在文本提取中损坏,无法还原) | (同左)

表 3: Shakti和Phi-3的响应比较

5.5 Model Inference and Performance Efficiency

5.5 模型推理与性能效率

Shakti-LLM’s inference performance was evaluated alongside Phi-3.1-mini-4k [5] across multiple hardware configurations, focusing on generating 512 tokens per prompt (see Table 4). The hardware environments included a VM configured with an AMD EPYC 7R13 processor, 30 GB RAM, an NVIDIA L40s GPU, and 4 cores, as well as an Apple M3 Max with 36 GB RAM.

Shakti-LLM 的推理性能与 Phi-3.1-mini-4k[5] 在多种硬件配置下进行了对比评估,重点关注每个提示生成 512 个 token 的表现 (参见表 4)。硬件环境包括配置 AMD EPYC 7R13 处理器、30 GB 内存、NVIDIA L40s GPU 和 4 核的虚拟机,以及配备 36 GB 内存的 Apple M3 Max。

Table 4: Performance comparison of different quantized language models across various hardware platforms. The table shows model names, quantization types, model sizes, and inference speeds (in tokens per second) on GPU, CPU, and Mac systems.

表 4: 不同量化语言模型在各硬件平台上的性能对比。表格展示了模型名称、量化类型、模型大小以及在 GPU、CPU 和 Mac 系统上的推理速度 (单位: token/秒)

模型名称 量化类型 模型大小 GPU (token/秒) CPU (token/秒) Mac (token/秒)
Shakti Q4_KM Q4_KM 1.5 GB 331.09 18.93 128
Shakti Q5_KM Q5_KM 1.71 GB 305.89 15.90 110
Phi-3.1-mini-4k-instruct Q5_KM [5] Q5_KM 2.82 GB 163.17 8.44 74
Phi-3.1-mini-4k-instruct Q4_KM [5] Q4_KM 2.39 GB 180.4 10.72 88.21

The Shakti Q4_KM model demonstrated higher token generation speeds across all hardware, particularly excelling on GPU and Mac configurations. Despite its smaller size, it outperformed Phi-3.1 models [5], underscoring Shakti’s optimized efficiency for real-time tasks on edge devices.

Shakti Q4 KM模型在所有硬件上都展现出更高的token生成速度,尤其在GPU和Mac配置上表现突出。尽管体积更小,其性能仍超越了Phi-3.1模型 [5],凸显了Shakti在边缘设备实时任务上的优化效率。
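
The speedups implied by the table above are easy to make explicit; the following just divides the reported Q4_KM tokens/sec figures for the two models:

上表所隐含的加速比可以直接算出;以下代码仅用两款模型报告的Q4_KM token/秒数值相除:

```python
# Relative throughput of Shakti Q4_KM vs Phi-3.1-mini-4k Q4_KM,
# taken from the benchmark table above (tokens per second).
shakti = {"gpu": 331.09, "cpu": 18.93, "mac": 128.0}
phi3 = {"gpu": 180.4, "cpu": 10.72, "mac": 88.21}
for hw in shakti:
    print(hw, round(shakti[hw] / phi3[hw], 2))
```

On these numbers Shakti is roughly 1.8x faster on GPU and CPU and about 1.45x faster on the M3 Max. 按这些数值,Shakti在GPU与CPU上约快1.8倍,在M3 Max上约快1.45倍。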

6 Applications and Future Directions

6 应用与未来方向

Shakti-LLM is designed with versatility and scalability in mind, making it suitable for a wide range of real-world applications. Its lightweight architecture, optimized for on-device performance, positions it as an efficient solution in industries requiring low-latency, on-device AI such as healthcare, finance, and customer service.

Shakti-LLM 的设计兼顾多功能性和可扩展性,适用于广泛的现实应用场景。其轻量级架构专为设备端性能优化,使其成为医疗、金融和客服等需要低延迟设备端AI (Artificial Intelligence) 行业的高效解决方案。

6.1 On-Device AI for Mobile and IoT

6.1 移动与物联网设备的端侧AI (On-Device AI)

A key advantage of Shakti-LLM is its ability to operate efficiently on small devices, including smartphones, wearables, and Internet of Things (IoT) devices. Its compact size and innovative attention mechanisms make it ideal for real-time applications where latency and power consumption are critical constraints.

Shakti-LLM 的一个关键优势是能够在智能手机、可穿戴设备和物联网 (IoT) 设备等小型设备上高效运行。其紧凑的尺寸和创新的注意力机制使其非常适合对延迟和功耗有严格要求的实时应用。

6.2 Industry-Specific Use Cases

6.2 行业特定用例

Shakti-LLM’s fine-tuning on vernacular and domain-specific datasets gives it a competitive edge in industries requiring specialized knowledge. In particular, healthcare, finance, and customer service can benefit from its real-time interaction capabilities and ability to deliver accurate, contextually relevant insights.

Shakti-LLM 对方言和领域特定数据集的微调使其在需要专业知识的行业中具有竞争优势。特别是医疗保健、金融和客户服务领域,可以受益于其实时交互能力以及提供准确且上下文相关洞察的能力。

• Healthcare: Shakti-LLM can be fine-tuned for personalized healthcare advice, diagnostic support, and realtime assistance for medical professionals and patients. It can be deployed in low-resource settings where vernacular language support is essential for patient communication.

• 医疗保健:Shakti-LLM 可针对个性化医疗建议、诊断支持以及为医疗专业人员和患者提供实时辅助进行微调。它能够部署在资源匮乏的环境中,这些地区方言支持对患者沟通至关重要。

• Finance: In the financial sector, Shakti-LLM can assist in tasks like document analysis, regulatory compliance, and fraud detection, providing real-time decision-making support for professionals.
• Customer Service: Shakti-LLM’s ability to support multiple languages and adapt to user-specific queries makes it a powerful tool for automating customer interactions, improving customer satisfaction and reducing response times [19].

• 金融: 在金融领域,Shakti-LLM可协助完成文件分析、监管合规和欺诈检测等任务,为专业人士提供实时决策支持。
• 客户服务: Shakti-LLM支持多语言并适应用户特定查询的能力,使其成为自动化客户互动的强大工具,可提升客户满意度并缩短响应时间 [19]。

6.3 Multilingual and Low-Resource Language Support

6.3 多语言与低资源语言支持

A defining feature of Shakti-LLM is its fine-tuning on vernacular languages, addressing the significant need for AI models to operate effectively in low-resource language environments. Large global models often underperform in regional contexts due to insufficient training on low-resource languages such as Hindi, Kannada, and Telugu. Shakti-LLM bridges this gap, positioning itself as an AI solution well-suited for multilingual environments where linguistic diversity is high [18].

Shakti-LLM 的一个显著特征是对本地语言的微调,这满足了 AI 模型在低资源语言环境中高效运行的重大需求。由于对印地语、卡纳达语和泰卢固语等低资源语言的训练不足,全球性大模型在地区性场景中往往表现欠佳。Shakti-LLM 填补了这一空白,成为非常适合语言多样性高的多语言环境的 AI 解决方案 [18]。

6.4 Future Directions

6.4 未来方向

Looking ahead, several key areas for further development could enhance Shakti-LLM’s capabilities, including the integration of multimodal capabilities, improvements in code generation, and deeper fine-tuning for specialized domains.

展望未来,以下几个关键领域的发展可进一步提升Shakti-LLM的能力,包括整合多模态能力、改进代码生成,以及针对专业领域进行更深入的微调。

7 Conclusion

7 结论

In this paper, we presented Shakti-LLM, a highly efficient Small Language Model (SLM) optimized for deployment in resource-constrained environments such as smartphones and IoT systems. Shakti-LLM builds on the foundations of transformer-based architectures such as LLaMA [2], while introducing several key innovations, such as Variable Grouped Query Attention (VGQA) and SwiGLU activations [6]. These innovations ensure high performance while maintaining a minimal computational footprint, making Shakti-LLM ideal for edge-AI applications.

本文介绍了Shakti-LLM,这是一款专为智能手机和物联网系统等资源受限环境优化的高效小语言模型(SLM)。Shakti-LLM基于Transformer架构(如LLaMA [2]),同时引入了多项关键创新,包括可变分组查询注意力(VGQA)和SwiGLU激活函数 [6]。这些创新在保持极低计算开销的同时确保了高性能,使Shakti-LLM成为边缘AI应用的理想选择。

Through Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO), Shakti-LLM adapts to real-world needs and excels in domain-specific tasks across industries such as healthcare, finance, and customer service. Its fine-tuning on vernacular languages allows it to perform exceptionally well in low-resource environments, making it a unique solution for regions with linguistic diversity.

通过持续预训练 (CPT)、监督微调 (SFT) 和直接偏好优化 (DPO),Shakti-LLM 能够适应现实需求,并在医疗、金融和客服等行业特定任务中表现卓越。其对本地语言的微调使其在低资源环境中表现尤为出色,为语言多样性地区提供了独特的解决方案。

As Shakti-LLM continues to evolve, the integration of multimodal capabilities, improvements in code generation, and further fine-tuning for specialized domains will unlock new possibilities, making it an invaluable tool across a wide range of industries. Shakti-LLM represents a step forward in making AI more accessible, efficient, and inclusive, driving real-world impact across global industries and communities.

随着Shakti-LLM的持续发展,多模态能力的整合、代码生成的改进以及针对专业领域的进一步微调将开启新的可能性,使其成为跨行业不可或缺的工具。Shakti-LLM标志着在提升AI的可及性、效率和包容性方面迈出重要一步,为全球产业和社区带来切实影响。

References

参考文献

[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,

