Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows
Abstract. This paper presents a Spark-based modular LangGraph framework, designed to enhance machine learning workflows through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a pivotal innovation that leverages Spark’s distributed computing capabilities and integrates with LangGraph for workflow orchestration.
Agent AI facilitates the automation of data preprocessing, feature engineering, and model evaluation while dynamically interacting with data through Spark SQL and DataFrame agents. Through LangGraph's graph-structured workflows, the agents execute complex tasks, adapt to new inputs, and provide real-time feedback, ensuring seamless decision-making and execution in distributed environments. This system simplifies machine learning processes by allowing users to visually design workflows, which are then converted into Spark-compatible code for high-performance execution.
The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data and enabling advanced data analysis. Experimental evaluations demonstrate significant improvements in process efficiency and scalability, as well as accurate data-driven decision-making in diverse application scenarios.
This paper emphasizes the integration of Spark with intelligent agents and graph-based workflows to redefine the development and execution of machine learning tasks in big data environments, paving the way for scalable and user-friendly AI solutions.
Keywords: Large Language Model · Agent · LangChain · LangGraph · ChatGPT · ERNIE-4 · GLM-4 · Big Data · Machine Learning · Apache Spark · Data Analysis.
1 Introduction
The development of information technology has brought convenience to daily life along with fast-growing volumes of data. With the maturity of big data analysis technology, represented by machine learning, big data has had a tremendous effect on social and economic life and provides substantial help for business decision-making. For example, in the e-commerce industry, Taobao recommends suitable goods to users by analyzing large amounts of transaction data; in the advertising industry, online advertising systems predict users' preferences by tracking their clicks, improving the user experience.
However, traditional relational business data management systems have been unable to cope with the characteristics of big data, including large volume, diversity, and high dimensionality [1]. To address big data analysis, distributed computing has been widely adopted, and Apache Hadoop [2] has been one of the most widely used distributed systems in recent years. Hadoop adopts MapReduce as its computing framework, and its emergence promoted the popularity of large-scale data processing platforms. Spark [3], a big data architecture developed by the AMPLab at the University of California, Berkeley, is also widely used. Spark integrates batch analysis, stream analysis, SQL processing, graph analysis, and machine learning applications. Compared with Hadoop, Spark is fast, flexible, and fault-tolerant, making it an ideal choice for running machine learning analysis programs. However, Spark is a tool for developers: it requires analysts to have certain computer skills and to spend considerable time creating, deploying, and maintaining systems.
The results of machine learning depend heavily on data quality and model logic, so this paper designs and implements a Spark-based, process-oriented machine learning analysis tool that lets analysts concentrate on the process itself rather than spend energy on compiling, running, and parallelizing the analysis program. Formally, each machine learning analysis task is decomposed into different stages and composed from components, which reduces the user's learning cost. Technically, general algorithms are encapsulated into component packages for reuse, and training processes are differentiated by parameter settings, which reduces the time cost of creating machine learning analysis programs. Users can flexibly organize their own analysis processes by dragging and dropping algorithm components, improving the efficiency of application creation and execution.
Spark, as a powerful distributed computing system, has significant advantages in the field of big data processing and analysis. With the rapid development of large model technology, the application of Spark combined with large model agents has gradually become a hot topic in research. Large model agents can more accurately perceive the environment, react and make judgments, and form and execute decisions. Supported by Spark, large model agents can process large-scale datasets, achieving efficient data analysis and decision-making. LangGraph is an agent workflow framework officially launched by LangChain, which defines workflows based on graph structures, supports complex loops and conditional branches, and provides fine-grained agent control. The integration of Spark large model agents with the advanced tools and workflow management of the LangChain and LangGraph frameworks is driving innovative applications in various fields, including large model data analysis.
This paper presents the characteristics of the tool by comparing it with existing products, and then describes in detail the system architecture design, the business model with use cases, and the functional operation of the main system modules. Finally, it provides a technical summary and looks ahead to future research directions for LangGraph-based Spark agents.
2 Brief Introduction of Related Technologies
2.1 AML Machine Learning Service
Azure Machine Learning (AML) [4] is a Web-based machine learning service launched by Microsoft on its public cloud Azure. It contains more than 20 algorithms for classification, regression, and clustering based on supervised and unsupervised learning, and the set is still growing. However, AML is based on Hadoop and can only be used within Azure. Unlike AML, the tool in this paper is designed and implemented on Spark and can therefore be deployed on different virtual machines or cloud environments.
2.2 Responsive Apache Zeppelin Based on Spark
Apache Zeppelin [5] is a Spark-based responsive data analysis system whose goal is to build interactive, visualized, and shareable Web applications that integrate multiple algorithm libraries. It has become an open-source, notebook-based analysis tool supporting a large number of algorithm libraries and languages. However, Zeppelin does not provide a user-friendly graphical interface: every analysis requires users to write scripts to submit and run, which raises the programming skill required of users. The tool presented in this paper provides graphical components and a large number of machine learning algorithms, letting users simply and quickly define a machine learning process and run it to obtain results.
2.3 Haflow Big Data Analysis Service Platform
Reference [6] introduces Haflow, a big data analysis service platform. The system adopts a component-based design that lets users build analysis processes by drag and drop, and it exposes an extension interface through which developers can create custom analysis algorithm components. At present, Haflow only supports MapReduce algorithm components on the Hadoop platform; the tool described in this paper builds on Haflow so that it supports Spark component applications and provides a large number of machine learning algorithms that run in the Spark environment.
2.4 Large Language Models Usher in a New Era of Data Management and Analysis
With the rapid advancement of artificial intelligence technology, large model technology has become a hot topic in the AI field. In the realm of data management and analysis, the emergence of large language models (LLMs) such as GPT-4o, Llama 3.2, ERNIE-4, and GLM-4 has initiated a new era filled with challenges. These large models possess powerful semantic understanding and efficient dialogue management capabilities, which have a profound impact on data ingestion, data lakes, and data warehouses. Natural language understanding based on large language models simplifies data management tasks and enables advanced analysis of big data, promoting innovation and efficiency in the big data field.
2.5 LangChain and LangGraph Frameworks
LangChain is an application development framework designed for creating applications based on Large Language Models (LLMs). It is compatible with a variety of language models and integrates seamlessly with OpenAI ChatGPT. The LangChain framework offers tools and agents that "chain" different components together to create more advanced LLM use cases, including Prompt templates, LLMs, Agents, and Memory, which help users work efficiently and generate meaningful responses. A minimal example follows.
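As an illustration of this "chaining" idea, here is a minimal sketch that pipes a prompt template into a chat model. The model name, prompt text, and the assumption of an OpenAI API key in the environment are ours, not the paper's.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Summarize the following analysis request in one sentence: {request}"
)
llm = ChatOpenAI(model="gpt-4o")  # assumes an OpenAI API key in the environment

chain = prompt | llm  # "chain" the prompt template into the model
print(chain.invoke({"request": "cluster customers by purchase history"}).content)
```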
LangGraph is a framework within the LangChain ecosystem that allows developers to define cyclic edges and nodes within a graph structure and provides state management capabilities. Each node in the graph makes decisions and performs actions based on the current state, enabling developers to design agents with complex control flows, including single agents, multi-agents, hierarchical structures, and sequential control flows, which can robustly handle complex scenarios in the real world.
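To make the graph abstraction concrete, the following is a minimal, self-contained sketch using LangGraph's StateGraph: a plan node and an act node connected by a cyclic conditional edge over an explicit state. The state fields and node logic are illustrative stand-ins, not the paper's implementation.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    attempts: int
    answer: str

def plan(state: AgentState) -> AgentState:
    # Decide what to do next based on the current state.
    return {**state, "attempts": state["attempts"] + 1}

def act(state: AgentState) -> AgentState:
    # Perform the planned action and record the result in the state.
    return {**state, "answer": f"draft #{state['attempts']}"}

def should_continue(state: AgentState) -> str:
    # Cyclic conditional edge: loop back to planning or finish.
    return "plan" if state["attempts"] < 3 else END

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_edge("plan", "act")
graph.add_conditional_edges("act", should_continue)
graph.set_entry_point("plan")
app = graph.compile()
print(app.invoke({"question": "q", "attempts": 0, "answer": ""}))
```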
The combination of Spark with LangChain and LangGraph allows data enthusiasts from various backgrounds to effectively engage in data-driven tasks. Through Spark SQL agents and Spark DataFrame agents, users are enabled to interact with, explore, and analyze data using natural language.
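As a sketch of such natural-language interaction, the snippet below builds a Spark DataFrame agent from LangChain's experimental toolkit; the file path, model choice, and question are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_spark_dataframe_agent

spark = SparkSession.builder.appName("nl-analysis").getOrCreate()
df = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)

# The agent turns a natural-language question into DataFrame operations.
agent = create_spark_dataframe_agent(llm=ChatOpenAI(model="gpt-4o"), df=df, verbose=True)
agent.invoke("How many rows are there, and what is the average transaction amount?")
```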
3 Process-oriented Machine Learning Analysis Tool Based on Spark
3.1 Machine Learning Process
This paper aims to design a process-oriented machine learning tool for data analysts, so it must implement the functions of common machine learning processes. Machine learning can be divided into supervised and unsupervised learning, depending mainly on whether specific labels are present. A label is the prediction target of the observation data; observation data are the samples used to train and test machine learning models; a feature is an attribute of the observation data. Machine learning algorithms learn prediction rules mainly from the features of the observation data.
In practice, a machine learning process is composed of a series of stages, including data preprocessing, feature processing, model fitting, and result verification or prediction. For example, classifying a group of text documents involves word segmentation, cleaning, feature extraction, training a classification model, and outputting the classification results [7].
Fig. 1. Typical machine learning process.
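The staged process in Fig. 1 maps naturally onto a Spark MLlib Pipeline. The sketch below strings tokenization, TF-IDF feature extraction, and a classifier together; the toy data and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [("spark makes big data simple", 1.0), ("cats sleep all day", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),   # word segmentation
    HashingTF(inputCol="words", outputCol="tf"),     # term-frequency features
    IDF(inputCol="tf", outputCol="features"),        # TF-IDF weighting
    LogisticRegression(maxIter=10),                  # model fitting stage
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()
```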
These stages can be seen as black-box processes and packaged into components. Although many algorithm libraries and software packages provide programs for each stage, these programs are seldom prepared for large-scale data sets or distributed environments, and they do not naturally support process-oriented composition, requiring developers to connect the stages into a complete process.
Therefore, while providing a large number of machine learning algorithm components, the system also needs to execute processes automatically while taking their operational efficiency into account.
3.2 Design of Business Modules
The system provides components to users as its main business functions. Analysts can freely combine the existing components into different analysis processes. To cover commonly used machine learning processes, the system provides the following categories of business modules: the input/output module, the data preprocessing module, the feature processing module, the model fitting module, and the result prediction module. Unlike other systems, this tool defines its business modules around the stages of the process.
- Input and output module. This module realizes data acquisition and writing and deals with heterogeneous data sources; it is the starting point and endpoint of the machine learning process. To handle different types of data, the system provides input and output functions for structured data (such as CSV), unstructured data (such as TXT), and semi-structured data (such as HTML); see the reader sketch after this list.
- Data preprocessing module. This module includes data cleaning, filtering, join/fork operations, type changes, and so on. Data quality determines the upper limit of a machine learning model's accuracy, so data preprocessing is necessary before feature extraction. This module can clean up null or abnormal values, change data types, and filter out unqualified data.
- Feature processing module. Feature processing is the most important step before modeling data, including feature selection and feature extraction. The system currently contains 25 commonly used feature processing algorithms.
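As referenced in the input/output item above, a minimal sketch of reading and writing heterogeneous sources with Spark follows; the paths are placeholders, and since Spark has no native HTML reader, semi-structured HTML would need a custom parser.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

structured = spark.read.csv("/data/input.csv", header=True, inferSchema=True)  # structured input
unstructured = spark.read.text("/data/corpus.txt")                             # unstructured input
structured.write.mode("overwrite").parquet("/data/output.parquet")             # output endpoint
```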
3.3 Feature Extraction and Classification
Feature selection chooses the most valuable features from a multi-dimensional feature space by means of an algorithm; the selected features are a subset of the original features. According to the algorithm used, selectors can be divided into the information gain selector, the chi-square selector, and the Gini coefficient selector.
Feature extraction transforms the features of observed data into new variables according to a given algorithm. Compared with data preprocessing, its processing rules are more complex. The extracted features are mappings of the original features and fall into the following categories:
- Standardization components. Standardization is an algorithm that maps the numerical features of data onto a unified scale. Standardized features share the same frame of reference, which makes the trained model more accurate and speeds up convergence during training. Different standardization components map with different statistics, such as the Normalizer, StandardScaler, and MinMaxScaler components (see the sketch after this list).
- Text processing components. Text features must be mapped to new numeric variables because they cannot be computed directly. Common algorithms include the TF-IDF component, which indexes text after word segmentation; the Tokenizer component for word segmentation; the OneHotEncoder component for one-hot encoding; and so on.
- Dimension-reducing components. These components compress the original feature information with a given algorithm and express it with fewer features, such as the PCA component for principal component analysis.
- Custom UDF components. Users can supply SQL user-defined functions for custom feature processing.
- Model fitting module. Model training applies a given algorithm to learn from data, and the resulting model can be used for subsequent data prediction. The system currently provides a large number of supervised learning model components, which are divided into classification models and regression models according to the nature of the observation data labels.
- Result prediction module. This module includes two functions: result prediction and verification.
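As referenced in the standardization item above, here is a sketch of the standardization and dimension-reduction components expressed with Spark MLlib; the toy values are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler, MinMaxScaler, PCA

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 10.0, 100.0), (2.0, 20.0, 200.0), (3.0, 35.0, 310.0)],
    ["x1", "x2", "x3"],
)
assembled = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features").transform(df)

# StandardScaler component: zero mean and unit variance per feature.
scaled = StandardScaler(inputCol="features", outputCol="std", withMean=True).fit(assembled).transform(assembled)

# MinMaxScaler component: rescale each feature to [0, 1].
ranged = MinMaxScaler(inputCol="features", outputCol="mm").fit(assembled).transform(assembled)

# PCA component: compress three features into two principal components.
reduced = PCA(k=2, inputCol="features", outputCol="pca").fit(assembled).transform(assembled)
reduced.select("pca").show(truncate=False)
```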
Through the above general business modules, users can create a variety of common machine learning analysis processes in the system environment.
3.4 System Architecture
The system provides a user interface through the Web, and the overall architecture is based on the MVC framework. It provides both the business modules for machine learning and the execution modules for processes. The system architecture is shown in Figure 2.
Fig. 2. System architecture diagram and flow chart.
Users create formal machine learning processes through the Web interface provided by the system and submit them. The system converts the received raw processes into logical flow charts and verifies the validity of each flow chart. Validating a process is a necessary step before actual execution: when a process contains obvious errors such as logical inconsistencies or data mismatches, the system can return the error immediately rather than waiting for the corresponding component to fail during execution, which improves the efficiency of the system.
The execution engine is the key module of the system; it implements multi-user, multi-task process execution. It translates the validated logical flow chart into a corresponding execution model, a data structure that the system can recognize and use to schedule the corresponding business components. Translating the execution model is a complex process, which is introduced in detail in Section 4.3.
3.5 Architecture Design of Spark Agent Based on LangGraph
As shown in Figure 3, the architecture of the agent implementation that combines Apache Spark with LangChain and LangGraph is designed to enhance the level of intelligence in data processing and decision-making. This architectural diagram displays a Spark-based agent system capable of accomplishing complex tasks by integrating various technological components.
Fig. 3. Architectural Design of Spark Agent Based on LangGraph.
LangChain and LangGraph frameworks are introduced on top of Spark. LangChain is a framework for building and deploying large language models, enabling agents to interact with users or other systems through natural language. LangGraph provides a graph-structured workflow, allowing agents to plan and execute tasks in a more flexible manner. The core of the agent is the LLM (Large Language Model), which is responsible for understanding and generating natural language. The LLM receives instructions (Instruction), observations (Observation), and feedback (Feedback) from LangChain and integrates these with data insights from Spark to form thoughts (Thought). These thoughts are then translated into specific action plans (Plan and Action), which are executed by the action executor (Action Executor).
The action executor is tightly integrated with Spark, processing structured data through Spark DataFrame Agent and Spark SQL Agent, performing complex data analysis and data processing tasks, and feeding the results back to the LLM. The LLM adjusts its action plans based on this feedback and new observations, forming a closed-loop learning and optimization process that further refines its performance and decision-making quality.
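To ground the executor side, a hedged sketch of a Spark SQL agent built from LangChain's community toolkit is shown below; the schema name, model choice, and question are illustrative assumptions, and the imports reflect our reading of the toolkit's layout.

```python
from pyspark.sql import SparkSession
from langchain_openai import ChatOpenAI
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
llm = ChatOpenAI(model="gpt-4o")

# Wrap the Spark SQL catalog (hypothetical "sales" schema) for the agent.
toolkit = SparkSQLToolkit(db=SparkSQL(schema="sales"), llm=llm)
agent = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

# The agent plans SQL, executes it on Spark, observes results, and iterates.
agent.invoke("Which product category had the highest revenue last month?")
```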
4 Research on System Implementation and Key Technologies
4.1 Storage of Intermediate Data
DataFrame-based Unified Storage Structure Throughout the machine learning process, data is in a state of flow, and components with sequential dependencies need to transfer intermediate data. To avoid heterogeneity among intermediate data, the system stipulates that components communicate with each other through a unified storage structure based on DataFrame [8], a Spark-supported distributed data set that serves as the main data set. Conceptually, it is similar to the "table" of a relational database, but Spark heavily optimizes the execution of operations on it. In this way, the relational structure of the data is retained, special data attributes are defined, and the features and label required by the model fitting stage are designated in the data header, facilitating the validation and execution of the process.
This unified storage structure can be quickly persisted to the intermediate data storage layer by the system and quickly restored to the required data objects when later components use it.
Alluxio Virtual Distributed Storage System Intermediate data requires different management at different stages of its life cycle. While a component processes the preceding data, that is, during the generation phase of intermediate data, the system records where the intermediate data is generated and passes that location to the next component. After the process finishes executing, all intermediate data generated by the process is no longer needed and is deleted by the system. At the same time, the intermediate data storage space of a single process has a specified upper limit: when too much intermediate data is generated, the process's resource manager uses the Least Recently Used (LRU) algorithm [9] to clear data, preventing the memory overflow that excessive intermediate data would cause (see the sketch below).
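A toy sketch of LRU-based cleanup for intermediate results follows; the dict-of-paths store is our illustration, while the real resource manager operates on persisted DataFrames.

```python
from collections import OrderedDict

class IntermediateStore:
    """Toy LRU store mapping intermediate-data keys to storage locations."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entries come first

    def put(self, key: str, path: str) -> None:
        self.items[key] = path
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            evicted, _ = self.items.popitem(last=False)  # drop the LRU entry
            print(f"evicting intermediate data: {evicted}")

    def get(self, key: str) -> str:
        self.items.move_to_end(key)  # mark as recently used
        return self.items[key]
```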
To ensure the I/O efficiency of intermediate data, Alluxio [10] is used as the intermediate storage layer, keeping all intermediate data in memory. Alluxio is a memory-centric virtual distributed storage system that greatly accelerates data reading and writing.
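A sketch of persisting an intermediate DataFrame to Alluxio and restoring it in a downstream component is shown below; the master address, port, and path are placeholders, and the Alluxio client library is assumed to be on Spark's classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).toDF("id")

# Upstream component: persist intermediate data to the in-memory store.
df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/flows/job42/stage1")

# Downstream component: restore the same DataFrame-based structure.
restored = spark.read.parquet("alluxio://alluxio-master:19998/flows/job42/stage1")
```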
4.2 Implementation Method of Business Components of Machine Learning
Machine Learning Analysis Components Based on Spark MLlib Section 3.2 described the design of the machine learning modules in detail. These modules complete the main data processing and modeling functions in the form of components. To provide as many algorithm components as quickly as possible, only a small portion of the processing components are hand-programmed around the characteristics of the machine learning process, such as the input/output and data cleaning components; the majority of component functions are automatically converted into Spark jobs using Spark MLlib [11], Spark's own machine learning algorithm library, which contains a large number of classification, regression, clustering, dimensionality reduction, and other algorithms. For example, to classify with a Random Forest, the system's execution engine instantiates a RandomForestClassifier object with the corresponding parameters according to the node information of the process, calls its fit method on the input data to produce the corresponding Model object, and then serializes and saves the model through the intermediate data management module for subsequent prediction or verification components to use (see the sketch below).
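A sketch of what the execution engine does for such a Random Forest node follows: instantiate the classifier with the node's parameters, fit it, and persist the model for later components. The toy data, parameter values, and save path are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
raw = spark.createDataFrame(
    [(0.0, 1.2, 0.0), (1.0, 3.4, 1.0), (0.5, 2.2, 0.0)],
    ["f1", "f2", "label"],
)
train_df = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(raw)

params = {"numTrees": 50, "maxDepth": 8}  # would be read from the node's settings
model = RandomForestClassifier(labelCol="label", **params).fit(train_df)

# Serialize the fitted model for subsequent prediction or verification components.
model.write().overwrite().save("/models/job42/rf")
```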
Sharing the Spark Context Across Components There are two ways to run components in a process. One is to invoke each component as an independent Spark program, starting a Spark context for each run. When a Spark program starts, it creates a context environment that determines resource allocation, such as how many threads and how much memory to use, and then schedules tasks accordingly. A typical machine learning process consists of many components, so starting and switching contexts would consume considerable running time. The other way is for all components of a process to share the same context, so that the whole process can be regarded as one large Spark program; the system's execution engine must then create and manage a context for each process and release the context object to recover resources when the process ends (a sketch follows).
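A minimal sketch of the second approach, context sharing: one SparkSession is created per process and passed to every component. Here pipeline_components is a hypothetical ordered list standing in for the engine's schedule.

```python
from pyspark.sql import SparkSession

def run_component(component, session):
    # Each component uses the shared session instead of creating its own.
    return component.execute(session)

# Created once per process; determines threads, memory, and task scheduling.
shared = SparkSession.builder.appName("flow-42").getOrCreate()
try:
    for component in pipeline_components:  # hypothetical ordered component list
        run_component(component, shared)
finally:
    shared.stop()  # release the context and recover resources at process end
```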
To achieve context sharing, each component inherits from SparkJobLife or one of its subclasses and implements the methods createInstance and execute. Figure 4 shows the design and inheritance diagram of the component classes, in which Transformers, Models, and Predictors are respectively the parents of the data cleaning and preprocessing, learning and training, and validation and prediction components.
Fig. 4. Component class design and inheritance diagram.
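A hedged sketch of the hierarchy in Fig. 4 follows; the class and method names (SparkJobLife, createInstance, execute, Transformers, Models, Predictors) come from the paper, while the abstract-base-class framing is our illustration.

```python
from abc import ABC, abstractmethod

class SparkJobLife(ABC):
    @abstractmethod
    def createInstance(self, params: dict):
        """Build the underlying Spark object from the node's parameters."""

    @abstractmethod
    def execute(self, session, inputs):
        """Run this component inside the shared Spark context."""

class Transformers(SparkJobLife, ABC):
    """Parent class of data cleaning and data preprocessing components."""

class Models(SparkJobLife, ABC):
    """Parent class of learning and training components."""

class Predictors(SparkJobLife, ABC):
    """Parent class of validation and prediction components."""
```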
4.3 Logical Analysis Flow of Machine Learning
Process Creation After the user has designed and submitted the machine learning analysis process through the graphical interface, the system begins to create the logical analysis process. First, the system performs a topological analysis of the original process and generates the logical flow chart (a sketch of this step follows).
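As an illustration of the topological analysis step, here is a sketch of Kahn's algorithm ordering submitted nodes into an executable logical flow and rejecting cyclic processes; the node names and edges are illustrative.

```python
from collections import deque

def topological_order(nodes, edges):
    # Count incoming edges for every node.
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        indegree[dst] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        # Releasing n may make its successors ready to run.
        for src, dst in edges:
            if src == n:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)

    if len(order) != len(nodes):
        raise ValueError("cycle detected: process is not a valid flow")
    return order

print(topological_order(
    ["input", "clean", "tfidf", "fit"],
    [("input", "clean"), ("clean", "tfidf"), ("tfidf", "fit")],
))
```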