[论文翻译]利用大语言模型高效表示企业Web应用程序结构以服务于智能质量工程


原文地址:https://arxiv.org/pdf/2501.06837


AN EFFICIENT APPROACH TO REPRESENT ENTERPRISE WEB APPLICATION STRUCTURE USING LARGE LANGUAGE MODEL IN THE SERVICE OF INTELLIGENT QUALITY ENGINEERING

利用大语言模型高效表示企业Web应用程序结构以服务于智能质量工程

ABSTRACT

摘要

This paper presents a novel approach to represent enterprise web application structures using Large Language Models (LLMs) to enable intelligent quality engineering at scale. We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs while preserving the complex relationships and interactions within web applications. The approach encompasses five key phases: comprehensive DOM analysis, multi-page synthesis, test suite generation, execution, and result analysis. Our methodology addresses existing challenges around the usage of Generative AI techniques in automated software testing by developing a structured format that enables LLMs to understand web application architecture through in-context learning. We evaluated our approach using two distinct web applications: an e-commerce platform (Swag Labs) and a healthcare application (MediBox), which is deployed within the Atalgo engineering environment. The results demonstrate success rates of $90\%$ and $70\%$, respectively, in achieving automated testing, with high relevance scores for test cases across multiple evaluation criteria. The findings suggest that our representation approach significantly enhances LLMs' ability to generate contextually relevant test cases and provide better quality assurance overall, while reducing the time and effort required for testing.

本文提出了一种利用大语言模型 (LLMs) 表示企业 Web 应用程序结构的新方法,以实现大规模智能质量工程。我们引入了一种分层表示方法,该方法优化了 LLMs 的少样本学习能力,同时保留了 Web 应用程序中的复杂关系和交互。该方法包括五个关键阶段:全面的 DOM 分析、多页面合成、测试套件生成、执行和结果分析。我们的方法通过开发一种结构化格式,使 LLMs 能够通过上下文学习理解 Web 应用程序架构,从而解决了在自动化软件测试中使用生成式 AI 技术的现有挑战。我们使用两个不同的 Web 应用程序评估了我们的方法:一个电子商务平台 (Swag Labs) 和一个部署在 Atalgo 工程环境中的医疗保健应用程序 (MediBox)。结果表明,在实现自动化测试方面,成功率分别为 $90\%$ 和 $70\%$,并且在多个评估标准中测试用例的相关性得分较高。研究结果表明,我们的表示方法显著增强了 LLMs 生成上下文相关测试用例的能力,并提供了更好的整体质量保证,同时减少了测试所需的时间和精力。

Keywords Large Language Model (LLM) $\cdot$ In-Context Learning $\cdot$ Document Object Model (DOM) $\cdot$ Generative AI $\cdot$ Hierarchical Representation $\cdot$ Enterprise Test Automation $\cdot$ Intelligent Quality Engineering

关键词 大语言模型 (LLM) $\cdot$ 上下文学习 $\cdot$ 文档对象模型 (DOM) $\cdot$ 生成式 AI (Generative AI) $\cdot$ 层次表示 $\cdot$ 企业测试自动化 $\cdot$ 智能质量工程

1 Introduction

1 引言

Enterprise web applications constitute the fundamental infrastructure that orchestrates intricate organizational processes and facilitates multiple user interactions fulfilling these processes [1]. These applications have transcended their traditional role as mere digital interfaces to become critical determinants of operational excellence and market competitiveness for an enterprise. The exponential growth in application complexity, coupled with heightened user expectations, has elevated quality engineering (QE) to a position of paramount significance in the software development lifecycle. This emerging paradigm encompasses a comprehensive framework of methodologies and practices designed to ensure robust functionality, seamless scalability, and sustainable maintainability of enterprise web applications. Within this context, the accurate representation and analysis of architectural intricacies in enterprise web applications have emerged as crucial elements in achieving holistic quality assurance objectives. When it comes to a formal computational model of quality engineering processes, a precise and formal structural representation of an enterprise application becomes critical. There is a strong correlation between accurate architectural comprehension and an autonomous system's ability to provide an acceptable level of quality assurance [2]. The evolution of enterprise web applications from monolithic structures to complex, distributed systems has created an imperative for more sophisticated methods of automated structural analysis before applying quality assurance processes. Software testing represents a critical aspect of the software engineering lifecycle, serving as a determinant of system reliability, functional integrity, and the overall quality of the software that will eventually support key processes of a business at scale.
As contemporary software systems grow increasingly sophisticated, the imperative for comprehensive testing methodologies becomes progressively more pronounced. Beyond the conventional paradigm of defect identification and vulnerability remediation, robust testing frameworks validate system compliance across diverse operational environments while ensuring adherence to specified requirements. This multifaceted approach serves as a crucial safeguard against system failures, security breaches, and compromised user experiences. The evolution of AI-augmented intelligent quality engineering practices transcends mere defect detection, managing the entire lifecycle independently and autonomously [3]. The recent applications of AI technologies have created significant interest from the research as well as the business community to explore how a formal and computational model of quality engineering can provide more efficient quality assurance [4]. Recent AI algorithms and technology stacks are showing substantial promise for the testing ecosystem through automation capabilities, adaptive learning mechanisms, and predictive analytics. This synergy between AI and testing methodologies has a significant potential impact on traditional approaches by optimizing test procedures, minimizing manual intervention, and enabling intelligent test case generation coupled with automated anomaly detection systems. Natural Language Processing (NLP), a specialized domain within AI, has initiated a transformation in software testing practices by introducing sophisticated linguistic comprehension capabilities to the testing environment [5]. Generative AI based methodologies enable the formal interpretation of requirements written in natural language or a semi-formal format [6]. This facilitates context-aware testing strategies, which are a paradigm shift from traditional testing approaches.
By incorporating semantic understanding into the testing framework, the new approach enables the intelligent design of test cases based on specifications and user feedback analysis autonomously. This not only expedites the identification of ambiguities and inconsistencies but also frees up the time of the software development team to focus on other areas [7].

企业 Web 应用程序构成了协调复杂组织流程并促进多个用户交互以满足这些流程的基础设施 [1]。这些应用程序已经超越了其作为数字接口的传统角色,成为企业运营卓越性和市场竞争力的关键决定因素。应用程序复杂性的指数级增长,加上用户期望的提高,使质量工程 (QE) 在软件开发生命周期中占据了至关重要的地位。这一新兴范式包含了一套全面的方法论和实践框架,旨在确保企业 Web 应用程序的稳健功能、无缝可扩展性和可持续的可维护性。在此背景下,企业 Web 应用程序中架构复杂性的准确表示和分析已成为实现全面质量保证目标的关键要素。在质量工程过程的正式计算模型中,企业应用程序的精确和正式的结构表示变得至关重要。准确的架构理解与自主系统提供可接受质量保证水平的能力之间存在强相关性 [2]。企业 Web 应用程序从单体结构向复杂分布式系统的演变,使得在应用质量保证过程之前,需要更复杂的自动化结构分析方法。软件测试代表了软件工程生命周期中的一个关键方面,作为系统可靠性、功能完整性和最终支持大规模业务关键流程的软件整体质量的决定因素。随着当代软件系统变得越来越复杂,全面测试方法的需求变得越来越明显。除了传统的缺陷识别和漏洞修复范式之外,稳健的测试框架还验证了系统在不同操作环境中的合规性,同时确保遵守指定的要求。这种多方面的方法作为防止系统故障、安全漏洞和用户体验受损的关键保障。AI 增强的智能质量工程实践的演变超越了单纯的缺陷检测,独立自主地管理整个生命周期 [3]。最近 AI 技术的应用引起了研究和商业界的极大兴趣,探索质量工程的正式计算模型如何提供更高效的质量保证 [4]。最近的 AI 算法和技术栈通过自动化能力、自适应学习机制和预测分析,为测试生态系统展示了巨大的前景。AI 与测试方法之间的协同作用对传统方法具有显著的潜在影响,通过优化测试程序、最小化手动干预以及实现智能测试用例生成与自动异常检测系统的结合。自然语言处理 (NLP) 作为 AI 中的一个专门领域,通过向测试环境引入复杂的语言理解能力,开启了软件测试实践的变革 [5]。基于生成式 AI 的方法使得能够正式解释以自然语言或半正式格式编写的需求 [6]。这促进了上下文感知的测试策略,这是对传统测试方法的范式转变。通过将语义理解纳入测试框架,新方法能够基于规范和用户反馈分析自主设计智能测试用例。这不仅加速了歧义和不一致性的识别,还释放了软件开发团队的时间,使其能够专注于其他领域 [7]。

Nevertheless, through our research work as part of developing a computational intelligence-based quality engineering platform, we found that the task of effectively capturing and representing the intricate architectural patterns and interaction paradigms within enterprise web applications presents formidable challenges. The heterogeneous nature of development frameworks, technological stacks, and architectural design patterns introduces substantial complexity in modeling application structures with the level of precision we need for such systems to work autonomously. Contemporary methodologies frequently demonstrate limitations in providing dynamic and adaptive representations, thus constraining their efficacy in addressing the sophisticated requirements of modern enterprise systems as system changes occur periodically. This introduces growing challenges in the quality assurance area, as existing testing scripts might break and provide false results that might lead to false assumptions and failure to detect crucial bugs in software systems as they get updated periodically [8]. NLP-based solutions for test automation tried to solve this problem with self-healing. Self-healing approaches were effective for small changes in a software system but failed when massive changes were introduced [9]. Also, NLP-based solutions helped only with test script maintenance but not with initial test script generation or test result reporting. The reason behind this was the lack of overall system understanding in NLP-based systems [10]. The emergence of Large Language Models (LLMs), particularly the Generative Pre-trained Transformer (GPT) architecture, has presented us with unprecedented opportunities within the software engineering domain [11]. These models exhibit exceptional prowess in natural language processing and demonstrate remarkable capabilities in generating contextually pertinent and semantically enriched content.
Such capabilities can be strategically leveraged to autonomously generate comprehensive representations of web application structures through the systematic interpretation of user interactions, technical documentation, and source code artifacts. The integration of LLMs enables the creation of intelligent, adaptive representations that can serve as a robust foundation for quality engineering initiatives, encompassing automated testing frameworks, anomaly detection mechanisms, and performance optimization strategies [12]. This research presents a novel methodological framework for representing enterprise web application structures through the strategic deployment of LLMs. Our proposed approach transcends traditional static modeling by not only capturing fundamental elements such as page hierarchies and navigation patterns but also dynamically modeling complex user interactions and system behaviors [13]. The integration of LLM-driven representation mechanisms into established quality engineering workflows offers organizations substantial opportunities for operational efficiency enhancement and quality assurance optimization.

然而,作为我们开发基于计算智能的质量工程平台的研究工作的一部分,我们发现,有效捕捉和表示企业Web应用程序中复杂的架构模式和交互范式任务面临着巨大的挑战。开发框架、技术栈和架构设计模式的异质性在建模应用结构时引入了显著的复杂性,而我们需要这些系统能够自主工作。当代方法在提供动态和自适应表示方面经常表现出局限性,从而限制了它们在满足现代企业系统复杂需求方面的效能,尤其是在系统周期性变化时。这给质量保证领域带来了日益增长的挑战,因为现有的测试脚本可能会失效并提供错误的结果,这可能导致错误的假设和未能检测到软件系统中的关键错误,尤其是在系统周期性更新时 [8]。基于自然语言处理(NLP)的测试自动化解决方案试图通过自愈来解决这个问题。自愈方法在软件系统发生小变化时有效,但在软件系统发生大规模变化时失效 [9]。此外,基于NLP的解决方案仅在测试脚本维护方面有所帮助,而在初始测试脚本生成或测试结果报告方面并无帮助。其背后的原因是基于NLP的系统缺乏对整体系统的理解 [10]。大语言模型(LLMs)的出现,特别是生成式预训练Transformer(GPT)架构,为我们在软件工程领域提供了前所未有的机会 [11]。这些模型在自然语言处理方面表现出卓越的能力,并在生成上下文相关且语义丰富的内容方面展示了显著的能力。这些能力可以战略性地用于通过系统解释用户交互、技术文档和源代码工件来自主生成Web应用程序结构的全面表示。LLMs的集成使得创建智能、自适应的表示成为可能,这些表示可以作为质量工程计划的坚实基础,涵盖自动化测试框架、异常检测机制和性能优化策略 [12]。本研究提出了一种通过战略部署LLMs来表示企业Web应用程序结构的新方法框架。我们提出的方法超越了传统的静态建模,不仅捕捉了页面层次结构和导航模式等基本元素,还动态建模了复杂的用户交互和系统行为 [13]。将LLM驱动的表示机制集成到现有的质量工程工作流程中,为组织提供了增强运营效率和优化质量保证的实质性机会。

This paper presents an efficient approach to leveraging Large Language Models for representing enterprise web application structures, specifically focusing on their application in quality engineering [13]. Our approach addresses several key challenges identified in current literature: the need for dynamic representation of application structure, the ability to capture complex relationships between components, the hierarchical flow of navigation elements, and the integration of this representation into existing quality engineering processes. By utilizing LLMs' natural language understanding capabilities, we propose a method that bridges the gap between human understanding and machine-processable representation of web application architectures. Thus, in this research, we have showcased all aspects of a software testing process for a web application: test case generation [14], test script maintenance, and reporting of the test results. The significance of this research lies in its potential to enhance quality engineering practices by effectively employing generative AI based computational intelligence through improved understanding and representation of enterprise web applications. Our approach not only addresses the limitations of current methods but also provides a foundation for more intelligent and adaptive quality engineering processes in the context of modern web development.

本文提出了一种利用大语言模型(Large Language Model, LLM)表示企业Web应用结构的有效方法,特别关注其在质量工程中的应用 [13]。我们的方法解决了当前文献中提出的几个关键挑战:应用结构的动态表示需求、捕捉组件间复杂关系的能力、导航元素的层次流以及将该表示集成到现有质量工程流程中。通过利用LLM的自然语言理解能力,我们提出了一种弥合人类理解与机器可处理Web应用架构表示之间差距的方法。因此,在本研究中,我们展示了Web应用软件测试过程的所有方面:测试用例生成 [14]、测试脚本维护和测试结果报告。本研究的意义在于,通过基于生成式AI(Generative AI)的计算智能,有效提升对企业Web应用的理解和表示,从而增强质量工程实践。我们的方法不仅解决了当前方法的局限性,还为现代Web开发背景下更智能和自适应的质量工程流程奠定了基础。

2 Background and Related Work

2 背景与相关工作

Recent years have witnessed growing research interest in converting natural language requirements automatically into functional test scripts, driven by the increasing adoption of agile development and continuous integration practices. This review examines key developments in the field, with a particular focus on systematic analyses, approaches, and automation tools that enable requirements to be transformed into executable test scripts. The systematic literature review by Mustafa et al. [15] represents seminal work in this domain. Their analysis presents a structured overview of different automated test generation approaches derived from requirements specifications. A key finding emphasizes how test generation approaches must be carefully matched to handle the inherent characteristics of requirements, including their potential ambiguity and incompleteness. Building on this foundation, some researchers [16] developed an extensive classification system for requirement-based test automation techniques, while also identifying critical research gaps and obstacles that need to be addressed to improve these methods' efficacy. Both research efforts highlight the critical need to develop sophisticated frameworks capable of accurately processing natural language requirements and producing corresponding test scripts. Taking a more practical perspective, Chrysalidis et al. developed an innovative semi-automated toolchain system that converts natural language requirements into executable code specifically for flight control platforms [17]. Their work illustrates how domain knowledge can be effectively combined with automation to streamline test script creation. The system enables engineers to structure requirements in modules that can then be automatically transformed into executable tests, providing a concrete implementation of theoretical concepts outlined in systematic reviews.
The research by Koroglu & Şen breaks new ground by applying reinforcement learning techniques to generate functional tests from UI test scenarios written in human-friendly languages like Gherkin [18]. Their approach tackles the complex challenge of converting high-level declarative requirements into concrete test scripts, effectively connecting natural language specifications with automated testing frameworks. Their research suggests machine learning can substantially improve both the precision and efficiency of test script generation. The literature also emphasizes the importance of automation tools like Selenium, with Rusdiansyah outlining optimal practices for web testing using Selenium, highlighting its capabilities for automating user interactions and enhancing test accuracy [19]. This aligns with Zasornova's observations that automated testing enables improved efficiency and faster detection of defects - crucial factors in contemporary software development [20]. Yutia presents an alternative approach through keyword-driven frameworks for automated functional testing. This methodology enables testers to develop succinct, adaptable test cases - particularly valuable when working with evolving natural language requirements [21]. The previous works on automated test script generation demonstrate consistent efforts to advance script generation from natural language requirements. The combination of systematic reviews, practical toolchains, and sophisticated approaches including reinforcement learning and keyword-driven frameworks shows a comprehensive strategy for addressing industry challenges. Malik et al. [22] made significant contributions through their work on automating test oracles from restricted natural language agile requirements. Their proposed Restricted Natural Language Agile Requirements Testing (ReNaLART) methodology employs structured templates to facilitate test case generation.
This research demonstrates how Large Language Models can effectively interpret and transform natural language requirements into functional test scripts, addressing inherent language ambiguities. Additionally, Kıcı et al. [23] investigated using BERT-based transfer learning to classify software requirements specifications. Their research reveals that Large Language Models can substantially enhance requirements understanding through classification, which subsequently aids in automated test script generation. Their findings emphasize the potential impact of Large Language Models in improving both accuracy and efficiency in test script generation processes. Raharjana et al. [24] conducted a systematic literature review examining how user stories and NLP techniques are utilized in agile software development. Their analysis demonstrates how NLP enhanced by Large Language Models can extract testable requirements from user stories for conversion into automated test scripts, reflecting the increasing adoption of LLMs to connect natural language requirements with automated testing frameworks. Liu et al. [25] introduced MuFBDTester, an innovative mutation-based system for generating test sequences for function block diagram programs. While their work centers on mutation testing, the core principles of generating test sequences from specifications can be enhanced through LLM integration. The ability of LLMs to parse complex specifications enables more efficient test sequence generation aligned with intended software functionality. Complementing this research, related work explores automated testing challenges in complex software systems, suggesting that LLMs could play a vital role in generating test cases that accurately reflect user requirements and enhance testing processes. Large Language Models like GPT-3 and its successors have shown exceptional capability in natural language understanding and generation, making them ideal for automated test script creation.
Ayenew explores NLP’s potential for automated test case generation from software requirements, emphasizing the efficiency benefits of automation [26]. These findings align with Leotta et al., who demonstrate that NLP-based test automation tools can dramatically reduce test case creation time, making testing more accessible to professionals without extensive programming expertise [27]. Wang et al. present an NLP-driven approach for generating acceptance test cases from use case specifications, showing how recent NLP advances facilitate test scenario identification and formal constraint generation, thereby improving the accuracy of test scripts derived from natural language requirements [28]. Beyond LLMs, researchers have explored various NLP applications in report generation across different domains. Chillakuru et al. examine NLP’s role in automating neuroradiology MRI protocols, demonstrating its ability to convert unstructured text into structured reports [29]. This capability translates well to software testing, where NLP can extract test scenarios from natural language requirements to streamline report generation. Bae et al. further demonstrate NLP’s versatility in automatically extracting quality indicators from free-text reports, a capability that can ensure generated test scripts align with quality standards specified in natural language requirements [30]. Similarly, Tignanelli et al. highlight the use of NLP techniques to automate the characterization of treatment appropriateness in emergency medical services, emphasizing the potential for NLP to enhance the quality and relevance of automated reports in various domains [31]. Despite these advancements, several research gaps remain in the application of LLMs and NLP techniques for functional test automation. One significant gap is the need for empirical validation of the effectiveness of NLP-based tools compared to traditional testing methods. Leotta et al.
note that while many NLP-based tools have been introduced, their superiority has not been rigorously tested in practice [32]. Additionally, there is a lack of comprehensive frameworks that integrate LLMs with existing testing methodologies, which could facilitate a more seamless transition from natural language requirements to automated test scripts. Furthermore, the challenge of handling ambiguous or incomplete requirements in natural language remains a critical issue. While LLMs have shown promise in interpreting complex text, the variability in natural language can lead to misinterpretations that affect the quality of generated test scripts. Research by Jen-tse et al. indicates that many generated test cases may not preserve the intended semantic meaning, leading to high false alarm rates. Addressing these challenges through improved training methodologies and more robust NLP techniques is essential for enhancing the reliability of automated test generation [32].

近年来,随着敏捷开发和持续集成实践的日益普及,将自然语言需求自动转换为功能测试脚本的研究兴趣逐渐增长。本文回顾了该领域的关键发展,特别关注能够将需求转化为可执行测试脚本的系统分析、方法和自动化工具。Mustafa 等人 [15] 的系统文献综述是该领域的开创性工作。他们的分析提供了从需求规范中衍生出的不同自动化测试生成方法的结构化概述。一个关键发现强调了测试生成方法必须仔细匹配以处理需求的固有特性,包括其潜在的模糊性和不完整性。在此基础上,一些研究人员 [16] 开发了一个基于需求的测试自动化技术的广泛分类系统,同时确定了需要解决的关键研究差距和障碍,以提高这些方法的有效性。这两项研究都强调了开发能够准确处理自然语言需求并生成相应测试脚本的复杂框架的迫切需求。

从更实际的角度出发,Chrysalidis 等人开发了一种创新的半自动化工具链系统,将自然语言需求转换为专门用于飞行控制平台的可执行代码 [17]。他们的工作展示了如何有效地将领域知识与自动化相结合,以简化测试脚本的创建。该系统使工程师能够将需求模块化,然后自动转换为可执行测试,提供了系统综述中概述的理论概念的具体实现。Koroglu 和 Şen 的研究通过应用强化学习技术从用 Gherkin 等人性化语言编写的 UI 测试场景中生成功能测试,开辟了新天地 [18]。他们的方法解决了将高级声明性需求转换为具体测试脚本的复杂挑战,有效地将自然语言规范与自动化测试框架连接起来。他们的研究表明,机器学习可以显著提高测试脚本生成的精度和效率。

文献还强调了 Selenium 等自动化工具的重要性,Rusdiansyah 概述了使用 Selenium 进行 Web 测试的最佳实践,强调了其在自动化用户交互和提高测试准确性方面的能力 [19]。这与 Zasornova 的观察一致,即自动化测试能够提高效率并更快地检测缺陷——这是当代软件开发中的关键因素 [20]。Yutia 提出了一种通过关键字驱动框架进行自动化功能测试的替代方法。这种方法使测试人员能够开发简洁、适应性强的测试用例——在处理不断变化的自然语言需求时特别有价值 [21]。

先前关于自动化测试脚本生成的研究表明,从自然语言需求中推进脚本生成的一致努力。系统综述、实用工具链以及包括强化学习和关键字驱动框架在内的复杂方法的结合,展示了解决行业挑战的全面策略。Malik 等人 [22] 通过他们的工作为从受限自然语言敏捷需求中自动化测试预言做出了重要贡献。他们提出的受限自然语言敏捷需求测试 (ReNaLART) 方法采用结构化模板来促进测试用例生成。这项研究展示了大语言模型如何有效地解释自然语言需求并将其转换为功能测试脚本,解决了固有的语言模糊性。此外,Kıcı 等人 [23] 研究了使用基于 BERT 的迁移学习对软件需求规范进行分类。他们的研究表明,大语言模型可以通过分类显著增强需求理解,从而有助于自动化测试脚本生成。他们的发现强调了大语言模型在提高测试脚本生成过程的准确性和效率方面的潜在影响。

Raharjana 等人 [24] 进行了一项系统文献综述,研究了用户故事和 NLP 技术在敏捷软件开发中的应用。他们的分析表明,通过大语言模型增强的 NLP 如何从用户故事中提取可测试需求,并将其转换为自动化测试脚本,反映了大语言模型在连接自然语言需求与自动化测试框架方面的日益普及。Liu 等人 [25] 引入了 MuFBD Tester,这是一种创新的基于突变的系统,用于生成功能块图程序的测试序列。虽然他们的工作集中在突变测试上,但通过大语言模型集成可以增强从规范生成测试序列的核心原则。大语言模型解析复杂规范的能力使得能够更高效地生成与预期软件功能一致的测试序列。

补充这项研究的是,探索了复杂软件系统中的自动化测试挑战,表明大语言模型可以在生成准确反映用户需求并增强测试过程的测试用例中发挥重要作用。像 GPT-3 及其后继者这样的大语言模型在自然语言理解和生成方面表现出色,使其成为自动化测试脚本创建的理想选择。Ayenew 探索了 NLP 从软件需求中自动生成测试用例的潜力,强调了自动化带来的效率优势 [26]。这些发现与 Leotta 等人一致,他们展示了基于 NLP 的测试自动化工具可以显著减少测试用例创建时间,使测试对没有广泛编程专业知识的人员更加易于访问 [27]。Wang 等人提出了一种基于 NLP 的方法,从用例规范中生成验收测试用例,展示了最近的 NLP 进展如何促进测试场景识别和形式约束生成,从而提高从自然语言需求中衍生的测试脚本的准确性 [28]。

除了大语言模型外,研究人员还探索了 NLP 在不同领域报告生成中的各种应用。Chillakuru 等人研究了 NLP 在自动化神经放射学 MRI 协议中的作用,展示了其将非结构化文本转换为结构化报告的能力 [29]。这种能力很好地转化为软件测试,NLP 可以从自然语言需求中提取测试场景,以简化报告生成。Bae 等人进一步展示了 NLP 在从自由文本报告中自动提取质量指标方面的多功能性,这种能力可以确保生成的测试脚本符合自然语言需求中指定的质量标准 [30]。同样,Tignanelli 等人强调了使用 NLP 技术自动化急诊医疗服务中治疗适当性特征描述的使用,强调了 NLP 在提高各个领域自动化报告质量和相关性方面的潜力 [31]。

尽管取得了这些进展,但在将大语言模型和 NLP 技术应用于功能测试自动化方面仍存在一些研究差距。一个显著的差距是需要对基于 NLP 的工具与传统测试方法的有效性进行实证验证。Leotta 等人指出,虽然已经引入了许多基于 NLP 的工具,但它们的优越性尚未在实践中得到严格测试 [32]。此外,缺乏将大语言模型与现有测试方法相结合的全面框架,这可以促进从自然语言需求到自动化测试脚本的更无缝过渡。此外,处理自然语言中模糊或不完整需求的挑战仍然是一个关键问题。虽然大语言模型在解释复杂文本方面表现出色,但自然语言的变异性可能导致误解,从而影响生成的测试脚本的质量。Jen-Tse 等人的研究表明,许多生成的测试用例可能无法保留预期的语义含义,导致高误报率。通过改进训练方法和更强大的 NLP 技术来解决这些挑战,对于提高自动化测试生成的可靠性至关重要 [32]。

3 Methodology

3 方法论

Some applications of Large Language Models have been effective in understanding natural language. However, challenges remain with respect to the amount of data that can be used as context to leverage the few-shot learning of LLMs. This makes the usage of LLMs in specialist domains such as test automation quite challenging. One of the solutions is to fine-tune the LLM on a large amount of data related to a specific field, which may result in better reasoning, but then again, it is a time-consuming and costly process and does not work well for dynamic data. For the specific application of enterprise test automation, in-context learning is the best approach. This is particularly useful for feeding large web DOM structures to an LLM for automation script generation, overall site sense-making, and extracting important insights from the website. After struggling with the limitations of in-context learning and trying a few approaches such as chunking, we have developed a novel approach to express the overall site structure so that the hierarchy remains intact. This is a critical element in our intelligent quality engineering solution. This research introduces an innovative methodology to construct a hierarchical structural representation of enterprise web applications, optimized for few-shot learning in large language models (LLMs). The approach leverages state-of-the-art functional test automation principles to ensure scalability, modularity, and enhanced contextual understanding. The proposed methodology is divided into five phases, each targeting a specific aspect of web application analysis and representation.

大语言模型在理解自然语言方面的一些应用已经取得了成效。然而,在可用于上下文的数据量方面仍存在挑战,这限制了利用大语言模型的少样本学习能力。这使得大语言模型在测试自动化等专业领域的使用变得相当具有挑战性。其中一个解决方案是对大语言模型进行微调,使其适应特定领域的大量数据,这可能会带来更好的推理能力,但这一过程耗时且成本高昂,并且对动态数据的处理效果不佳。对于企业测试自动化的特定应用,上下文学习是最佳方法。这对于将大型网页 DOM 结构输入大语言模型以生成自动化脚本、整体网站理解以及从网站中获取重要见解特别有用。在经历了上下文学习的局限性并尝试了分块等几种方法后,我们开发了一种新颖的方法来表达整体网站结构,以保持层次结构的完整性。这是我们智能质量工程解决方案中的一个关键要素。本研究介绍了一种创新的方法,用于构建企业 Web 应用程序的层次结构表示,该表示针对大语言模型的少样本学习进行了优化。该方法利用最先进的功能测试自动化原则,以确保可扩展性、模块化和增强的上下文理解。所提出的方法分为五个阶段,每个阶段针对 Web 应用程序分析和表示的特定方面。

3.1 Phase 1: Comprehensive DOM Analysis and Data Structuring

3.1 第一阶段:全面DOM分析与数据结构化

The first phase involves a comprehensive analysis of the Document Object Model (DOM) of the target web application. Utilizing a custom scraping agent, the methodology extracts all interactive and non-interactive elements from every page of the application, starting from the base URL. Key features of this phase include:

第一阶段涉及对目标 Web 应用程序的文档对象模型 (Document Object Model, DOM) 进行全面分析。该方法利用自定义的抓取代理,从应用程序的基 URL 开始,提取每个页面中的所有交互和非交互元素。此阶段的关键特征包括:

The extracted information is encapsulated into structured representations to facilitate downstream processing. This structuring preserves the integrity of the application’s element hierarchy and contextual relationships, ensuring compatibility with LLM-based reasoning.

提取的信息被封装成结构化表示,以便于下游处理。这种结构化保留了应用程序元素层次结构和上下文关系的完整性,确保与基于大语言模型的推理兼容。
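As a minimal sketch of this phase, assuming Python's standard-library `html.parser` in place of the paper's (unspecified) custom scraping agent, interactive and non-interactive elements can be extracted from a page's DOM and encapsulated into structured records:

```python
from html.parser import HTMLParser

# Tags treated as interactive; everything else is recorded as static.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea", "form"}

class DOMExtractor(HTMLParser):
    """Collects a structured record for every element encountered on the page."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.elements.append({
            "tag": tag,
            "id": attrs.get("id"),
            "kind": "interactive" if tag in INTERACTIVE_TAGS else "static",
            "attrs": attrs,
        })

def analyze_page(html):
    parser = DOMExtractor()
    parser.feed(html)
    return parser.elements

# A tiny login page in the spirit of the evaluated Swag Labs application.
page = """
<html><body>
  <h1>Login</h1>
  <form id="login-form">
    <input id="user-name" type="text"/>
    <input id="password" type="password"/>
    <button id="login-button">Login</button>
  </form>
</body></html>
"""

elements = analyze_page(page)
interactive = [e for e in elements if e["kind"] == "interactive"]
```

The same record shape can then carry locator attributes and contextual relationships downstream to the multi-page synthesis phase.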

3.2 Phase 2: Multi-Page Analysis and Site-Wise Synthesis

3.2 第二阶段:多页面分析与站点综合

The methodology extends individual page analysis to a multi-page context by synthesizing relationships across the application. This phase comprises:

该方法通过综合应用程序中的关系,将单个页面分析扩展到多页面上下文。此阶段包括:

The outcome of this phase is a comprehensive site structure that encodes navigational pathways, interactive dependencies, dynamic state transitions, and page types. This hierarchical structure is crafted to maximize LLM interpretability while minimizing input size constraints. Advanced chunking algorithms ensure that the hierarchical integrity is preserved, enabling effective reasoning over multi-layered representations. Our approach of representing the web application in this structured format maximizes outcomes by utilizing the LLM's in-context learning. As our test generation approach is instruction-driven, we can leverage this representation of the web application structure to generate test cases that strictly follow the instructions.

此阶段的成果是一个全面的站点结构,该结构编码了导航路径、交互依赖关系、动态状态转换以及页面类型。这种层次结构旨在最大化大语言模型 (LLM) 的解释能力,同时最小化输入大小的限制。先进的分块算法确保了层次结构的完整性,从而能够在多层表示上进行有效的推理。我们通过这种结构格式来表示 Web 应用程序,利用大语言模型的上下文学习能力来最大化成果。由于我们的测试生成方法是指令驱动的,因此我们可以利用这种 Web 应用程序结构的表示来生成严格遵循指令的测试用例。
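One possible shape for the synthesized site structure, sketched below with illustrative field names (not the paper's exact schema), is a nested dictionary that keeps page types, element inventories, and navigation edges in a single hierarchy and serializes compactly for an LLM prompt:

```python
import json

# Illustrative site structure for Swag Labs; field names are assumptions.
site_structure = {
    "base_url": "https://www.saucedemo.com",
    "pages": {
        "/": {
            "type": "login",
            "elements": ["user-name", "password", "login-button"],
            "links_to": ["/inventory.html"],
        },
        "/inventory.html": {
            "type": "product_list",
            "elements": ["add-to-cart", "shopping-cart-link"],
            "links_to": ["/cart.html", "/"],
        },
        "/cart.html": {
            "type": "cart",
            "elements": ["checkout", "continue-shopping"],
            "links_to": ["/inventory.html"],
        },
    },
}

def navigation_paths(structure, start, depth=2):
    """Enumerate navigational pathways up to `depth` hops, skipping cycles."""
    paths, frontier = [], [[start]]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            for nxt in structure["pages"][path[-1]]["links_to"]:
                if nxt not in path:
                    next_frontier.append(path + [nxt])
                    paths.append(path + [nxt])
        frontier = next_frontier
    return paths

# Compact serialization keeps the hierarchy intact while minimizing prompt size.
compact = json.dumps(site_structure, separators=(",", ":"))
paths = navigation_paths(site_structure, "/")
```

Because the hierarchy survives serialization, navigational pathways can be enumerated either before prompting or by the LLM itself from the same structure.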

3.3 Phase 3: Contextual and Relevant Test Suite Generation and Validation

3.3 阶段 3:上下文相关测试套件的生成与验证

This phase demonstrates the utility of the hierarchical structure through the generation of context-aware test suites. The overall site representation is passed to the LLM in a formatted prompt, iteratively for each page. Each prompt contains information about already generated test cases, extracted URL patterns, the provided instructions, the available elements, and the required test types. The required test types are the combination of the predefined test types, E, and the test types extracted from the instructions, A. The required test types R are given by:

最终阶段通过生成上下文感知的测试套件展示了分层结构的实用性。整个站点的表示以格式化提示的方式迭代传递给大语言模型 (LLM) ,每个页面的提示包含已生成的测试用例、提取的 URL 模式、提供的指令、可用元素以及所需的测试类型等信息。所需的测试类型是预定义测试类型 E 和从指令中提取的测试类型 A 的组合。

$$
R=E\cup A
$$

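The union above maps directly onto a set operation; a trivial Python sketch (the concrete test-type names are illustrative):

```python
# E: predefined test types; A: test types extracted from the user's instruction.
E = {"functional", "navigation", "form_validation"}
A = {"form_validation", "accessibility"}

# R = E ∪ A: the required test types, with duplicates collapsed by the union.
R = E | A
```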

This involves:

这涉及:

  1. Iterative Refinement: Incorporating feedback loops to optimize test quality, ensuring alignment with application behaviour and hierarchical context.
     迭代优化:通过引入反馈循环来优化测试质量,确保与应用程序行为及层次结构上下文的契合。

The generated test suites serve as a practical validation of the methodology’s capability to encapsulate enterprise web application structures in a format conducive to few-shot learning and automated reasoning.

生成的测试套件作为该方法能力的实际验证,能够以有利于少样本学习和自动推理的格式封装企业Web应用程序结构。
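The iterative, per-page prompting described in this phase might be sketched as follows; `fake_llm`, the prompt wording, and the page list are placeholders, not the paper's implementation:

```python
def build_prompt(page, site_repr, generated_so_far, instructions, required_types):
    """Assemble the per-page prompt from the components named in Section 3.3."""
    return (
        f"Site structure:\n{site_repr}\n\n"
        f"Target page: {page}\n"
        f"Already generated test cases: {generated_so_far}\n"
        f"Instructions: {instructions}\n"
        f"Required test types: {sorted(required_types)}\n"
        "Generate test cases that strictly follow the instructions."
    )

def generate_test_suite(pages, site_repr, instructions, required_types, call_llm):
    suite, generated = {}, []
    for page in pages:  # iterate page by page, feeding earlier output back in
        prompt = build_prompt(page, site_repr, generated, instructions, required_types)
        cases = call_llm(prompt)
        suite[page] = cases
        generated.extend(cases)  # feedback loop for iterative refinement
    return suite

# Stub standing in for the real LLM client, for demonstration only.
def fake_llm(prompt):
    return ["verify page loads", "verify required elements are present"]

suite = generate_test_suite(
    ["/", "/cart.html"], '{"pages": {...}}', "focus on checkout",
    {"functional"}, fake_llm,
)
```

Carrying `generated` forward is what lets each later prompt avoid duplicating test cases already produced for earlier pages.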

3.4 Phase 4: Test Suite Execution

3.4 阶段 4: 测试套件执行

After we have generated a semantically rich formal representation of the web application in the phases above, we can execute test cases by driving a test framework. We also need to map specific test data to the test steps under each test case, or alternatively generate synthetic data, to execute autonomous test cases. This phase has two major parts:

在我们通过上述四个阶段生成了具有丰富语义的Web应用程序形式化表示后,我们可以通过驱动测试框架来执行测试用例。我们还需要将特定的测试数据映射到每个测试用例下的测试步骤,或者生成合成数据以执行自主测试用例。此阶段包含两个主要部分:

  1. Test Data Mapping or Synthetic Data generation: Based on element pattern and test case, we employ LLM to map test data. The test data schema and test case with valid test steps are used as context for LLM to return proper mapping. If test data is not present, LLM’s in context learning is being used to generate meaningful synthetic data for given iteration to enable test suites for execution.
  2. 测试数据映射或合成数据生成:基于元素模式和测试用例,我们使用大语言模型 (LLM) 来映射测试数据。测试数据模式和包含有效测试步骤的测试用例被用作上下文,以便大语言模型返回正确的映射。如果测试数据不存在,则利用大语言模型的上下文学习能力为给定的迭代生成有意义的合成数据,以便测试套件能够执行。
  3. Interpreter for Test Suite to Test Automation Framework Language: Test automation framework library or tool can be used by writing code in supported programming languages or by using tool’s own language or representations. Our approach involves training LLM with appropriate tool specific knowledge to overcome this interpretation challenge. According to our experiments, LLM tends to understand codes better than tool specific languages. For interpretation, our approach converts each test steps into tool specific representations. The representation uses set of defined actions for the tool to be used.
  2. 测试套件到测试自动化框架语言的解释器:测试自动化框架库或工具可以通过使用支持的编程语言编写代码或使用工具自身的语言或表示来使用。我们的方法涉及训练大语言模型(LLM)以掌握特定工具的知识,从而克服这一解释挑战。根据我们的实验,LLM 往往比特定工具的语言更能理解代码。为了进行解释,我们的方法将每个测试步骤转换为特定工具的表示。该表示使用为要使用的工具定义的一组操作。
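The interpretation step described above can be sketched as a small mapping from abstract test steps to tool-specific statements. This is a minimal illustration, not the paper's implementation: `ACTION_MAP`, `interpret_step`, and the step fields are assumed names, and the output strings merely imitate Selenium's Python syntax.

```python
# Minimal sketch of the step-to-tool interpreter: each abstract test step
# produced by the LLM is converted into a Selenium-flavoured statement via
# a fixed set of defined actions. All names here are illustrative.

ACTION_MAP = {
    "enter_text": 'driver.find_element(By.{by}, "{locator}").send_keys("{value}")',
    "click": 'driver.find_element(By.{by}, "{locator}").click()',
    "assert_text": 'assert "{value}" in driver.page_source',
}

def interpret_step(step: dict) -> str:
    """Translate one abstract test step into a tool-specific statement."""
    template = ACTION_MAP[step["action"]]
    return template.format(
        by=step.get("by", "id").upper(),
        locator=step.get("locator", ""),
        value=step.get("value", ""),
    )

steps = [
    {"action": "enter_text", "by": "id", "locator": "user-name", "value": "standard_user"},
    {"action": "enter_text", "by": "id", "locator": "password", "value": "secret_sauce"},
    {"action": "click", "by": "id", "locator": "login-button"},
]
script_lines = [interpret_step(s) for s in steps]
```

Keeping the action set small and explicit is what makes the interpretation reliable: the LLM only has to emit steps drawn from a closed vocabulary, while the deterministic mapping handles tool syntax.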

3.5 Phase 5: Test Report and Analysis

3.5 阶段 5: 测试报告与分析

All test suites executed by the proposed system produce results. These results are easily understandable by test engineers but might not make much sense to individuals from a non-testing background. Our approach therefore adds LLM-based summaries on top of the results, which are easy to understand for any individual with a bit of technical knowledge. Our representation of the result also leverages the LLM's in-context learning, as the result must correlate with the web application for which the test cases were generated in the first place. To point out the exact issue in the system from the test result, our representation of the web application again plays a role, as a test-report enhancer. This behaviour mimics experienced test engineers, who draw on extensive knowledge of the web application when reporting test results.

所提出的系统执行的所有测试套件都会产生结果。系统产生的结果对于测试工程师来说很容易理解,但对于非测试背景的个人来说可能意义不大。因此,我们的方法包括在结果之上生成基于大语言模型的摘要,这些摘要对于具有一定技术知识的任何人都易于理解。我们的结果表示还利用了大语言模型的上下文学习能力,因为结果必须与最初生成测试用例的Web应用程序相关联。为了从测试结果中指出系统的确切问题,我们的Web应用程序表示再次扮演了测试报告增强器的角色。这种行为模仿了经验丰富的测试工程师,他们在生成测试结果报告时对Web应用程序有深入的了解。
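The pairing of execution results with the page representation can be sketched as a prompt-assembly step. This is an illustrative sketch only: the function and field names (`build_report_prompt`, `element_id`, etc.) are assumptions, not the paper's API.

```python
# Illustrative Phase-5 sketch: the raw execution result is paired with the
# page representation so the LLM can correlate each failure with a concrete
# element, as an experienced test engineer would when writing a report.

def build_report_prompt(result: dict, page_repr: dict) -> str:
    failed = [t for t in result["tests"] if t["status"] == "failed"]
    lines = [
        "You are a senior test engineer. Summarise this run for a non-tester.",
        f"Application page: {page_repr['page']}",
        f"Passed: {len(result['tests']) - len(failed)} / {len(result['tests'])}",
    ]
    for t in failed:
        # Point the LLM at the exact element the failing step touched.
        elem = page_repr["elements"].get(t["element_id"], "unknown element")
        lines.append(f"FAILED {t['id']}: step on '{elem}' -> {t['error']}")
    return "\n".join(lines)

run = {"tests": [
    {"id": "TC01", "status": "passed", "element_id": "login-button", "error": ""},
    {"id": "TC02", "status": "failed", "element_id": "user-name", "error": "timeout"},
]}
page = {"page": "Login", "elements": {"user-name": "username input",
                                      "login-button": "login button"}}
prompt = build_report_prompt(run, page)
```

Because the element descriptions come from the same representation used to generate the tests, the summary can name the failing widget rather than just the failing locator.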

The entire set of phases is represented in Fig. 1, which illustrates the process and data flow. Phase 1 starts with the scraper; phase 2 formats the data into the proposed representation. From phase 3 onwards the formatted data is utilized, and in phase 4 the represented format is expanded. Finally, in phase 5, the process flow ends with the final reasoning step, complete test reporting.

整个过程如图 1 所示。图 1 展示了各个阶段的流程和数据流。第一阶段从抓取器开始,然后第二阶段是关于将数据格式化为提议的表示形式。从第三阶段开始,格式化数据开始被利用,在第四阶段,表示形式得到扩展。最后在第五阶段,通过完整的测试报告,最终推理步骤结束,流程结束。

4 Experiment

4 实验

To evaluate our novel approach for developing hierarchical structure representation of enterprise web applications and its utility in enabling few-shot learning with large language models (LLMs), we conducted comprehensive experiments. These experiments were designed to demonstrate the effectiveness of AI-driven methodologies, including web scraping, site analysis, automated test case generation, and execution, across different web application contexts.

为了评估我们提出的用于开发企业Web应用程序层次结构表示的新方法及其在实现大语言模型(LLMs)少样本学习中的实用性,我们进行了全面的实验。这些实验旨在展示AI驱动方法(包括网络爬虫、站点分析、自动化测试用例生成和执行)在不同Web应用场景中的有效性。

4.1 Objectives

4.1 目标

  1. Validation of scraped data for sense-making from web applications
  1. 验证从Web应用程序抓取的数据能否用于意义理解


Figure 1: Overall phase flow

图 1: 整体阶段流程

4.1.1 Experimental Setup

4.1.1 实验设置

The experiments were carried out on two different enterprise web applications: Swag Labs, an e-commerce platform, and MediBox, a healthcare application deployed within the Atalgo engineering environment.

实验在两个不同的企业 Web 应用程序上进行:电子商务平台 Swag Labs 和部署在 Atalgo 工程环境中的医疗保健应用程序 MediBox。

Both applications provided an opportunity to test our approach under different structural and functional paradigms, allowing a broad evaluation of the methodology.

两个应用为我们提供了在不同结构和功能范式下测试方法的机会,从而能够对方法论进行广泛的评估。

4.2 Scope of Experiments

4.2 实验范围

To evaluate our approach, we need to evaluate the results of each phase individually. Each phase depends on our data representation, and we experimented with each phase using our data representation to assess the correctness of that representation. In this section, we discuss the experiments conducted.

为了评估我们的方法,我们需要分别评估每个阶段的结果。每个阶段都依赖于我们的数据表示,并且我们已经通过实验对每个阶段的数据表示进行了测试,以评估正确的表示方式。在本节中,我们将讨论所进行的实验。

4.2.1 Web Scraping and Data Extraction

4.2.1 网络爬虫与数据提取

For both applications, we utilized an AI-powered scraping module to extract metadata for each HTML element. This process was essential for building a structured hierarchical representation of the web applications. We initially tried extracting all features, and even tried extracting and passing the entire DOM tree. These earlier approaches induced problems: there was a lot of unnecessary data clouding the LLM's decisions, which resulted in a number of irrelevant test cases, and the quality of web application sense-making through the LLM was not up to the mark. Through our experiments, we found that elements need a unique identifier apart from their locator. Locators for sibling elements sometimes look almost identical, which can mislead the LLM so that a generated test case ends up targeting the wrong elements. The overall DOM tree can also be large, which makes it impractical to provide all of the data to the LLM for web application sense-making through in-context learning, since LLMs have a fixed context window. We cannot feed the LLM the entire DOM tree; even the essential extracted features alone are impractical. We have to make sure our solution works for any web application, as this is going to be the foundational technology behind generative-AI-based intelligent quality engineering. For the reasons outlined above, the traditional chunking method, which we tried before developing this approach, also does not work. In our current approach, we preserved the hierarchical order and then chunked the DOM tree page-wise. We captured the navigation flow as well; that way, even after page-wise DOM tree chunking, the LLM retains an understanding of the page flow and the next steps for test generation. The extracted data included:

在这两个应用中,我们利用了一个由AI驱动的抓取模块来提取每个HTML元素的元数据。这一过程对于构建Web应用的结构化层次表示至关重要。我们尝试过提取所有特征,甚至尝试提取并传递整个DOM树。早期的这些方法引发了一些问题。存在大量不必要的数据,这些数据干扰了大语言模型的决策,导致生成了许多不相关的测试用例。此外,通过大语言模型进行的Web应用理解效果也不理想。通过实验,我们发现元素除了定位器外还需要唯一的标识符。有时,兄弟元素的定位器看起来几乎相同,这可能会误导大语言模型,生成的测试用例可能会错误地指向错误的元素。整个DOM树可能非常大,这使得通过上下文学习向大语言模型提供整体数据以进行Web应用理解变得不切实际,因为大语言模型有固定的上下文窗口。我们不能将整个DOM树输入大语言模型,即使是仅提取的必要特征也不切实际。我们必须确保我们的解决方案适用于任何Web应用,因为这将是基于生成式AI的智能质量工程的基础技术。出于上述原因,传统的分块方法在我们开发这种方法之前尝试过,但也不奏效。在我们当前的方法中,我们保留了层次顺序,然后按页面分块DOM树。我们还捕获了导航流程。这样,即使在按页面分块DOM树后,也能确保大语言模型理解页面流程和测试生成的下一步。提取的数据包括:
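The page-wise chunking with preserved navigation flow can be sketched as follows. The data shapes (`pages`, `nav_flow`, and the chunk fields) are illustrative assumptions, not the paper's actual structures.

```python
# Minimal sketch of page-wise DOM chunking: instead of the whole DOM tree,
# each chunk holds one page's elements in hierarchical order plus the
# navigation edges out of that page, so cross-page flow stays visible.

def chunk_site(pages: dict, nav_flow: dict) -> list:
    """Produce one LLM-sized chunk per page, annotated with navigation."""
    chunks = []
    for name, elements in pages.items():
        chunks.append({
            "page": name,
            "elements": elements,                    # hierarchical order preserved
            "navigates_to": nav_flow.get(name, []),  # keeps cross-page flow
        })
    return chunks

pages = {
    "Login": ["input#user-name", "input#password", "button#login-button"],
    "Inventory": ["div.inventory_list", "button.add-to-cart"],
}
nav = {"Login": ["Inventory"]}
chunks = chunk_site(pages, nav)
```

Each chunk now fits comfortably in a context window, while the `navigates_to` edges tell the LLM which page a successful interaction leads to, which is what enables multi-page test generation.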

For instance, in Swag Labs, interactive elements such as the username and password input fields, login button, and error containers were captured. Similarly, in MediBox test application, key elements on the signup form, including fields for full name, contact number, email, password, and confirmation password, were identified.

例如,在 Swag Labs 中,捕获了用户名和密码输入字段、登录按钮和错误容器等交互元素。同样,在 MediBox 测试应用程序中,识别了注册表单上的关键元素,包括全名、联系电话、电子邮件、密码和确认密码等字段。

4.2.2 Site Analysis and Hierarchical Representation

4.2.2 站点分析与层次化表示

The scraped data was fed into a site analysis module to organize elements into logical sections and establish their relationships. This module created a hierarchical representation of the web applications by:

抓取的数据被输入到站点分析模块中,以将元素组织成逻辑部分并建立它们之间的关系。该模块通过以下方式创建了Web应用程序的层次结构表示:

• Categorizing elements into sections such as navigation links, input forms, and feedback mechanisms.
• Mapping relationships to construct a complete DOM tree, allowing downstream processing to focus on key interactive regions of the applications.

• 将元素分类为导航链接、输入表单和反馈机制等部分。
• 映射关系以构建完整的 DOM 树,使下游处理能够专注于应用程序的关键交互区域。
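The categorization step can be illustrated with a toy rule-based grouping. The rules and field names below are hypothetical, chosen only to show the section structure the analysis module might emit; the actual module is AI-driven.

```python
# Hypothetical sketch of the site-analysis categorization: scraped elements
# are grouped into the sections named above (navigation, forms, feedback).
# The classification rules here are illustrative stand-ins.

def analyse(elements: list) -> dict:
    sections = {"navigation": [], "forms": [], "feedback": []}
    for e in elements:
        if e["tag"] == "a":
            sections["navigation"].append(e["id"])
        elif e["tag"] in ("input", "button", "select"):
            sections["forms"].append(e["id"])
        elif "error" in e.get("class", ""):
            sections["feedback"].append(e["id"])
    return sections

elems = [
    {"tag": "a", "id": "nav-home"},
    {"tag": "input", "id": "user-name"},
    {"tag": "button", "id": "login-button"},
    {"tag": "div", "id": "error-box", "class": "error-message-container"},
]
site = analyse(elems)
```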

The analysis facilitated by our approach effectively transformed unstructured web data into a hierarchical format, serving as a foundation for automated test case generation.

我们的方法所促进的分析有效地将非结构化网络数据转换为分层格式,为自动化测试用例生成奠定了基础。

4.2.3 Automated Test Case Generation Using LLMs

4.2.3 使用大语言模型自动生成测试用例

Our methodology leveraged generative AI and LLMs to automatically generate functional test cases. The structured hierarchical data served as input, and the LLMs were prompted to create test cases covering diverse scenarios, including:

我们的方法利用生成式 AI (Generative AI) 和大语言模型 (LLM) 自动生成功能测试用例。结构化分层数据作为输入,大语言模型被提示创建涵盖多种场景的测试用例,包括:

For example, in MediBox, test cases included verifying unique email and mobile number registration, password strength validation, and successful user registration.

例如,在MediBox中,测试用例包括验证唯一电子邮件和手机号码注册、密码强度验证以及成功用户注册。
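The few-shot prompting described above can be sketched as a prompt-assembly helper: one worked example is placed in context next to the page chunk so the LLM can imitate the test-case format. The exact prompt wording and format here are assumptions.

```python
# Illustrative few-shot prompt assembly for test-case generation. The
# example case and field names are made up for demonstration.

EXAMPLE_CASE = ("TC01 | Login with valid credentials | High | "
                "Enter username, enter password, click login, expect inventory page")

def build_generation_prompt(page_chunk: dict, n_cases: int = 10) -> str:
    return "\n".join([
        "Generate functional test cases in the format shown.",
        f"Example: {EXAMPLE_CASE}",
        f"Page: {page_chunk['page']}",
        "Elements: " + ", ".join(page_chunk["elements"]),
        f"Produce {n_cases} cases covering positive and negative scenarios.",
    ])

chunk = {"page": "Login",
         "elements": ["input#user-name", "input#password", "button#login-button"]}
prompt = build_generation_prompt(chunk)
```

Because the elements come from the hierarchical representation, the generated cases reference real locators rather than invented ones.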

4.2.4 Test Case Execution and Data Handling

4.2.4 测试用例执行与数据处理

Once generated, the test cases were executed systematically against the respective applications. The execution process included:

生成测试用例后,系统地对相应的应用程序执行了这些测试用例。执行过程包括:

Execution results, including logs and screenshots, were stored for further analysis.

执行结果,包括日志和截图,被存储以供进一步分析。

4.3 Applications and Scope

4.3 应用与范围

The hierarchical representation enabled effective utilization of LLMs in a few-shot learning setup, reducing the dependency on large labelled datasets. By focusing on structural and functional representations of web applications, the approach provided the following advantages:

分层表示使得在大语言模型 (LLM) 的少样本学习设置中能够有效利用,减少了对大规模标注数据集的依赖。通过关注 Web 应用程序的结构和功能表示,该方法提供了以下优势:

• Scalability: The pipeline could adapt to various web applications with minimal customization.
• Automation: From data extraction to test case execution, the process minimized human intervention.
• Precision: The hierarchical representation ensured that LLMs received contextually relevant prompts, leading to accurate and comprehensive test cases.

• 可扩展性 (Scalability):该管道能够以最少的定制适应各种网络应用程序。
• 自动化 (Automation):从数据提取到测试用例执行,整个过程最大限度地减少了人为干预。
• 精确性 (Precision):分层表示确保了大语言模型接收到上下文相关的提示,从而生成准确且全面的测试用例。

4.4 Experimental Design Highlights

4.4 实验设计亮点

For Swag Labs, 10 functional test cases focused on login validation were developed and executed. For MediBox, another set of 10 test cases targeted the user signup process, ensuring robust field validation and error handling. Each test case was categorized by priority, with high-priority cases addressing core functionalities such as login authentication and secure user registration.

针对 Swag Labs,开发并执行了 10 个专注于登录验证的功能测试用例。对于 MediBox,另一组 10 个测试用例针对用户注册流程,确保强大的字段验证和错误处理。每个测试用例都按优先级分类,高优先级用例涉及核心功能,如登录认证和安全用户注册。

4.5 Execution Workflow

4.5 执行工作流

The execution of test cases across both Swag Labs and MediBox applications followed a systematic workflow, leveraging the hierarchical representation to guide the interactions:

在 Swag Labs 和 MediBox 应用程序中执行测试用例遵循了系统化的工作流程,利用层次化表示来指导交互:

1. Test Case Initialization:

1. 测试用例初始化:

2. Environment Setup:

2. 环境设置:

3. Simulation of User Interactions:

3. 用户交互模拟:

4. Real-Time Validation:

4. 实时验证:

5. Dynamic Data Handling:

5. 动态数据处理:

• Synthetic data generation modules were employed to provide realistic test inputs, particularly for MediBox's signup form, covering edge cases like invalid email formats and weak passwords.
• This ensured that the testing covered a wide spectrum of potential user inputs.

• 使用了合成数据生成模块来提供真实的测试输入,特别是针对 MediBox 的注册表单,涵盖了无效电子邮件格式和弱密码等边缘情况。
• 这确保了测试覆盖了广泛的潜在用户输入。
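In the paper, these synthetic inputs come from the LLM's in-context learning; the deterministic stand-in below only illustrates the kinds of valid and edge-case signup values that were covered. All values and field names are made up.

```python
# Stand-in for LLM-generated synthetic signup data: one valid baseline plus
# deliberately invalid variants for negative tests (values are invented).

def signup_test_inputs():
    valid = {
        "full_name": "Jane Doe",
        "email": "jane.doe@example.com",
        "mobile": "01712345678",
        "password": "Str0ng!Pass",
    }
    edge_cases = [
        {**valid, "email": "jane.doe@", "expect": "invalid email format"},
        {**valid, "password": "123", "expect": "weak password"},
        {**valid, "mobile": "12ab", "expect": "invalid mobile format"},
    ]
    return valid, edge_cases

valid, edge_cases = signup_test_inputs()
```

Deriving each edge case from the valid baseline keeps every other field well-formed, so a failure can be attributed to the single invalid field.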

6. Logging and Reporting:

6. 日志记录与报告:

4.6 Challenges Encountered

4.6 遇到的挑战

While the experimental pipeline was robust, certain challenges highlighted areas for further refinement:

虽然实验流程稳健,但某些挑战凸显了进一步改进的领域:

4.7 Evaluation Metrics

4.7 评估指标

To evaluate the effectiveness of the experiment, the following metrics were used:

为了评估实验的有效性,使用了以下指标:

5 Result and Analysis

5 结果与分析

As part of our experiments, we generated and executed test suites. The ultimate goal is to provide quality assurance for software applications using generative AI. For the purposes of this specific research, we wanted to confirm whether our unique approach to site representation, built to enable the LLM to conduct in-context learning, helps us achieve effective quality engineering. We discussed as part of the methodology how this approach to web application representation is foundational to achieving AI-augmented quality engineering. To evaluate the overall efficiency of our data structure, we assessed the quality of the generated test cases. The evaluation criteria we established for this assessment are:

作为我们实验的一部分,我们生成并执行了测试套件。最终目标是能够使用生成式 AI 为软件应用程序提供质量保证。在本研究中,我们希望确认我们独特的站点表示方法是否有助于实现有效的质量工程,从而使大语言模型能够进行上下文学习。我们在方法论部分讨论了这种 Web 应用程序表示方法如何为实现 AI 增强的质量工程奠定基础。为了评估我们数据结构的整体效率,我们评估了生成的测试用例的质量。我们为此评估建立的评估标准是:

These evaluation criteria are relevant for any enterprise test automation project and are regularly used by automation engineers, implicitly or explicitly. The final test execution results are produced by the Selenium test automation tool, which is orchestrated by our AI platform, Flame. Based on the outlined evaluation criteria, the performance and efficiency of the developed approach for representing enterprise web application structures using Large Language Models (LLMs) in the service of intelligent quality engineering were analysed using two distinct use cases: the Swag Labs web application and Atalgo Engineering's development/testing environment, known as the MediBox platform. We present below the results of the various components of our experiments, highlighting the relevance and impact of the novel structure in overall functional testing. We will showcase the generated test suites for both the MediBox and Swag Labs platforms, then provide an assessment against the evaluation criteria.

这些评估标准适用于任何企业测试自动化项目,并且经常被自动化工程师隐式或显式地使用。最终的测试执行结果由 Selenium 测试自动化工具生成,该工具由我们的 AI 平台 Flame 进行编排。基于上述评估标准,我们通过两个不同的用例分析了使用大语言模型 (LLMs) 表示企业 Web 应用程序结构以服务于智能质量工程的性能和效率:Swag Labs Web 应用程序和 Atalgo Engineering 的开发/测试环境,即 MediBox 平台。我们在下面展示了实验的各个组件的结果,强调了新结构在整体功能测试中的相关性和影响。我们将展示为 MediBox 和 Swag Labs 平台生成的测试套件,然后根据评估标准进行评估。

5.1 Evaluation

5.1 评估

We split the entire evaluation process into five steps. As the novel representation method is used across five phases, we had to evaluate each phase to obtain a complete evaluation. The phases are sequentially dependent, i.e., the next step always depends on the output of the previous step. We discuss the evaluation steps in this section:

我们将整个评估过程分为5个步骤。由于新的表示方法在5个阶段中使用,我们必须对每个阶段进行评估以获得完整的评估结果。这些过程或阶段是顺序依赖的,即下一步总是依赖于上一步的输出。让我们在本节中讨论评估步骤:

Table 1: Generated Test Suite for Medibox Application

表 1: Medibox 应用程序生成的测试套件

测试用例 ID 测试用例名称 优先级 描述
TC01 验证导航到用户注册页面 检查是否可以从主页导航到用户注册页面。
TC02 验证唯一手机号注册 确保在注册过程中每个用户的手机号是唯一的。
TC03 验证唯一邮箱地址注册 确保在注册过程中每个用户的邮箱地址是唯一的。
TC04 验证密码和确认密码匹配 检查在表单提交前密码和确认密码字段是否匹配。
TC05 验证用户注册表单中的必填字段 测试邮箱字段是否接受有效输入。
TC06 验证用户注册成功 确保在表单提交前所有必填字段都已填写。
TC07 验证密码强度要求 确保密码符合所需的强度标准。
TC08 验证邮箱地址格式验证 检查输入的邮箱地址是否符合正确的格式。
TC09 验证手机号格式验证 检查输入的手机号是否符合正确的格式。
TC10 验证注册后导航到登录页面 检查用户注册后是否可以导航到登录页面。

Table 2: Generated Test Suite for Swag Labs

表 2: Swag Labs 生成的测试套件

测试用例 ID 测试用例名称 优先级 描述
TC01 使用有效凭据登录 使用有效的用户名/密码测试登录功能。
TC02 使用无效用户名登录 使用无效用户名和有效密码测试登录。
TC03 使用无效密码登录 使用有效用户名和无效密码测试登录。
TC04 使用空用户名登录 验证用户名为空时的登录错误。
TC05 使用空密码登录 验证密码为空时的登录错误。
TC06 使用锁定用户登录 测试锁定用户的凭据返回正确的错误。
TC07 使用性能故障用户登录 使用 performance_glitch_user 测试登录。
TC08 使用所有字段为空登录 验证用户名/密码为空时的错误消息。
TC09 验证无效输入的错误消息 验证凭据无效时的错误消息。
TC10 使用问题用户登录 使用 problem_user 测试登录并验证问题。

5.1.1 Test Case Execution Success Rate

5.1.1 测试用例执行成功率

The test execution success rate was calculated as the percentage of test cases that executed successfully without errors (and behaved as expected). We used Selenium (a test automation platform) to perform automated execution of the generated test suites. The results for both applications are summarized in Table 3.

测试执行成功率计算为成功执行且无错误的测试用例所占百分比(预期它们的行为方式相同)。我们使用了 Selenium(测试自动化平台)从生成的测试套件中执行自动化测试。两个应用程序的结果总结在表 3 中。

Table 3: Test Case Execution Success Rate

表 3: 测试用例执行成功率

应用程序 总测试用例数 通过的测试用例数 失败的测试用例数 成功率 (%)
Swag Labs 10 9 1 90.00
MediBox 10 7 3 70.00
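The success rates in Table 3 follow directly from the pass counts; a one-line helper makes the computation explicit:

```python
# Table 3's success rate is simply passed / total expressed as a percentage.
def success_rate(passed: int, total: int) -> float:
    return round(100.0 * passed / total, 2)

swag_labs = success_rate(9, 10)   # 90.0 for Swag Labs
medibox = success_rate(7, 10)     # 70.0 for MediBox
```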

Swag Labs Results:

Swag Labs 结果:

• Total Tests Executed: 10

• 执行的总测试数:10

• Success Rate: 90%
• Observations:

• 成功率:90%
• 观察结果:

MediBox Results:

MediBox 结果:

• Total Tests Executed: 10

• 执行的测试总数:10

• Success Rate: 70%

• 成功率:70%

• Observations:

• 观察:

– 7 test cases passed successfully, including critical validations for email format, mobile number uniqueness, and password matching.

– 7 个测试用例成功通过,包括对电子邮件格式、手机号唯一性和密码匹配的关键验证。

5.1.2 Relevance of Test Cases Based on Instructions

5.1.2 基于指令的测试用例相关性

Swag Labs:

Swag Labs:

• Instructions: Fetch all input fields, buttons, and labels from the homepage of Sauce Demo (https://www.saucedemo.com/) and provide the details for each element, including their id, class, and attributes. Generate a detailed test plan for logging in to the Sauce Demo application using valid credentials (standard_user/secret_sauce) and invalid credentials. Include preconditions, steps, expected outcomes, and priority for each test. Create a Selenium test script in Python to automate the login functionality of Sauce Demo. The script should test both successful and failed login attempts and validate error messages. Execute the generated Selenium test script for Sauce Demo and provide a detailed test summary, including pass/fail results, screenshots of failures, and execution time for each test.

• 指令:从Sauce Demo的主页(https://www.saucedemo.com/)获取所有输入字段、按钮和标签,并提供每个元素的详细信息,包括它们的id、class和属性。生成一个详细的测试计划,用于使用有效凭证(standard_user/secret_sauce)和无效凭证登录Sauce Demo应用程序。包括每个测试的前提条件、步骤、预期结果和优先级。创建一个Python语言的Selenium测试脚本,用于自动化Sauce Demo的登录功能。该脚本应测试成功和失败的登录尝试,并验证错误消息。执行生成的Selenium测试脚本,并提供详细的测试摘要,包括通过/失败结果、失败的截图以及每个测试的执行时间。

• Test cases directly addressed the login functionality’s critical scenarios, confirming adherence to instructional goals.

• 测试用例直接针对登录功能的关键场景,确认其符合指令目标。

• Coverage included variations in credentials, field emptiness, and user-specific conditions.

• 覆盖范围包括凭证的变体、字段为空以及用户特定条件。

Medibox:

Medibox:

• Instructions: Create and execute a minimum of 10 functional test scripts specifically for the user signup process.

• 指令:创建并执行至少10个专门用于用户注册流程的功能测试脚本。

• Deployed on our experiment environment
• User Signup Endpoint URL: /UserSignup

• 部署在我们的实验环境中
• 用户注册端点 URL: /UserSignup

• Ensure the test cases cover a wide range of scenarios for comprehensive validation.

• 确保测试用例涵盖广泛的场景,以进行全面验证。

Details:

详情:

• Test cases were comprehensive, targeting user registration with field validations, navigation checks, and logical consistency.
• Generated scenarios matched the specified instructions for ensuring accurate and secure user onboarding.

• 测试用例全面,针对用户注册的字段验证、导航检查和逻辑一致性。
• 生成的场景符合确保准确和安全用户注册的指定指令。

5.1.3 Relevance of Test Cases for the Web Application

5.1.3 Web 应用程序测试用例的相关性

Swag Labs:

Swag Labs:

• Structural insights from the DOM allowed the generation of contextually relevant test cases tailored to the application's architecture.
• Example: Dynamic field validations and error feedback mechanisms were precisely targeted.

• 从 DOM 中获得的结构洞察使我们能够生成与应用程序架构相关的上下文测试用例。
• 示例:动态字段验证和错误反馈机制被精确地定位。

MediBox:

MediBox:

5.1.4 Test Data Mappings Relevance

5.1.4 测试数据映射的相关性

Swag Labs:

Swag Labs:

• Mapped data consistently reflected the application's structural metadata, enabling accurate input-output scenarios.
• Example: User-specific identifiers like locked_out_user and problem_user ensured realistic validation.

• 映射的数据始终反映应用程序的结构元数据,确保了准确的输入输出场景。
• 示例:诸如 locked_out_user 和 problem_user 之类的用户特定标识符确保了真实的验证。

MediBox:

MediBox:

5.1.5 Synthetic Data Generation Contextual Relevance

5.1.5 合成数据生成的上下文相关性

Swag Labs:

Swag Labs:

MediBox:

MediBox:

Contextual relevance was evident in the generated data for the email, mobile number, and password fields. Observations:
– Password strength validation scenarios benefited significantly from the synthesized data.

在生成的电子邮件、手机号码和密码字段数据中,上下文相关性显而易见。观察结果:
– 密码强度验证场景从合成数据中受益匪浅。

Although not directly related to the objective of this specific research, we also tracked the time taken for each of these activities, because in real-world projects speed of execution is highly important and provides significant savings at scale. We concluded that once the initial setup and the configuration of the requirements, instructions, etc. are complete, this approach provides significant time savings (upwards of 50%) compared to a traditional test automation approach. The saving is more pronounced in the maintenance phase of the project and becomes significant as the software application scales.

虽然这些活动与本研究的具体目标没有直接关系,但我们记录了每项活动所花费的时间,因为在实际项目中,执行速度至关重要,这种方法在大规模应用中能显著节省时间。我们得出的结论是,一旦完成初始设置以及需求、指令等的配置,与传统测试自动化方法相比,这种方法可以显著节省时间(超过 50%)。这种节省在项目的维护阶段更为明显,并且随着软件应用的扩展而变得更加显著。

Table 4: Summary Table of Results

表 4: 结果汇总表

标准 Swag Labs 成功率 MediBox 成功率
测试用例执行成功率 90% 70%
指令相关性
网络应用相关性
数据映射相关性
合成数据上下文相关性

6 Future Work and Conclusion

6 未来工作与结论

Our research demonstrates the successful application of LLMs' few-shot learning capabilities in automated test script generation and execution, despite using models not specifically trained for the functional testing domain. The results validate our approach to hierarchical web application representation while also highlighting areas for future enhancement. From this point onward, towards the goal of a computational model of quality engineering, two primary directions emerge for future research. First, we plan to fine-tune LLMs specifically for the test automation domain using curated datasets. This specialized training should address current limitations in handling large context windows and input sizes for individual requests. Second, we propose developing a knowledge graph architecture to store and efficiently retrieve element-level data at run-time, potentially reducing the input size required for page-level test case generation.

Although our solution demonstrates robust performance across most web applications in generating quality test suites, we acknowledge certain limitations. Applications with exceptionally large DOM structures can challenge our algorithm's ability to maintain hierarchical relationships. Furthermore, the substantial input sizes required by our representation method may lead to increased costs when using commercial LLM services, although the benefits of improved test coverage and maintenance efficiency will often justify this investment. The methodology's strength lies in its ability to enable LLMs to comprehend and interact with web application structures through in-context learning, producing contextually relevant test cases. Looking ahead, we believe that the integration of domain-specific fine-tuning and knowledge-graph-based element retrieval will significantly mitigate current limitations, further enhancing the scalability and cost-effectiveness of our approach.
These improvements will pave the way for more efficient and autonomous quality engineering processes in enterprise web applications.

我们的研究表明,尽管使用的模型并未专门针对功能测试领域进行训练,但大语言模型 (LLM) 的少样本学习能力在自动化测试脚本生成和执行中得到了成功应用。结果验证了我们在分层 Web 应用程序表示方面的方法,同时也指出了未来需要改进的领域。从这一点出发,朝着质量工程计算模型的目标,未来的研究有两个主要方向。首先,我们计划使用精选的数据集专门为测试自动化领域微调大语言模型。这种专门的训练应解决当前在处理大上下文窗口和单个请求的输入大小方面的限制。其次,我们建议开发一种知识图谱架构,以在运行时存储和高效检索元素级数据,从而可能减少页面级测试用例生成所需的输入大小。尽管我们的解决方案在生成高质量测试套件方面在大多数 Web 应用程序中表现出色,但我们承认存在某些限制。具有异常大 DOM 结构的应用程序可能会挑战我们算法维护分层关系的能力。此外,我们的表示方法所需的较大输入大小可能会增加使用商业大语言模型服务的成本,尽管改进的测试覆盖率和维护效率的好处通常会证明这一投资的合理性。该方法的力量在于它能够通过上下文学习使大语言模型理解并与 Web 应用程序结构交互,从而生成上下文相关的测试用例。展望未来,我们相信,结合领域特定的微调和基于知识图谱的元素检索将显著缓解当前的限制,进一步提高我们方法的可扩展性和成本效益。这些改进将为企业 Web 应用程序中更高效和自主的质量工程流程铺平道路。
