[论文翻译]利用大语言模型高效表示企业Web应用程序结构以服务于智能质量工程


原文地址:https://arxiv.org/pdf/2501.06837


AN EFFICIENT APPROACH TO REPRESENT ENTERPRISE WEB APPLICATION STRUCTURE USING LARGE LANGUAGE MODEL IN THE SERVICE OF INTELLIGENT QUALITY ENGINEERING

利用大语言模型高效表示企业Web应用程序结构以服务于智能质量工程

ABSTRACT

摘要

This paper presents a novel approach to represent enterprise web application structures using Large Language Models (LLMs) to enable intelligent quality engineering at scale. We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs while preserving the complex relationships and interactions within web applications. The approach encompasses five key phases: comprehensive DOM analysis, multi-page synthesis, test suite generation, execution, and result analysis. Our methodology addresses existing challenges around the usage of Generative AI techniques in automated software testing by developing a structured format that enables LLMs to understand web application architecture through in-context learning. We evaluated our approach using two distinct web applications: an e-commerce platform (Swag Labs) and a healthcare application (MediBox), which is deployed within the Atalgo engineering environment. The results demonstrate success rates of $90\%$ and $70\%$, respectively, in achieving automated testing, with high relevance scores for test cases across multiple evaluation criteria. The findings suggest that our representation approach significantly enhances LLMs' ability to generate contextually relevant test cases and provide better quality assurance overall, while reducing the time and effort required for testing.

本文提出了一种利用大语言模型 (LLMs) 表示企业 Web 应用程序结构的新方法,以实现大规模智能质量工程。我们引入了一种分层表示方法,该方法优化了 LLMs 的少样本学习能力,同时保留了 Web 应用程序中的复杂关系和交互。该方法包括五个关键阶段:全面的 DOM 分析、多页面合成、测试套件生成、执行和结果分析。我们的方法通过开发一种结构化格式,使 LLMs 能够通过上下文学习理解 Web 应用程序架构,从而解决了在自动化软件测试中使用生成式 AI 技术的现有挑战。我们使用两个不同的 Web 应用程序评估了我们的方法:一个电子商务平台 (Swag Labs) 和一个部署在 Atalgo 工程环境中的医疗保健应用程序 (MediBox)。结果表明,在实现自动化测试方面,成功率分别为 $90\%$ 和 $70\%$,并且在多个评估标准中测试用例的相关性得分较高。研究结果表明,我们的表示方法显著增强了 LLMs 生成上下文相关测试用例的能力,并提供了更好的整体质量保证,同时减少了测试所需的时间和精力。

Keywords Large Language Model (LLM) $\cdot$ In-Context Learning $\cdot$ Document Object Model (DOM) $\cdot$ Generative AI $\cdot$ Hierarchical Representation $\cdot$ Enterprise Test Automation $\cdot$ Intelligent Quality Engineering

关键词 大语言模型 (LLM) $\cdot$ 上下文学习 $\cdot$ 文档对象模型 (DOM) $\cdot$ 生成式 AI (Generative AI) $\cdot$ 层次表示 $\cdot$ 企业测试自动化 $\cdot$ 智能质量工程

1 Introduction

1 引言

Enterprise web applications constitute the fundamental infrastructure that orchestrates intricate organizational processes and facilitates multiple user interactions fulfilling these processes [1]. These applications have transcended their traditional role as mere digital interfaces to become critical determinants of operational excellence and market competitiveness for an enterprise. The exponential growth in application complexity, coupled with heightened user expectations, has elevated quality engineering (QE) to a position of paramount significance in the software development lifecycle. This emerging paradigm encompasses a comprehensive framework of methodologies and practices designed to ensure robust functionality, seamless scalability, and sustainable maintainability of enterprise web applications. Within this context, the accurate representation and analysis of architectural intricacies in enterprise web applications have emerged as crucial elements in achieving holistic quality assurance objectives. When it comes to a formal computational model of quality engineering processes, a precise and formal structural representation of an enterprise application becomes critical. There is a strong correlation between accurate architectural comprehension and an autonomous system's ability to provide an acceptable level of quality assurance [2]. The evolution of enterprise web applications from monolithic structures to complex, distributed systems has created an imperative for more sophisticated methods of automated structural analysis before applying quality assurance processes. Software testing represents a critical aspect of the software engineering lifecycle, serving as a determinant of system reliability, functional integrity, and the overall quality of the software that will eventually support key processes of a business at scale.
As contemporary software systems grow increasingly sophisticated, the imperative for comprehensive testing methodologies becomes progressively more pronounced. Beyond the conventional paradigm of defect identification and vulnerability remediation, robust testing frameworks validate system compliance across diverse operational environments while ensuring adherence to specified requirements. This multifaceted approach serves as a crucial safeguard against system failures, security breaches, and compromised user experiences. The evolution of AI-augmented intelligent quality engineering practices transcends mere defect detection, managing the entire lifecycle independently and autonomously [3]. The recent applications of AI technologies have created significant interest from the research as well as the business community to explore how a formal and computational model of quality engineering can provide more efficient quality assurance [4]. Recent AI algorithms and technology stacks are showing substantial promise for the testing ecosystem through automation capabilities, adaptive learning mechanisms, and predictive analytics. This synergy between AI and testing methodologies has a significant potential impact on traditional approaches by optimizing test procedures, minimizing manual intervention, and enabling intelligent test case generation coupled with automated anomaly detection systems. Natural Language Processing (NLP), a specialized domain within AI, has initiated a transformation in software testing practices by introducing sophisticated linguistic comprehension capabilities to the testing environment [5]. Generative AI based methodologies enable the formal interpretation of requirements written in natural language or a semi-formal format [6]. This facilitates context-aware testing strategies, which are a paradigm shift from traditional testing approaches.
By incorporating semantic understanding into the testing framework, the new approach enables the intelligent design of test cases based on specifications and user feedback analysis autonomously. This not only expedites the identification of ambiguities and inconsistencies but also frees up the time of the software development team to focus on other areas [7].

企业 Web 应用程序构成了协调复杂组织流程并促进多个用户交互以满足这些流程的基础设施 [1]。这些应用程序已经超越了其作为数字接口的传统角色,成为企业运营卓越性和市场竞争力的关键决定因素。应用程序复杂性的指数级增长,加上用户期望的提高,使质量工程 (QE) 在软件开发生命周期中占据了至关重要的地位。这一新兴范式包含了一套全面的方法论和实践框架,旨在确保企业 Web 应用程序的稳健功能、无缝可扩展性和可持续的可维护性。在此背景下,企业 Web 应用程序中架构复杂性的准确表示和分析已成为实现全面质量保证目标的关键要素。在质量工程过程的正式计算模型中,企业应用程序的精确和正式的结构表示变得至关重要。准确的架构理解与自主系统提供可接受质量保证水平的能力之间存在强相关性 [2]。企业 Web 应用程序从单体结构向复杂分布式系统的演变,使得在应用质量保证过程之前,需要更复杂的自动化结构分析方法。软件测试代表了软件工程生命周期中的一个关键方面,作为系统可靠性、功能完整性和最终支持大规模业务关键流程的软件整体质量的决定因素。随着当代软件系统变得越来越复杂,全面测试方法的需求变得越来越明显。除了传统的缺陷识别和漏洞修复范式之外,稳健的测试框架还验证了系统在不同操作环境中的合规性,同时确保遵守指定的要求。这种多方面的方法作为防止系统故障、安全漏洞和用户体验受损的关键保障。AI 增强的智能质量工程实践的演变超越了单纯的缺陷检测,独立自主地管理整个生命周期 [3]。最近 AI 技术的应用引起了研究和商业界的极大兴趣,探索质量工程的正式计算模型如何提供更高效的质量保证 [4]。最近的 AI 算法和技术栈通过自动化能力、自适应学习机制和预测分析,为测试生态系统展示了巨大的前景。AI 与测试方法之间的协同作用对传统方法具有显著的潜在影响,通过优化测试程序、最小化手动干预以及实现智能测试用例生成与自动异常检测系统的结合。自然语言处理 (NLP) 作为 AI 中的一个专门领域,通过向测试环境引入复杂的语言理解能力,开启了软件测试实践的变革 [5]。基于生成式 AI 的方法使得能够正式解释以自然语言或半正式格式编写的需求 [6]。这促进了上下文感知的测试策略,这是对传统测试方法的范式转变。通过将语义理解纳入测试框架,新方法能够基于规范和用户反馈分析自主设计智能测试用例。这不仅加速了歧义和不一致性的识别,还释放了软件开发团队的时间,使其能够专注于其他领域 [7]。

Nevertheless, through our research work as part of developing a computational intelligence-based quality engineering platform, we found that the task of effectively capturing and representing the intricate architectural patterns and interaction paradigms within enterprise web applications presents formidable challenges. The heterogeneous nature of development frameworks, technological stacks, and architectural design patterns introduces substantial complexity in modeling application structures with the level of precision we need for such systems to work autonomously. Contemporary methodologies frequently demonstrate limitations in providing dynamic and adaptive representations, thus constraining their efficacy in addressing the sophisticated requirements of modern enterprise systems as system changes occur periodically. This introduces growing challenges in the quality assurance area, as existing testing scripts might break and provide false results that might lead to false assumptions and failure to detect crucial bugs in software systems as they get updated periodically [8]. NLP-based solutions for test automation tried to solve this problem with self-healing. Self-healing approaches were effective for small changes in a software system but failed when massive changes were introduced [9]. Also, NLP-based solutions helped only with test script maintenance but not with initial test script generation or test result reporting. The reason behind this was the lack of overall system understanding in NLP-based systems [10]. The emergence of Large Language Models (LLMs), particularly the Generative Pre-trained Transformer (GPT) architecture, has presented us with unprecedented opportunities within the software engineering domain [11]. These models exhibit exceptional prowess in natural language processing and demonstrate remarkable capabilities in generating contextually pertinent and semantically enriched content.
Such capabilities can be strategically leveraged to autonomously generate comprehensive representations of web application structures through the systematic interpretation of user interactions, technical documentation, and source code artifacts. The integration of LLMs enables the creation of intelligent, adaptive representations that can serve as a robust foundation for quality engineering initiatives, encompassing automated testing frameworks, anomaly detection mechanisms, and performance optimization strategies [12]. This research presents a novel methodological framework for representing enterprise web application structures through the strategic deployment of LLMs. Our proposed approach transcends traditional static modeling by not only capturing fundamental elements such as page hierarchies and navigation patterns but also dynamically modeling complex user interactions and system behaviors [13]. The integration of LLM-driven representation mechanisms into established quality engineering workflows offers organizations substantial opportunities for operational efficiency enhancement and quality assurance optimization.

然而,作为我们开发基于计算智能的质量工程平台的研究工作的一部分,我们发现,有效捕捉和表示企业Web应用程序中复杂的架构模式和交互范式任务面临着巨大的挑战。开发框架、技术栈和架构设计模式的异质性在建模应用结构时引入了显著的复杂性,而我们需要这些系统能够自主工作。当代方法在提供动态和自适应表示方面经常表现出局限性,从而限制了它们在满足现代企业系统复杂需求方面的效能,尤其是在系统周期性变化时。这给质量保证领域带来了日益增长的挑战,因为现有的测试脚本可能会失效并提供错误的结果,这可能导致错误的假设和未能检测到软件系统中的关键错误,尤其是在系统周期性更新时 [8]。基于自然语言处理(NLP)的测试自动化解决方案试图通过自愈来解决这个问题。自愈方法在软件系统发生小变化时有效,但在软件系统发生大规模变化时失效 [9]。此外,基于NLP的解决方案仅在测试脚本维护方面有所帮助,而在初始测试脚本生成或测试结果报告方面并无帮助。其背后的原因是基于NLP的系统缺乏对整体系统的理解 [10]。大语言模型(LLMs)的出现,特别是生成式预训练Transformer(GPT)架构,为我们在软件工程领域提供了前所未有的机会 [11]。这些模型在自然语言处理方面表现出卓越的能力,并在生成上下文相关且语义丰富的内容方面展示了显著的能力。这些能力可以战略性地用于通过系统解释用户交互、技术文档和源代码工件来自主生成Web应用程序结构的全面表示。LLMs的集成使得创建智能、自适应的表示成为可能,这些表示可以作为质量工程计划的坚实基础,涵盖自动化测试框架、异常检测机制和性能优化策略 [12]。本研究提出了一种通过战略部署LLMs来表示企业Web应用程序结构的新方法框架。我们提出的方法超越了传统的静态建模,不仅捕捉了页面层次结构和导航模式等基本元素,还动态建模了复杂的用户交互和系统行为 [13]。将LLM驱动的表示机制集成到现有的质量工程工作流程中,为组织提供了增强运营效率和优化质量保证的实质性机会。

This paper presents an efficient approach to leveraging Large Language Models for representing enterprise web application structures, specifically focusing on their application in quality engineering [13]. Our approach addresses several key challenges identified in current literature: the need for dynamic representation of application structure, the ability to capture complex relationships between components, the hierarchical flow of navigation elements, and the integration of this representation into existing quality engineering processes. By utilizing LLMs' natural language understanding capabilities, we propose a method that bridges the gap between human understanding and machine-processable representation of web application architectures. Thus, in this research, we have showcased all aspects of a software testing process for a web application: test case generation [14], test script maintenance, and reporting of the test results. The significance of this research lies in its potential to enhance quality engineering practices by effectively employing generative AI based computational intelligence through improved understanding and representation of enterprise web applications. Our approach not only addresses the limitations of current methods but also provides a foundation for more intelligent and adaptive quality engineering processes in the context of modern web development.

本文提出了一种利用大语言模型(Large Language Model, LLM)表示企业Web应用结构的有效方法,特别关注其在质量工程中的应用 [13]。我们的方法解决了当前文献中提出的几个关键挑战:应用结构的动态表示需求、捕捉组件间复杂关系的能力、导航元素的层次流以及将该表示集成到现有质量工程流程中。通过利用LLM的自然语言理解能力,我们提出了一种弥合人类理解与机器可处理Web应用架构表示之间差距的方法。因此,在本研究中,我们展示了Web应用软件测试过程的所有方面:测试用例生成 [14]、测试脚本维护和测试结果报告。本研究的意义在于,通过基于生成式AI(Generative AI)的计算智能,有效提升对企业Web应用的理解和表示,从而增强质量工程实践。我们的方法不仅解决了当前方法的局限性,还为现代Web开发背景下更智能和自适应的质量工程流程奠定了基础。

2 Background and Related Work

2 背景与相关工作

Recent years have witnessed growing research interest in converting natural language requirements automatically into functional test scripts, driven by the increasing adoption of agile development and continuous integration practices. This review examines key developments in the field, with a particular focus on systematic analyses, approaches, and automation tools that enable requirements to be transformed into executable test scripts. The systematic literature review by Mustafa et al. [15] represents seminal work in this domain. Their analysis presents a structured overview of different automated test generation approaches derived from requirements specifications. A key finding emphasizes how test generation approaches must be carefully matched to handle the inherent characteristics of requirements, including their potential ambiguity and incompleteness. Building on this foundation, some researchers [16] developed an extensive classification system for requirement-based test automation techniques, while also identifying critical research gaps and obstacles that need to be addressed to improve these methods' efficacy. Both research efforts highlight the critical need to develop sophisticated frameworks capable of accurately processing natural language requirements and producing corresponding test scripts. Taking a more practical perspective, Chrysalidis et al. developed an innovative semi-automated toolchain system that converts natural language requirements into executable code specifically for flight control platforms [17]. Their work illustrates how domain knowledge can be effectively combined with automation to streamline test script creation. The system enables engineers to structure requirements in modules that can then be automatically transformed into executable tests, providing a concrete implementation of theoretical concepts outlined in systematic reviews.
The research by Koroglu & Şen breaks new ground by applying reinforcement learning techniques to generate functional tests from UI test scenarios written in human-friendly languages like Gherkin [18]. Their approach tackles the complex challenge of converting high-level declarative requirements into concrete test scripts, effectively connecting natural language specifications with automated testing frameworks. Their research suggests machine learning can substantially improve both the precision and efficiency of test script generation. The literature also emphasizes the importance of automation tools like Selenium, with Rusdiansyah outlining optimal practices for web testing using Selenium, highlighting its capabilities for automating user interactions and enhancing test accuracy [19]. This aligns with Zasornova's observations that automated testing enables improved efficiency and faster detection of defects - crucial factors in contemporary software development [20]. Yutia presents an alternative approach through keyword-driven frameworks for automated functional testing. This methodology enables testers to develop succinct, adaptable test cases - particularly valuable when working with evolving natural language requirements [21]. The previous works on automated test script generation demonstrate consistent efforts to advance script generation from natural language requirements. The combination of systematic reviews, practical toolchains, and sophisticated approaches including reinforcement learning and keyword-driven frameworks shows a comprehensive strategy for addressing industry challenges. Malik et al. [22] made significant contributions through their work on automating test oracles from restricted natural language agile requirements. Their proposed Restricted Natural Language Agile Requirements Testing (ReNaLART) methodology employs structured templates to facilitate test case generation.
This research demonstrates how Large Language Models can effectively interpret and transform natural language requirements into functional test scripts, addressing inherent language ambiguities. Additionally, Kıcı et al. [23] investigated using BERT-based transfer learning to classify software requirements specifications. Their research reveals that Large Language Models can substantially enhance requirements understanding through classification, which subsequently aids in automated test script generation. Their findings emphasize the potential impact of Large Language Models in improving both accuracy and efficiency in test script generation processes. Raharjana et al. [24] conducted a systematic literature review examining how user stories and NLP techniques are utilized in agile software development. Their analysis demonstrates how NLP enhanced by Large Language Models can extract testable requirements from user stories for conversion into automated test scripts, reflecting the increasing adoption of LLMs to connect natural language requirements with automated testing frameworks. Liu et al. [25] introduced MuFBDTester, an innovative mutation-based system for generating test sequences for function block diagram programs. While their work centers on mutation testing, the core principles of generating test sequences from specifications can be enhanced through LLM integration. The ability of LLMs to parse complex specifications enables more efficient test sequence generation aligned with intended software functionality. Complementing this research, related work explores automated testing challenges in complex software systems, suggesting that LLMs could play a vital role in generating test cases that accurately reflect user requirements and enhance testing processes. Large Language Models like GPT-3 and its successors have shown exceptional capability in natural language understanding and generation, making them ideal for automated test script creation.
Ayenew explores NLP’s potential for automated test case generation from software requirements, emphasizing the efficiency benefits of automation [26]. These findings align with Leotta et al., who demonstrate that NLP-based test automation tools can dramatically reduce test case creation time, making testing more accessible to professionals without extensive programming expertise [27]. Wang et al. present an NLP-driven approach for generating acceptance test cases from use case specifications, showing how recent NLP advances facilitate test scenario identification and formal constraint generation, thereby improving the accuracy of test scripts derived from natural language requirements [28]. Beyond LLMs, researchers have explored various NLP applications in report generation across different domains. Chillakuru et al. examine NLP’s role in automating neuroradiology MRI protocols, demonstrating its ability to convert unstructured text into structured reports [29]. This capability translates well to software testing, where NLP can extract test scenarios from natural language requirements to streamline report generation. Bae et al. further demonstrate NLP’s versatility in automatically extracting quality indicators from free-text reports, a capability that can ensure generated test scripts align with quality standards specified in natural language requirements [30]. Similarly, Tignanelli et al. highlight the use of NLP techniques to automate the characterization of treatment appropriateness in emergency medical services, emphasizing the potential for NLP to enhance the quality and relevance of automated reports in various domains [31]. Despite these advancements, several research gaps remain in the application of LLMs and NLP techniques for functional test automation. One significant gap is the need for empirical validation of the effectiveness of NLP-based tools compared to traditional testing methods. Leotta et al.
note that while many NLP-based tools have been introduced, their superiority has not been rigorously tested in practice [32]. Additionally, there is a lack of comprehensive frameworks that integrate LLMs with existing testing methodologies, which could facilitate a more seamless transition from natural language requirements to automated test scripts. Furthermore, the challenge of handling ambiguous or incomplete requirements in natural language remains a critical issue. While LLMs have shown promise in interpreting complex text, the variability in natural language can lead to misinterpretations that affect the quality of generated test scripts. Research by Jen-tse et al. indicates that many generated test cases may not preserve the intended semantic meaning, leading to high false alarm rates. Addressing these challenges through improved training methodologies and more robust NLP techniques is essential for enhancing the reliability of automated test generation [32].

近年来,随着敏捷开发和持续集成实践的日益普及,将自然语言需求自动转换为功能测试脚本的研究兴趣逐渐增长。本文回顾了该领域的关键发展,特别关注能够将需求转化为可执行测试脚本的系统分析、方法和自动化工具。Mustafa 等人 [15] 的系统文献综述是该领域的开创性工作。他们的分析提供了从需求规范中衍生出的不同自动化测试生成方法的结构化概述。一个关键发现强调了测试生成方法必须仔细匹配以处理需求的固有特性,包括其潜在的模糊性和不完整性。在此基础上,一些研究人员 [16] 开发了一个基于需求的测试自动化技术的广泛分类系统,同时确定了需要解决的关键研究差距和障碍,以提高这些方法的有效性。这两项研究都强调了开发能够准确处理自然语言需求并生成相应测试脚本的复杂框架的迫切需求。

从更实际的角度出发,Chrysalidis 等人开发了一种创新的半自动化工具链系统,将自然语言需求转换为专门用于飞行控制平台的可执行代码 [17]。他们的工作展示了如何有效地将领域知识与自动化相结合,以简化测试脚本的创建。该系统使工程师能够将需求模块化,然后自动转换为可执行测试,提供了系统综述中概述的理论概念的具体实现。Koroglu 和 Şen 的研究通过应用强化学习技术从用 Gherkin 等人性化语言编写的 UI 测试场景中生成功能测试,开辟了新天地 [18]。他们的方法解决了将高级声明性需求转换为具体测试脚本的复杂挑战,有效地将自然语言规范与自动化测试框架连接起来。他们的研究表明,机器学习可以显著提高测试脚本生成的精度和效率。

文献还强调了 Selenium 等自动化工具的重要性,Rusdiansyah 概述了使用 Selenium 进行 Web 测试的最佳实践,强调了其在自动化用户交互和提高测试准确性方面的能力 [19]。这与 Zasornova 的观察一致,即自动化测试能够提高效率并更快地检测缺陷——这是当代软件开发中的关键因素 [20]。Yutia 提出了一种通过关键字驱动框架进行自动化功能测试的替代方法。这种方法使测试人员能够开发简洁、适应性强的测试用例——在处理不断变化的自然语言需求时特别有价值 [21]。

先前关于自动化测试脚本生成的研究表明,从自然语言需求中推进脚本生成的一致努力。系统综述、实用工具链以及包括强化学习和关键字驱动框架在内的复杂方法的结合,展示了解决行业挑战的全面策略。Malik 等人 [22] 通过他们的工作为从受限自然语言敏捷需求中自动化测试预言做出了重要贡献。他们提出的受限自然语言敏捷需求测试 (ReNaLART) 方法采用结构化模板来促进测试用例生成。这项研究展示了大语言模型如何有效地解释自然语言需求并将其转换为功能测试脚本,解决了固有的语言模糊性。此外,Kıcı 等人 [23] 研究了使用基于 BERT 的迁移学习对软件需求规范进行分类。他们的研究表明,大语言模型可以通过分类显著增强需求理解,从而有助于自动化测试脚本生成。他们的发现强调了大语言模型在提高测试脚本生成过程的准确性和效率方面的潜在影响。

Raharjana 等人 [24] 进行了一项系统文献综述,研究了用户故事和 NLP 技术在敏捷软件开发中的应用。他们的分析表明,通过大语言模型增强的 NLP 如何从用户故事中提取可测试需求,并将其转换为自动化测试脚本,反映了大语言模型在连接自然语言需求与自动化测试框架方面的日益普及。Liu 等人 [25] 引入了 MuFBD Tester,这是一种创新的基于突变的系统,用于生成功能块图程序的测试序列。虽然他们的工作集中在突变测试上,但通过大语言模型集成可以增强从规范生成测试序列的核心原则。大语言模型解析复杂规范的能力使得能够更高效地生成与预期软件功能一致的测试序列。

补充这项研究的是,探索了复杂软件系统中的自动化测试挑战,表明大语言模型可以在生成准确反映用户需求并增强测试过程的测试用例中发挥重要作用。像 GPT-3 及其后继者这样的大语言模型在自然语言理解和生成方面表现出色,使其成为自动化测试脚本创建的理想选择。Ayenew 探索了 NLP 从软件需求中自动生成测试用例的潜力,强调了自动化带来的效率优势 [26]。这些发现与 Leotta 等人一致,他们展示了基于 NLP 的测试自动化工具可以显著减少测试用例创建时间,使测试对没有广泛编程专业知识的人员更加易于访问 [27]。Wang 等人提出了一种基于 NLP 的方法,从用例规范中生成验收测试用例,展示了最近的 NLP 进展如何促进测试场景识别和形式约束生成,从而提高从自然语言需求中衍生的测试脚本的准确性 [28]。

除了大语言模型外,研究人员还探索了 NLP 在不同领域报告生成中的各种应用。Chillakuru 等人研究了 NLP 在自动化神经放射学 MRI 协议中的作用,展示了其将非结构化文本转换为结构化报告的能力 [29]。这种能力很好地转化为软件测试,NLP 可以从自然语言需求中提取测试场景,以简化报告生成。Bae 等人进一步展示了 NLP 在从自由文本报告中自动提取质量指标方面的多功能性,这种能力可以确保生成的测试脚本符合自然语言需求中指定的质量标准 [30]。同样,Tignanelli 等人强调了使用 NLP 技术自动化急诊医疗服务中治疗适当性特征描述的使用,强调了 NLP 在提高各个领域自动化报告质量和相关性方面的潜力 [31]。

尽管取得了这些进展,但在将大语言模型和 NLP 技术应用于功能测试自动化方面仍存在一些研究差距。一个显著的差距是需要对基于 NLP 的工具与传统测试方法的有效性进行实证验证。Leotta 等人指出,虽然已经引入了许多基于 NLP 的工具,但它们的优越性尚未在实践中得到严格测试 [32]。此外,缺乏将大语言模型与现有测试方法相结合的全面框架,这可以促进从自然语言需求到自动化测试脚本的更无缝过渡。此外,处理自然语言中模糊或不完整需求的挑战仍然是一个关键问题。虽然大语言模型在解释复杂文本方面表现出色,但自然语言的变异性可能导致误解,从而影响生成的测试脚本的质量。Jen-Tse 等人的研究表明,许多生成的测试用例可能无法保留预期的语义含义,导致高误报率。通过改进训练方法和更强大的 NLP 技术来解决这些挑战,对于提高自动化测试生成的可靠性至关重要 [32]。

3 Methodology

3 方法论

Some applications of Large Language Models have been effective in understanding natural language. However, challenges remain with respect to the amount of data that can be used as context to leverage the few-shot learning of LLMs. This makes the usage of LLMs in specialist domains such as test automation quite challenging. One of the solutions is to fine-tune the LLM on a large amount of data related to a specific field, which may result in better reasoning, but then again, it is a time-consuming and costly process and does not work well for dynamic data. For the specific application of enterprise test automation, in-context learning is the best approach. This is particularly useful for feeding large web DOM structures to an LLM for automation script generation, overall site sense-making, and extracting important insights from the website. After struggling with the limitations of in-context learning and trying a few approaches such as chunking, we have developed a novel approach to express the overall site structure so that the hierarchy remains intact. This is a critical element in our intelligent quality engineering solution. This research introduces an innovative methodology to construct a hierarchical structural representation of enterprise web applications, optimized for few-shot learning in large language models (LLMs). The approach leverages state-of-the-art functional test automation principles to ensure scalability, modularity, and enhanced contextual understanding. The proposed methodology is divided into five phases, each targeting a specific aspect of web application analysis and representation.

大语言模型在理解自然语言方面的一些应用已经取得了成效。然而,在可用于上下文的数据量方面仍存在挑战,这限制了利用大语言模型的少样本学习能力。这使得大语言模型在测试自动化等专业领域的使用变得相当具有挑战性。其中一个解决方案是对大语言模型进行微调,使其适应特定领域的大量数据,这可能会带来更好的推理能力,但这一过程耗时且成本高昂,并且对动态数据的处理效果不佳。对于企业测试自动化的特定应用,上下文学习是最佳方法。这对于将大型网页 DOM 结构输入大语言模型以生成自动化脚本、整体网站理解以及从网站中获取重要见解特别有用。在经历了上下文学习的局限性并尝试了分块等几种方法后,我们开发了一种新颖的方法来表达整体网站结构,以保持层次结构的完整性。这是我们智能质量工程解决方案中的一个关键要素。本研究介绍了一种创新的方法,用于构建企业 Web 应用程序的层次结构表示,该表示针对大语言模型的少样本学习进行了优化。该方法利用最先进的功能测试自动化原则,以确保可扩展性、模块化和增强的上下文理解。所提出的方法分为五个阶段,每个阶段针对 Web 应用程序分析和表示的特定方面。

3.1 Phase 1: Comprehensive DOM Analysis and Data Structuring

3.1 第一阶段:全面DOM分析与数据结构化

The first phase involves a comprehensive analysis of the Document Object Model (DOM) of the target web application. Utilizing a custom scraping agent, the methodology extracts all interactive and non-interactive elements from every page of the application, starting from the base URL. Key features of this phase include:

第一阶段涉及对目标 Web 应用程序的文档对象模型 (Document Object Model, DOM) 进行全面分析。该方法利用自定义的抓取代理,从应用程序的基 URL 开始,提取每个页面中的所有交互和非交互元素。此阶段的关键特征包括:

The extracted information is encapsulated into structured representations to facilitate downstream processing. This structuring preserves the integrity of the application’s element hierarchy and contextual relationships, ensuring compatibility with LLM-based reasoning.

提取的信息被封装成结构化表示,以便于下游处理。这种结构化保留了应用程序元素层次结构和上下文关系的完整性,确保与基于大语言模型的推理兼容。
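As a minimal sketch of this phase, assuming Python's standard-library `html.parser` in place of the paper's (unspecified) custom scraping agent, interactive and non-interactive elements can be extracted from a page's DOM and encapsulated into structured records:

```python
from html.parser import HTMLParser

# Tags treated as interactive; everything else is recorded as static.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea", "form"}

class DOMExtractor(HTMLParser):
    """Collects a structured record for every element encountered on the page."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.elements.append({
            "tag": tag,
            "id": attrs.get("id"),
            "kind": "interactive" if tag in INTERACTIVE_TAGS else "static",
            "attrs": attrs,
        })

def analyze_page(html):
    parser = DOMExtractor()
    parser.feed(html)
    return parser.elements

# A tiny login page in the spirit of the evaluated Swag Labs application.
page = """
<html><body>
  <h1>Login</h1>
  <form id="login-form">
    <input id="user-name" type="text"/>
    <input id="password" type="password"/>
    <button id="login-button">Login</button>
  </form>
</body></html>
"""

elements = analyze_page(page)
interactive = [e for e in elements if e["kind"] == "interactive"]
```

The same record shape can then carry locator attributes and contextual relationships downstream to the multi-page synthesis phase.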

3.2 Phase 2: Multi-Page Analysis and Site-Wise Synthesis

3.2 第二阶段:多页面分析与站点综合

The methodology extends individual page analysis to a multi-page context by synthesizing relationships across the application. This phase comprises:

该方法通过综合应用程序中的关系,将单个页面分析扩展到多页面上下文。此阶段包括:

The outcome of this phase is a comprehensive site structure that encodes navigational pathways, interactive dependencies, dynamic state transitions, and page types. This hierarchical structure is crafted to maximize LLM interpretability while minimizing input size constraints. Advanced chunking algorithms ensure that the hierarchical integrity is preserved, enabling effective reasoning over multi-layered representations. Our approach of representing the web application in this structured format maximizes outcomes by utilizing the LLM's in-context learning. As our test generation approach is instruction-driven, we can leverage this representation of the web application structure to generate test cases that strictly follow the instructions.

此阶段的成果是一个全面的站点结构,该结构编码了导航路径、交互依赖关系、动态状态转换以及页面类型。这种层次结构旨在最大化大语言模型 (LLM) 的解释能力,同时最小化输入大小的限制。先进的分块算法确保了层次结构的完整性,从而能够在多层表示上进行有效的推理。我们通过这种结构格式来表示 Web 应用程序,利用大语言模型的上下文学习能力来最大化成果。由于我们的测试生成方法是指令驱动的,因此我们可以利用这种 Web 应用程序结构的表示来生成严格遵循指令的测试用例。
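One possible shape for the synthesized site structure, sketched below with illustrative field names (not the paper's exact schema), is a nested dictionary that keeps page types, element inventories, and navigation edges in a single hierarchy and serializes compactly for an LLM prompt:

```python
import json

# Illustrative site structure for Swag Labs; field names are assumptions.
site_structure = {
    "base_url": "https://www.saucedemo.com",
    "pages": {
        "/": {
            "type": "login",
            "elements": ["user-name", "password", "login-button"],
            "links_to": ["/inventory.html"],
        },
        "/inventory.html": {
            "type": "product_list",
            "elements": ["add-to-cart", "shopping-cart-link"],
            "links_to": ["/cart.html", "/"],
        },
        "/cart.html": {
            "type": "cart",
            "elements": ["checkout", "continue-shopping"],
            "links_to": ["/inventory.html"],
        },
    },
}

def navigation_paths(structure, start, depth=2):
    """Enumerate navigational pathways up to `depth` hops, skipping cycles."""
    paths, frontier = [], [[start]]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            for nxt in structure["pages"][path[-1]]["links_to"]:
                if nxt not in path:
                    next_frontier.append(path + [nxt])
                    paths.append(path + [nxt])
        frontier = next_frontier
    return paths

# Compact serialization keeps the hierarchy intact while minimizing prompt size.
compact = json.dumps(site_structure, separators=(",", ":"))
paths = navigation_paths(site_structure, "/")
```

Because the hierarchy survives serialization, navigational pathways can be enumerated either before prompting or by the LLM itself from the same structure.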

3.3 Phase 3: Contextual and Relevant Test Suite Generation and Validation

3.3 阶段 3:上下文相关测试套件的生成与验证

This phase demonstrates the utility of the hierarchical structure through the generation of context-aware test suites. The overall site representation is passed to the LLM in a formatted prompt, iteratively for each page. Each prompt contains information about already generated test cases, extracted URL patterns, the provided instructions, the available elements, and the required test types. The required test types are the combination of the predefined test types, E, and the test types extracted from the instructions, A. The required test types R are given by:

最终阶段通过生成上下文感知的测试套件展示了分层结构的实用性。整个站点的表示以格式化提示的方式迭代传递给大语言模型 (LLM) ,每个页面的提示包含已生成的测试用例、提取的 URL 模式、提供的指令、可用元素以及所需的测试类型等信息。所需的测试类型是预定义测试类型 E 和从指令中提取的测试类型 A 的组合。

$$
R=E\cup A
$$

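The union above maps directly onto a set operation; a trivial Python sketch (the concrete test-type names are illustrative):

```python
# E: predefined test types; A: test types extracted from the user's instruction.
E = {"functional", "navigation", "form_validation"}
A = {"form_validation", "accessibility"}

# R = E ∪ A: the required test types, with duplicates collapsed by the union.
R = E | A
```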

This involves:

这涉及:

  1. Iterative Refinement: Incorporating feedback loops to optimize test quality, ensuring alignment with application behaviour and hierarchical context.
     迭代优化:通过引入反馈循环来优化测试质量,确保与应用程序行为及层次结构上下文的契合。

The generated test suites serve as a practical validation of the methodology’s capability to encapsulate enterprise web application structures in a format conducive to few-shot learning and automated reasoning.

生成的测试套件作为该方法能力的实际验证,能够以有利于少样本学习和自动推理的格式封装企业Web应用程序结构。
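The iterative, per-page prompting described in this phase might be sketched as follows; `fake_llm`, the prompt wording, and the page list are placeholders, not the paper's implementation:

```python
def build_prompt(page, site_repr, generated_so_far, instructions, required_types):
    """Assemble the per-page prompt from the components named in Section 3.3."""
    return (
        f"Site structure:\n{site_repr}\n\n"
        f"Target page: {page}\n"
        f"Already generated test cases: {generated_so_far}\n"
        f"Instructions: {instructions}\n"
        f"Required test types: {sorted(required_types)}\n"
        "Generate test cases that strictly follow the instructions."
    )

def generate_test_suite(pages, site_repr, instructions, required_types, call_llm):
    suite, generated = {}, []
    for page in pages:  # iterate page by page, feeding earlier output back in
        prompt = build_prompt(page, site_repr, generated, instructions, required_types)
        cases = call_llm(prompt)
        suite[page] = cases
        generated.extend(cases)  # feedback loop for iterative refinement
    return suite

# Stub standing in for the real LLM client, for demonstration only.
def fake_llm(prompt):
    return ["verify page loads", "verify required elements are present"]

suite = generate_test_suite(
    ["/", "/cart.html"], '{"pages": {...}}', "focus on checkout",
    {"functional"}, fake_llm,
)
```

Carrying `generated` forward is what lets each later prompt avoid duplicating test cases already produced for earlier pages.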

3.4 Phase 4: Test Suite Execution

3.4 阶段 4: 测试套件执行

After we have generated a semantically rich formal representation of the web application in the phases above, we can execute test cases by driving a test framework. We also need to map specific test data to the test steps under each test case, or alternatively generate synthetic data, to execute autonomous test cases. This phase has two major parts:

在我们通过上述四个阶段生成了具有丰富语义的Web应用程序形式化表示后,我们可以通过驱动测试框架来执行测试用例。我们还需要将特定的测试数据映射到每个测试用例下的测试步骤,或者生成合成数据以执行自主测试用例。此阶段包含两个主要部分:

  1. Test Data Mapping or Synthetic Data generation: Based on element pattern and test case, we employ LLM to map test data. The test data schema and test case with valid test steps are used as context for LLM to return proper mapping. If test data is not present, LLM’s in context learning is being used to generate meaningful synthetic data for given iteration to enable test suites for execution.
  2. 测试数据映射或合成数据生成:基于元素模式和测试用例,我们使用大语言模型 (LLM) 来映射测试数据。测试数据模式和包含有效测试步骤的测试用例被用作上下文,以便大语言模型返回正确的映射。如果测试数据不存在,则利用大语言模型的上下文学习能力为给定的迭代生成有意义的合成数据,以便测试套件能够执行。
  3. Interpreter for Test Suite to Test Automation Framework Language: Test automation framework library or tool can be used by writing code in supported programming languages or by using tool’s own language or representations. Our approach involves training LLM with appropriate tool specific knowledge to overcome this interpretation challenge. According to our experiments, LLM tends to understand codes better than tool specific languages. For interpretation, our approach converts each test steps into tool specific representations. The representation uses set of defined actions for the tool to be used.
  2. 测试套件到测试自动化框架语言的解释器:测试自动化框架库或工具可以通过使用支持的编程语言编写代码或使用工具自身的语言或表示来使用。我们的方法涉及训练大语言模型(LLM)以掌握特定工具的知识,从而克服这一解释挑战。根据我们的实验,LLM 往往比特定工具的语言更能理解代码。为了进行解释,我们的方法将每个测试步骤转换为特定工具的表示。该表示使用为要使用的工具定义的一组操作。
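The interpretation step described above can be sketched as a small mapping from abstract test steps to tool-specific statements. This is a minimal illustration, not the paper's implementation: `ACTION_MAP`, `interpret_step`, and the step fields are assumed names, and the output strings merely imitate Selenium's Python syntax.

```python
# Minimal sketch of the step-to-tool interpreter: each abstract test step
# produced by the LLM is converted into a Selenium-flavoured statement via
# a fixed set of defined actions. All names here are illustrative.

ACTION_MAP = {
    "enter_text": 'driver.find_element(By.{by}, "{locator}").send_keys("{value}")',
    "click": 'driver.find_element(By.{by}, "{locator}").click()',
    "assert_text": 'assert "{value}" in driver.page_source',
}

def interpret_step(step: dict) -> str:
    """Translate one abstract test step into a tool-specific statement."""
    template = ACTION_MAP[step["action"]]
    return template.format(
        by=step.get("by", "id").upper(),
        locator=step.get("locator", ""),
        value=step.get("value", ""),
    )

steps = [
    {"action": "enter_text", "by": "id", "locator": "user-name", "value": "standard_user"},
    {"action": "enter_text", "by": "id", "locator": "password", "value": "secret_sauce"},
    {"action": "click", "by": "id", "locator": "login-button"},
]
script_lines = [interpret_step(s) for s in steps]
```

Keeping the action set small and explicit is what makes the interpretation reliable: the LLM only has to emit steps drawn from a closed vocabulary, while the deterministic mapping handles tool syntax.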

3.5 Phase 5: Test Report and Analysis

3.5 阶段 5: 测试报告与分析

All test suites executed by the proposed system produce results. These results are easily understandable by test engineers but might not make much sense to individuals from a non-testing background. Our approach therefore adds LLM-based summaries on top of the results, which are easy to understand for any individual with a bit of technical knowledge. Our representation of the result also leverages the LLM's in-context learning, as the result must correlate with the web application for which the test cases were generated in the first place. To point out the exact issue in the system from the test result, our representation of the web application again plays a role, as a test-report enhancer. This behaviour mimics experienced test engineers, who draw on extensive knowledge of the web application when reporting test results.

所提出的系统执行的所有测试套件都会产生结果。系统产生的结果对于测试工程师来说很容易理解,但对于非测试背景的个人来说可能意义不大。因此,我们的方法包括在结果之上生成基于大语言模型的摘要,这些摘要对于具有一定技术知识的任何人都易于理解。我们的结果表示还利用了大语言模型的上下文学习能力,因为结果必须与最初生成测试用例的Web应用程序相关联。为了从测试结果中指出系统的确切问题,我们的Web应用程序表示再次扮演了测试报告增强器的角色。这种行为模仿了经验丰富的测试工程师,他们在生成测试结果报告时对Web应用程序有深入的了解。
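The pairing of execution results with the page representation can be sketched as a prompt-assembly step. This is an illustrative sketch only: the function and field names (`build_report_prompt`, `element_id`, etc.) are assumptions, not the paper's API.

```python
# Illustrative Phase-5 sketch: the raw execution result is paired with the
# page representation so the LLM can correlate each failure with a concrete
# element, as an experienced test engineer would when writing a report.

def build_report_prompt(result: dict, page_repr: dict) -> str:
    failed = [t for t in result["tests"] if t["status"] == "failed"]
    lines = [
        "You are a senior test engineer. Summarise this run for a non-tester.",
        f"Application page: {page_repr['page']}",
        f"Passed: {len(result['tests']) - len(failed)} / {len(result['tests'])}",
    ]
    for t in failed:
        # Point the LLM at the exact element the failing step touched.
        elem = page_repr["elements"].get(t["element_id"], "unknown element")
        lines.append(f"FAILED {t['id']}: step on '{elem}' -> {t['error']}")
    return "\n".join(lines)

run = {"tests": [
    {"id": "TC01", "status": "passed", "element_id": "login-button", "error": ""},
    {"id": "TC02", "status": "failed", "element_id": "user-name", "error": "timeout"},
]}
page = {"page": "Login", "elements": {"user-name": "username input",
                                      "login-button": "login button"}}
prompt = build_report_prompt(run, page)
```

Because the element descriptions come from the same representation used to generate the tests, the summary can name the failing widget rather than just the failing locator.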

The entire set of phases is represented in Fig. 1, which illustrates the process and data flow. Phase 1 starts with the scraper; phase 2 formats the data into the proposed representation. From phase 3 onwards the formatted data is utilized, and in phase 4 the represented format is expanded. Finally, in phase 5, the process flow ends with the final reasoning step, complete test reporting.

整个过程如图 1 所示。图 1 展示了各个阶段的流程和数据流。第一阶段从抓取器开始,然后第二阶段是关于将数据格式化为提议的表示形式。从第三阶段开始,格式化数据开始被利用,在第四阶段,表示形式得到扩展。最后在第五阶段,通过完整的测试报告,最终推理步骤结束,流程结束。

4 Experiment

4 实验

To evaluate our novel approach for developing hierarchical structure representation of enterprise web applications and its utility in enabling few-shot learning with large language models (LLMs), we conducted comprehensive experiments. These experiments were designed to demonstrate the effectiveness of AI-driven methodologies, including web scraping, site analysis, automated test case generation, and execution, across different web application contexts.

为了评估我们提出的用于开发企业Web应用程序层次结构表示的新方法及其在实现大语言模型(LLMs)少样本学习中的实用性,我们进行了全面的实验。这些实验旨在展示AI驱动方法(包括网络爬虫、站点分析、自动化测试用例生成和执行)在不同Web应用场景中的有效性。

4.1 Objectives

4.1 目标

  1. Validation of scraped data for sense-making from web applications
  1. 验证从Web应用程序抓取的数据能否用于意义理解


Figure 1: Overall phase flow

图 1: 整体阶段流程

4.1.1 Experimental Setup

4.1.1 实验设置

The experiments were carried out on two different enterprise web applications: Swag Labs, an e-commerce platform, and MediBox, a healthcare application deployed within the Atalgo engineering environment.

实验在两个不同的企业 Web 应用程序上进行:电子商务平台 Swag Labs 和部署在 Atalgo 工程环境中的医疗保健应用程序 MediBox。

Both applications provided an opportunity to test our approach under different structural and functional paradigms, allowing a broad evaluation of the methodology.

两个应用为我们提供了在不同结构和功能范式下测试方法的机会,从而能够对方法论进行广泛的评估。

4.2 Scope of Experiments

4.2 实验范围

To evaluate our approach, we need to evaluate the results of each phase individually. Each phase depends on our data representation, and we experimented with each phase using our data representation to assess the correctness of that representation. In this section, we discuss the experiments conducted.

为了评估我们的方法,我们需要分别评估每个阶段的结果。每个阶段都依赖于我们的数据表示,并且我们已经通过实验对每个阶段的数据表示进行了测试,以评估正确的表示方式。在本节中,我们将讨论所进行的实验。

4.2.1 Web Scraping and Data Extraction

4.2.1 网络爬虫与数据提取

For both applications, we utilized an AI-powered scraping module to extract metadata for each HTML element. This process was essential for building a structured hierarchical representation of the web applications. We initially tried extracting all features, and even tried extracting and passing the entire DOM tree. These earlier approaches induced problems: there was a lot of unnecessary data clouding the LLM's decisions, which resulted in a number of irrelevant test cases, and the quality of web application sense-making through the LLM was not up to the mark. Through our experiments, we found that elements need a unique identifier apart from their locator. Locators for sibling elements sometimes look almost identical, which can mislead the LLM so that a generated test case ends up targeting the wrong elements. The overall DOM tree can also be large, which makes it impractical to provide all of the data to the LLM for web application sense-making through in-context learning, since LLMs have a fixed context window. We cannot feed the LLM the entire DOM tree; even the essential extracted features alone are impractical. We have to make sure our solution works for any web application, as this is going to be the foundational technology behind generative-AI-based intelligent quality engineering. For the reasons outlined above, the traditional chunking method, which we tried before developing this approach, also does not work. In our current approach, we preserved the hierarchical order and then chunked the DOM tree page-wise. We captured the navigation flow as well; that way, even after page-wise DOM tree chunking, the LLM retains an understanding of the page flow and the next steps for test generation. The extracted data included:

在这两个应用中,我们利用了一个由AI驱动的抓取模块来提取每个HTML元素的元数据。这一过程对于构建Web应用的结构化层次表示至关重要。我们尝试过提取所有特征,甚至尝试提取并传递整个DOM树。早期的这些方法引发了一些问题。存在大量不必要的数据,这些数据干扰了大语言模型的决策,导致生成了许多不相关的测试用例。此外,通过大语言模型进行的Web应用理解效果也不理想。通过实验,我们发现元素除了定位器外还需要唯一的标识符。有时,兄弟元素的定位器看起来几乎相同,这可能会误导大语言模型,生成的测试用例可能会错误地指向错误的元素。整个DOM树可能非常大,这使得通过上下文学习向大语言模型提供整体数据以进行Web应用理解变得不切实际,因为大语言模型有固定的上下文窗口。我们不能将整个DOM树输入大语言模型,即使是仅提取的必要特征也不切实际。我们必须确保我们的解决方案适用于任何Web应用,因为这将是基于生成式AI的智能质量工程的基础技术。出于上述原因,传统的分块方法在我们开发这种方法之前尝试过,但也不奏效。在我们当前的方法中,我们保留了层次顺序,然后按页面分块DOM树。我们还捕获了导航流程。这样,即使在按页面分块DOM树后,也能确保大语言模型理解页面流程和测试生成的下一步。提取的数据包括:
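The page-wise chunking with preserved navigation flow can be sketched as follows. The data shapes (`pages`, `nav_flow`, and the chunk fields) are illustrative assumptions, not the paper's actual structures.

```python
# Minimal sketch of page-wise DOM chunking: instead of the whole DOM tree,
# each chunk holds one page's elements in hierarchical order plus the
# navigation edges out of that page, so cross-page flow stays visible.

def chunk_site(pages: dict, nav_flow: dict) -> list:
    """Produce one LLM-sized chunk per page, annotated with navigation."""
    chunks = []
    for name, elements in pages.items():
        chunks.append({
            "page": name,
            "elements": elements,                    # hierarchical order preserved
            "navigates_to": nav_flow.get(name, []),  # keeps cross-page flow
        })
    return chunks

pages = {
    "Login": ["input#user-name", "input#password", "button#login-button"],
    "Inventory": ["div.inventory_list", "button.add-to-cart"],
}
nav = {"Login": ["Inventory"]}
chunks = chunk_site(pages, nav)
```

Each chunk now fits comfortably in a context window, while the `navigates_to` edges tell the LLM which page a successful interaction leads to, which is what enables multi-page test generation.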

For instance, in Swag Labs, interactive elements such as the username and password input fields, login button, and error containers were captured. Similarly, in MediBox test application, key elements on the signup form, including fields for full name, contact number, email, password, and confirmation password, were identified.

例如,在 Swag Labs 中,捕获了用户名和密码输入字段、登录按钮和错误容器等交互元素。同样,在 MediBox 测试应用程序中,识别了注册表单上的关键元素,包括全名、联系电话、电子邮件、密码和确认密码等字段。

4.2.2 Site Analysis and Hierarchical Representation

4.2.2 站点分析与层次化表示

The scraped data was fed into a site analysis module to organize elements into logical sections and establish their relationships. This module created a hierarchical representation of the web applications by:

抓取的数据被输入到站点分析模块中,以将元素组织成逻辑部分并建立它们之间的关系。该模块通过以下方式创建了Web应用程序的层次结构表示:

• Categorizing elements into sections such as navigation links, input forms, and feedback mechanisms.
• Mapping relationships to construct a complete DOM tree, allowing downstream processing to focus on key interactive regions of the applications.

• 将元素分类为导航链接、输入表单和反馈机制等部分。
• 映射关系以构建完整的 DOM 树,使下游处理能够专注于应用程序的关键交互区域。
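The categorization step can be illustrated with a toy rule-based grouping. The rules and field names below are hypothetical, chosen only to show the section structure the analysis module might emit; the actual module is AI-driven.

```python
# Hypothetical sketch of the site-analysis categorization: scraped elements
# are grouped into the sections named above (navigation, forms, feedback).
# The classification rules here are illustrative stand-ins.

def analyse(elements: list) -> dict:
    sections = {"navigation": [], "forms": [], "feedback": []}
    for e in elements:
        if e["tag"] == "a":
            sections["navigation"].append(e["id"])
        elif e["tag"] in ("input", "button", "select"):
            sections["forms"].append(e["id"])
        elif "error" in e.get("class", ""):
            sections["feedback"].append(e["id"])
    return sections

elems = [
    {"tag": "a", "id": "nav-home"},
    {"tag": "input", "id": "user-name"},
    {"tag": "button", "id": "login-button"},
    {"tag": "div", "id": "error-box", "class": "error-message-container"},
]
site = analyse(elems)
```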

The analysis facilitated by our approach effectively transformed unstructured web data into a hierarchical format, serving as a foundation for automated test case generation.

我们的方法所促进的分析有效地将非结构化网络数据转换为分层格式,为自动化测试用例生成奠定了基础。

4.2.3 Automated Test Case Generation Using LLMs

4.2.3 使用大语言模型自动生成测试用例

Our methodology leveraged generative AI and LLMs to automatically generate functional test cases. The structured hierarchical data served as input, and the LLMs were prompted to create test cases covering diverse scenarios, including:

我们的方法利用生成式 AI (Generative AI) 和大语言模型 (LLM) 自动生成功能测试用例。结构化分层数据作为输入,大语言模型被提示创建涵盖多种场景的测试用例,包括:

For example, in MediBox, test cases included verifying unique email and mobile number registration, password strength validation, and successful user registration.

例如,在MediBox中,测试用例包括验证唯一电子邮件和手机号码注册、密码强度验证以及成功用户注册。
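The few-shot prompting described above can be sketched as a prompt-assembly helper: one worked example is placed in context next to the page chunk so the LLM can imitate the test-case format. The exact prompt wording and format here are assumptions.

```python
# Illustrative few-shot prompt assembly for test-case generation. The
# example case and field names are made up for demonstration.

EXAMPLE_CASE = ("TC01 | Login with valid credentials | High | "
                "Enter username, enter password, click login, expect inventory page")

def build_generation_prompt(page_chunk: dict, n_cases: int = 10) -> str:
    return "\n".join([
        "Generate functional test cases in the format shown.",
        f"Example: {EXAMPLE_CASE}",
        f"Page: {page_chunk['page']}",
        "Elements: " + ", ".join(page_chunk["elements"]),
        f"Produce {n_cases} cases covering positive and negative scenarios.",
    ])

chunk = {"page": "Login",
         "elements": ["input#user-name", "input#password", "button#login-button"]}
prompt = build_generation_prompt(chunk)
```

Because the elements come from the hierarchical representation, the generated cases reference real locators rather than invented ones.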

4.2.4 Test Case Execution and Data Handling

4.2.4 测试用例执行与数据处理

Once generated, the test cases were executed systematically against the respective applications. The execution process included:

生成测试用例后,系统地对相应的应用程序执行了这些测试用例。执行过程包括:

Execution results, including logs and screenshots, were stored for further analysis.

执行结果,包括日志和截图,被存储以供进一步分析。

4.3 Applications and Scope

4.3 应用与范围

The hierarchical representation enabled effective utilization of LLMs in a few-shot learning setup, reducing the dependency on large labelled datasets. By focusing on structural and functional representations of web applications, the approach provided the following advantages:

分层表示使得在大语言模型 (LLM) 的少样本学习设置中能够有效利用,减少了对大规模标注数据集的依赖。通过关注 Web 应用程序的结构和功能表示,该方法提供了以下优势:

• Scalability: The pipeline could adapt to various web applications with minimal customization.
• Automation: From data extraction to test case execution, the process minimized human intervention.
• Precision: The hierarchical representation ensured that LLMs received contextually relevant prompts, leading to accurate and comprehensive test cases.

• 可扩展性 (Scalability):该管道能够以最少的定制适应各种网络应用程序。
• 自动化 (Automation):从数据提取到测试用例执行,整个过程最大限度地减少了人为干预。
• 精确性 (Precision):分层表示确保了大语言模型接收到上下文相关的提示,从而生成准确且全面的测试用例。

4.4 Experimental Design Highlights

4.4 实验设计亮点

For Swag Labs, 10 functional test cases focused on login validation were developed and executed. For MediBox, another set of 10 test cases targeted the user signup process, ensuring robust field validation and error handling. Each test case was categorized by priority, with high-priority cases addressing core functionalities such as login authentication and secure user registration.

针对 Swag Labs,开发并执行了 10 个专注于登录验证的功能测试用例。对于 MediBox,另一组 10 个测试用例针对用户注册流程,确保强大的字段验证和错误处理。每个测试用例都按优先级分类,高优先级用例涉及核心功能,如登录认证和安全用户注册。

4.5 Execution Workflow

4.5 执行工作流

The execution of test cases across both Swag Labs and MediBox applications followed a systematic workflow, leveraging the hierarchical representation to guide the interactions:

在 Swag Labs 和 MediBox 应用程序中执行测试用例遵循了系统化的工作流程,利用层次化表示来指导交互:

1. Test Case Initialization:

1. 测试用例初始化:

2. Environment Setup:

2. 环境设置:

3. Simulation of User Interactions:

3. 用户交互模拟:

4. Real-Time Validation:

4. 实时验证:

5. Dynamic Data Handling:

5. 动态数据处理:

• Synthetic data generation modules were employed to provide realistic test inputs, particularly for MediBox's signup form, covering edge cases like invalid email formats and weak passwords.
• This ensured that the testing covered a wide spectrum of potential user inputs.

• 使用了合成数据生成模块来提供真实的测试输入,特别是针对 MediBox 的注册表单,涵盖了无效电子邮件格式和弱密码等边缘情况。
• 这确保了测试覆盖了广泛的潜在用户输入。
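In the paper, these synthetic inputs come from the LLM's in-context learning; the deterministic stand-in below only illustrates the kinds of valid and edge-case signup values that were covered. All values and field names are made up.

```python
# Stand-in for LLM-generated synthetic signup data: one valid baseline plus
# deliberately invalid variants for negative tests (values are invented).

def signup_test_inputs():
    valid = {
        "full_name": "Jane Doe",
        "email": "jane.doe@example.com",
        "mobile": "01712345678",
        "password": "Str0ng!Pass",
    }
    edge_cases = [
        {**valid, "email": "jane.doe@", "expect": "invalid email format"},
        {**valid, "password": "123", "expect": "weak password"},
        {**valid, "mobile": "12ab", "expect": "invalid mobile format"},
    ]
    return valid, edge_cases

valid, edge_cases = signup_test_inputs()
```

Deriving each edge case from the valid baseline keeps every other field well-formed, so a failure can be attributed to the single invalid field.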

6. Logging and Reporting:

6. 日志记录与报告:

4.6 Challenges Encountered

4.6 遇到的挑战

While the experimental pipeline was robust, certain challenges highlighted areas for further refinement:

虽然实验流程稳健,但某些挑战凸显了进一步改进的领域:

4.7 Evaluation Metrics

4.7 评估指标

To evaluate the effectiveness of the experiment, the following metrics were used:

为了评估实验的有效性,使用了以下指标:

5 Result and Analysis

5 结果与分析

As part of our experiments, we generated and executed test suites. The ultimate goal is to provide quality assurance for software applications using generative AI. For the purposes of this specific research, we wanted to confirm whether our unique approach to site representation, built to enable the LLM to conduct in-context learning, helps us achieve effective quality engineering. We discussed as part of the methodology how this approach to web application representation is foundational to achieving AI-augmented quality engineering. To evaluate the overall efficiency of our data structure, we assessed the quality of the generated test cases. The evaluation criteria we established for this assessment are:

作为我们实验的一部分,我们生成并执行了测试套件。最终目标是能够使用生成式 AI 为软件应用程序提供质量保证。在本研究中,我们希望确认我们独特的站点表示方法是否有助于实现有效的质量工程,从而使大语言模型能够进行上下文学习。我们在方法论部分讨论了这种 Web 应用程序表示方法如何为实现 AI 增强的质量工程奠定基础。为了评估我们数据结构的整体效率,我们评估了生成的测试用例的质量。我们为此评估建立的评估标准是:

These evaluation criteria are relevant for any enterprise test automation project and are regularly used by automation engineers, implicitly or explicitly. The final test execution results are produced by the Selenium test automation tool, which is orchestrated by our AI platform, Flame. Based on the outlined evaluation criteria, the performance and efficiency of the developed approach for representing enterprise web application structures using Large Language Models (LLMs) in the service of intelligent quality engineering were analysed using two distinct use cases: the Swag Labs web application and Atalgo Engineering's development/testing environment, known as the MediBox platform. We present below the results of the various components of our experiments, highlighting the relevance and impact of the novel structure in overall functional testing. We will showcase the generated test suites for both the MediBox and Swag Labs platforms, then provide an assessment against the evaluation criteria.

这些评估标准适用于任何企业测试自动化项目,并且经常被自动化工程师隐式或显式地使用。最终的测试执行结果由 Selenium 测试自动化工具生成,该工具由我们的 AI 平台 Flame 进行编排。基于上述评估标准,我们通过两个不同的用例分析了使用大语言模型 (LLMs) 表示企业 Web 应用程序结构以服务于智能质量工程的性能和效率:Swag Labs Web 应用程序和 Atalgo Engineering 的开发/测试环境,即 MediBox 平台。我们在下面展示了实验的各个组件的结果,强调了新结构在整体功能测试中的相关性和影响。我们将展示为 MediBox 和 Swag Labs 平台生成的测试套件,然后根据评估标准进行评估。

5.1 Evaluation

5.1 评估

We split the entire evaluation process into five steps. As the novel representation method is used across five phases, we had to evaluate each phase to obtain a complete evaluation. The phases are sequentially dependent, i.e., the next step always depends on the output of the previous step. We discuss the evaluation steps in this section:

我们将整个评估过程分为5个步骤。由于新的表示方法在5个阶段中使用,我们必须对每个阶段进行评估以获得完整的评估结果。这些过程或阶段是顺序依赖的,即下一步总是依赖于上一步的输出。让我们在本节中讨论评估步骤:

Table 1: Generated Test Suite for Medibox Application

表 1: Medibox 应用程序生成的测试套件

测试用例 ID 测试用例名称 优先级 描述
TC01 验证导航到用户注册页面 检查是否可以从主页导航到用户注册页面。
TC02 验证唯一手机号注册 确保在注册过程中每个用户的手机号是唯一的。
TC03 验证唯一邮箱地址注册 确保在注册过程中每个用户的邮箱地址是唯一的。
TC04 验证密码和确认密码匹配 检查在表单提交前密码和确认密码字段是否匹配。
TC05 验证用户注册表单中的必填字段 测试邮箱字段是否接受有效输入。
TC06 验证用户注册成功 确保在表单提交前所有必填字段都已填写。
TC07 验证密码强度要求 确保密码符合所需的强度标准。
TC08 验证邮箱地址格式验证 检查输入的邮箱地址是否符合正确的格式。
TC09 验证手机号格式验证 检查输入的手机号是否符合正确的格式。
TC10 验证注册后导航到登录页面 检查用户注册后是否可以导航到登录页面。

Table 2: Generated Test Suite for Swag Labs

表 2: Swag Labs 生成的测试套件

测试用例 ID 测试用例名称 优先级 描述
TC01 使用有效凭据登录 使用有效的用户名/密码测试登录功能。
TC02 使用无效用户名登录 使用无效用户名和有效密码测试登录。
TC03 使用无效密码登录 使用有效用户名和无效密码测试登录。
TC04 使用空用户名登录 验证用户名为空时的登录错误。
TC05 使用空密码登录 验证密码为空时的登录错误。
TC06 使用锁定用户登录 测试锁定用户的凭据返回正确的错误。
TC07 使用性能故障用户登录 使用 performance_glitch_user 测试登录。
TC08 使用所有字段为空登录 验证用户名/密码为空时的错误消息。
TC09 验证无效输入的错误消息 验证凭据无效时的错误消息。
TC10 使用问题用户登录 使用 problem_user 测试登录并验证问题。

5.1.1 Test Case Execution Success Rate

5.1.1 测试用例执行成功率

The test execution success rate was calculated as the percentage of test cases that executed successfully without errors (and behaved as expected). We used Selenium (a test automation platform) to perform automated execution of the generated test suites. The results for both applications are summarized in Table 3.

测试执行成功率计算为成功执行且无错误的测试用例所占百分比(预期它们的行为方式相同)。我们使用了 Selenium(测试自动化平台)从生成的测试套件中执行自动化测试。两个应用程序的结果总结在表 3 中。

Table 3: Test Case Execution Success Rate

表 3: 测试用例执行成功率

应用程序 总测试用例数 通过的测试用例数 失败的测试用例数 成功率 (%)
Swag Labs 10 9 1 90.00
MediBox 10 7 3 70.00
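The success rates in Table 3 follow directly from the pass counts; a one-line helper makes the computation explicit:

```python
# Table 3's success rate is simply passed / total expressed as a percentage.
def success_rate(passed: int, total: int) -> float:
    return round(100.0 * passed / total, 2)

swag_labs = success_rate(9, 10)   # 90.0 for Swag Labs
medibox = success_rate(7, 10)     # 70.0 for MediBox
```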

Swag Labs Results:

Swag Labs 结果:

• Total Tests Executed: 10

• 执行的总测试数:10

• Success Rate: 90%
• Observations:

• 成功率:90%
• 观察结果:

MediBox Results:

MediBox 结果:

• Total Tests Executed: 10

• 执行的测试总数:10

• Success Rate: 70%

• 成功率:70%

• Observations:

• 观察:

– 7 test cases passed successfully, including critical validations for email format, mobile number uniqueness, and password matching.

– 7 个测试用例成功通过,包括对电子邮件格式、手机号唯一性和密码匹配的关键验证。

5.1.2 Relevance of Test Cases Based on Instructions

5.1.2 基于指令的测试用例相关性

Swag Labs:

Swag Labs:

• Instructions: Fetch all input fields, buttons, and labels from the homepage of Sauce Demo (https://www.saucedemo.com/) and provide the details for each element, including their id, class, and attributes. Generate a detailed test plan for logging in to the Sauce Demo application using valid credentials (standard_user/secret_sauce) and invalid credentials. Include preconditions, steps, expected outcomes, and priority for each test. Create a Selenium test script in Python to automate the login functionality of Sauce Demo. The script should test both successful and failed login attempts and validate error messages. Execute the generated Selenium test script for Sauce Demo and provide a detailed test summary, including pass/fail results, screenshots of failures, and execution time for each test.

• 指令:从Sauce Demo的主页(https://www.saucedemo.com/)获取所有输入字段、按钮和标签,并提供每个元素的详细信息,包括它们的id、class和属性。生成一个详细的测试计划,用于使用有效凭证(standard_user/secret_sauce)和无效凭证登录Sauce Demo应用程序。包括每个测试的前提条件、步骤、预期结果和优先级。创建一个Python语言的Selenium测试脚本,用于自动化Sauce Demo的登录功能。该脚本应测试成功和失败的登录尝试,并验证错误消息。执行生成的Selenium测试脚本,并提供详细的测试摘要,包括通过/失败结果、失败的截图以及每个测试的执行时间。

• Test cases directly addressed the login functionality’s critical scenarios, confirming adherence to instructional goals.

• 测试用例直接针对登录功能的关键场景,确认其符合指令目标。

• Coverage included variations in credentials, field emptiness, and user-specific conditions.

• 覆盖范围包括凭证的变体、字段为空以及用户特定条件。

Medibox:

Medibox:

• Instructions: Create and execute a minimum of 10 functional test scripts specifically for the user signup process.

• 指令:创建并执行至少10个专门用于用户注册流程的功能测试脚本。

• Deployed on our experiment environment
• User Signup Endpoint URL: /UserSignup

• 部署在我们的实验环境中
• 用户注册端点 URL: /UserSignup

• Ensure the test cases cover a wide range of scenarios for comprehensive validation.

• 确保测试用例涵盖广泛的场景,以进行全面验证。

Details:

详情:

• Test cases were comprehensive, targeting user registration with field validations, navigation checks, and logical consistency.
• Generated scenarios matched the specified instructions for ensuring accurate and secure user onboarding.

• 测试用例全面,针对用户注册的字段验证、导航检查和逻辑一致性。
• 生成的场景符合确保准确和安全用户注册的指定指令。

5.1.3 Relevance of Test Cases for the Web Application

5.1.3 Web 应用程序测试用例的相关性

Swag Labs:

Swag Labs:

• Structural insights from the DOM allowed the generation of contextually relevant test cases tailored to the application's architecture.
• Example: Dynamic field validations and error feedback mechanisms were precisely targeted.

• 从 DOM 中获得的结构洞察使我们能够生成与应用程序架构相关的上下文测试用例。
• 示例:动态字段验证和错误反馈机制被精确地定位。

MediBox:

MediBox:

5.1.4 Test Data Mappings Relevance

5.1.4 测试数据映射的相关性

Swag Labs:

Swag Labs:

• Mapped data consistently reflected the application's structural metadata, enabling accurate input-output scenarios.
• Example: User-specific identifiers like locked_out_user and problem_user ensured realistic validation.

• 映射的数据始终反映应用程序的结构元数据,确保了准确的输入输出场景。
• 示例:诸如 locked_out_user 和 problem_user 之类的用户特定标识符确保了真实的验证。

MediBox:

MediBox:

5.1.5 Synthetic Data Generation Contextual Relevance

5.1.5 合成数据生成的上下文相关性

Swag Labs:

Swag Labs:

MediBox:

MediBox:

Contextual relevance was evident in the generated data for the email, mobile number, and password fields. Observations:
– Password strength validation scenarios benefited significantly from the synthesized data.

在生成的电子邮件、手机号码和密码字段数据中,上下文相关性显而易见。观察结果:
– 密码强度验证场景从合成数据中受益匪浅。

Although not directly related to the objective of this specific research, we also tracked the time taken for each of these activities, because in real-world projects speed of execution is highly important and provides significant savings at scale. We concluded that once the initial setup and the configuration of the requirements, instructions, etc. are complete, this approach provides significant time savings (upwards of 50%) compared to a traditional test automation approach. The saving is more pronounced in the maintenance phase of the project and becomes significant as the software application scales.

虽然这些活动与本研究的具体目标没有直接关系,但我们记录了每项活动所花费的时间,因为在实际项目中,执行速度至关重要,这种方法在大规模应用中能显著节省时间。我们得出的结论是,一旦完成初始设置以及需求、指令等的配置,与传统测试自动化方法相比,这种方法可以显著节省时间(超过 50%)。这种节省在项目的维护阶段更为明显,并且随着软件应用的扩展而变得更加显著。

Table 4: Summary Table of Results

表 4: 结果汇总表

标准 Swag Labs 成功率 MediBox 成功率
测试用例执行成功率 90% 70%
指令相关性
网络应用相关性
数据映射相关性
合成数据上下文相关性

6 Future Work and Conclusion

6 未来工作与结论

Our research demonstrates the successful application of LLMs' few-shot learning capabilities in automated test script generation and execution, despite using models not specifically trained for the functional testing domain. The results validate our approach to hierarchical web application representation while also highlighting areas for future enhancement. From this point onward, towards the goal of a computational model of quality engineering, two primary directions emerge for future research. First, we plan to fine-tune LLMs specifically for the test automation domain using curated datasets. This specialized training should address current limitations in handling large context windows and input sizes for individual requests. Second, we propose developing a knowledge graph architecture to store and efficiently retrieve element-level data at run-time, potentially reducing the input size required for page-level test case generation.

Although our solution demonstrates robust performance across most web applications in generating quality test suites, we acknowledge certain limitations. Applications with exceptionally large DOM structures can challenge our algorithm's ability to maintain hierarchical relationships. Furthermore, the substantial input sizes required by our representation method may lead to increased costs when using commercial LLM services, although the benefits of improved test coverage and maintenance efficiency will often justify this investment. The methodology's strength lies in its ability to enable LLMs to comprehend and interact with web application structures through in-context learning, producing contextually relevant test cases. Looking ahead, we believe that the integration of domain-specific fine-tuning and knowledge-graph-based element retrieval will significantly mitigate current limitations, further enhancing the scalability and cost-effectiveness of our approach.
These improvements will pave the way for more efficient and autonomous quality engineering processes in enterprise web applications.

我们的研究表明,尽管使用的模型并未专门针对功能测试领域进行训练,但大语言模型 (LLM) 的少样本学习能力在自动化测试脚本生成和执行中得到了成功应用。结果验证了我们在分层 Web 应用程序表示方面的方法,同时也指出了未来需要改进的领域。从这一点出发,朝着质量工程计算模型的目标,未来的研究有两个主要方向。首先,我们计划使用精选的数据集专门为测试自动化领域微调大语言模型。这种专门的训练应解决当前在处理大上下文窗口和单个请求的输入大小方面的限制。其次,我们建议开发一种知识图谱架构,以在运行时存储和高效检索元素级数据,从而可能减少页面级测试用例生成所需的输入大小。尽管我们的解决方案在生成高质量测试套件方面在大多数 Web 应用程序中表现出色,但我们承认存在某些限制。具有异常大 DOM 结构的应用程序可能会挑战我们算法维护分层关系的能力。此外,我们的表示方法所需的较大输入大小可能会增加使用商业大语言模型服务的成本,尽管改进的测试覆盖率和维护效率的好处通常会证明这一投资的合理性。该方法的力量在于它能够通过上下文学习使大语言模型理解并与 Web 应用程序结构交互,从而生成上下文相关的测试用例。展望未来,我们相信,结合领域特定的微调和基于知识图谱的元素检索将显著缓解当前的限制,进一步提高我们方法的可扩展性和成本效益。这些改进将为企业 Web 应用程序中更高效和自主的质量工程流程铺平道路。
