Digital Twin Buildings: 3D Modeling, GIS Integration, and Visual Descriptions Using Gaussian Splatting, ChatGPT/Deepseek, and Google Maps Platform
Abstract—Urban digital twins are virtual replicas of cities that use multi-source data and data analytics to optimize urban planning, infrastructure management, and decision-making. Towards this, we propose a framework focused on the single-building scale. By connecting to cloud mapping platforms such as the Google Maps Platform APIs, by leveraging state-of-the-art multi-agent Large Language Model data analysis using ChatGPT(4o) and Deepseek-V3/R1, and by using our Gaussian Splatting-based mesh extraction pipeline, our Digital Twin Buildings framework can retrieve a building's 3D model and visual descriptions, and achieve cloud-based mapping integration with large language model-based data analytics, all from a building's address, postal code, or geographic coordinates.
Index Terms—Gaussian Splatting, ChatGPT, Deepseek, Large Language Models, Multi-Agent, AI, 3D Reconstruction, Google Maps, Remote Sensing, Urban Buildings, Urban Digital Twin
I. INTRODUCTION
In this manuscript, we present Digital Twin Building (DTB), a framework that extracts the 3D mesh model of a building and combines Cloud Mapping Service Integration with Multi-Agent Large Language Models (LLMs) for data analysis. Within the scope of this paper, we use the framework to retrieve Gaussian Splatting models and 3D mesh models. We also retrieve fundamental geocoding information, mapping information, and 2D images, and perform visual analysis on the 2D images using the Multi-Agent LLM module. This is shown in Fig. 1.
Depending on need, the Google Maps Platform Integration can also retrieve local elevation maps, real-time traffic data, air quality data and access other data sources and services, which can then be analyzed.
Our contributions are as follows.
II. BACKGROUND AND RELATED WORKS
A. ChatGPT/Deepseek and API
Large Language Models (LLMs) are neural networks, typically Transformer-based [1], pre-trained on extensive, diverse text/image corpora, typically sourced from web crawls. These models, designed for Natural Language Processing (NLP), typically interpret text-based prompts and generate text-based outputs. Certain models, such as "Deepseek-V3/R1" and their variants [2], [3], support optical character recognition (OCR, i.e., reading text from images). Models like "ChatGPT-4o" [4] and its variants additionally support full interpretation and analysis of image content.
LLMs have achieved widespread adoption since 2023. Beyond basic image and text interpretation, these models recently exhibited expert-level problem-solving in various scientific and engineering domains [5], [6].
Due to their large size, LLMs often face hardware constraints for local deployment. While popular LLM providers such as OpenAI and Deepseek provide web browser interfaces for their models, they also offer Application Programming Interfaces (APIs). These APIs enable client-side software or code to query LLMs hosted on OpenAI or Deepseek servers, facilitating large-scale data processing without requiring human-in-the-loop manipulations via browser interfaces. Unlike traditional local deep learning, which necessitates GPUs for both training and inference, API-based LLM querying requires minimal local hardware and can be deployed on devices such as mobile phones.
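As a minimal sketch of such API-based querying, the snippet below assembles the JSON body of a chat-completions request in the OpenAI-compatible schema that both providers accept; the model name, prompts, and coordinates are illustrative placeholders, not values from our experiments.

```python
import json

def build_chat_request(model: str, system_prompt: str, user_prompt: str,
                       temperature: float = 0.0) -> dict:
    """Assemble the JSON body for an OpenAI-compatible chat-completions call."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

if __name__ == "__main__":
    body = build_chat_request(
        model="deepseek-chat",
        system_prompt="You are a GIS analysis assistant.",
        user_prompt="Summarize the land use around 45.5019 N, 73.5674 W.",
    )
    # POSTing this body to the provider's chat-completions endpoint
    # (e.g., with the official `openai` client pointed at the provider's
    # base URL) returns the model's reply; no local GPU is required.
    print(json.dumps(body, indent=2))
```

The same request body works against either provider by switching the endpoint and `model` field, which is what makes large-scale, scripted querying practical on modest client hardware.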
Fig. 1: Diagram of our Digital Twin Building framework. External modules are boxed in red. Our own tools/modules are boxed in blue. The aspects specifically presented in this paper are in dark blue. Data (both inputs and outputs) are drawn in plain text.
TABLE I: IMPORTANT DEEPSEEK AND OPENAI LLMs
Table compiled on 2025-01-31. OpenAI models are not open-sourced; their model sizes (parameters) are estimated (B = billions, M = millions). ¹We did not include gpt-o1 in our experiments due to cost, but we include its specifications for comparison. *The Deepseek V3 API call input token price is discounted by 90% if input caching is used for repeated identical prompting.
| Model Name | Model Class | Model Type | Image Processing | Parameters | API call price / 1M Input Tokens (USD) | API call price / 1M Output Tokens (USD) |
|---|---|---|---|---|---|---|
| chatgpt-4o-latest | GPT-4o | Autoregressive | Analysis | ~1000+ B | 2.5 | 10 |
| gpt-4o-mini | GPT-4o Mini | Autoregressive | Analysis | ~10s of B | 0.15 | 0.6 |
| deepseek-chat | Deepseek V3 | Autoregressive | OCR | 671B | 0.14* | 1.10 |
| deepseek-reasoner | Deepseek R1 (V3-base) | Reasoning | OCR | 671B | 0.14 | 2.19 |
| gpt-o1¹ | GPT-o1 (GPT4-base) | Reasoning | None | ~175B | 15 | 60 |
B. Google Maps Platform API
Google Maps Platform is a cloud-based mapping service and part of Google Cloud. Its API allows a client device to connect to various cloud-based GIS, mapping, and remote sensing services hosted on Google Cloud servers.
The services utilized in this research include remote sensing image retrieval, map retrieval, elevation data retrieval, geocoding/reverse geocoding, and building polygon retrieval. However, Google Maps Platform also offers other APIs for urban and environmental research, including real-time traffic data, solar potential data, air quality data, and plant pollen data, in addition to the full suite of commonly used Google Maps navigation and mapping tools.
Although less well known in the remote sensing and GIS community than its sister application Google Earth Engine, Google Maps Platform has been used in a variety of GIS research, including navigation, object tracking, city modeling, image and map retrieval, and geospatial data analysis for commercial and industrial applications [7]–[10]. It is also used in many commercial software packages for cloud-based mapping integration.
C. Google Earth Studio
Google Earth Studio [11] is a web-based animation tool that leverages Google Earth's satellite imagery and 3D terrain data. The tool is especially useful for creating geospatial visualizations, as it is integrated with Google Earth's geographic data. It allows for the retrieval of images from user-specified camera poses at user-specified locations. In this research, we use Google Earth Studio to retrieve $360^{\circ}$ multi-view remote sensing images of a building from its address, postal code, place name, or geographic coordinates, following [12], [13].
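To make the camera-pose specification concrete, the sketch below places an orbit of camera keyframes on a circle around a building's coordinates, in the spirit of the 360° multi-view capture described above. The keyframe layout (per-frame lat/lng/altitude with a look-at target) is an illustrative assumption; Earth Studio's actual project format differs.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used for small-angle offsets

def orbit_keyframes(lat: float, lng: float, radius_m: float,
                    altitude_m: float, n_frames: int) -> list[dict]:
    """Place n_frames cameras evenly on a circle of radius_m around (lat, lng)."""
    frames = []
    for k in range(n_frames):
        theta = 2.0 * math.pi * k / n_frames
        # Convert metric offsets to degrees; valid for small radii away from the poles.
        dlat = math.degrees(radius_m * math.cos(theta) / EARTH_RADIUS_M)
        dlng = math.degrees(radius_m * math.sin(theta) /
                            (EARTH_RADIUS_M * math.cos(math.radians(lat))))
        frames.append({
            "lat": lat + dlat,
            "lng": lng + dlng,
            "altitude_m": altitude_m,
            "look_at": (lat, lng),  # every camera aims at the building center
        })
    return frames
```

Each keyframe can then be mapped onto whatever pose format the rendering tool expects; only the orbit geometry is shown here.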
III. METHOD
A. Gaussian Building Mesh Extraction
We use the mesh extraction procedure we introduced in December 2024 [13]. For conciseness, the process is briefly described here, and is not benchmarked. We refer the readers to [13] for the original implementation details and benchmark comparisons. We also refer the readers to [14] for background and theory on Gaussian Splatting.
Gaussian Building Mesh (GBM) [13] is an end-to-end 3D building mesh extraction pipeline we recently proposed, leveraging Google Earth Studio [11], Segment Anything Model 2 (SAM2) [15] and Grounding DINO [16], and a modified [17] Gaussian Splatting [14] model. The pipeline enables the extraction of a 3D building mesh from inputs such as the building's name, address, postal code, or geographic coordinates. Since GBM uses Gaussian Splatting (GS) as its 3D representation, it also allows for the synthesis of new photorealistic 2D images of the building under different viewpoints.
B. Google Maps Platform Integration
We use the Python client binding for Google Maps Platform Services APIs to create an integration tool to automatically retrieve the GIS and mapping information of a building. For these image analysis experiments, the data is retrieved with four API calls. The first is a Google Maps Platform Geocoding/Reverse Geocoding API call which retrieves the complete address information including geographic coordinates, entrance(s) coordinates, and building polygon mask vertex coordinates. Then, a Google Maps Platform Elevation API call is used to retrieve the ground elevation using the building’s coordinates as input. Additional API calls to other Cloud Services can also be performed at this step. Finally, two API calls are made using the Google Maps Platform Static Maps API to retrieve map(s) and satellite/aerial image(s) at the desired zoom level. This process is illustrated in Figure 2. The aerial/satellite image(s) are then used as one of the inputs to our Multi-Agent LLM Module.
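The four-call sequence can be sketched as plain request construction against Google Maps Platform's documented web-service endpoints (Geocoding, Elevation, Static Maps). In practice we use the official Python client bindings; the URL builders and API key below are a simplified, illustrative stand-in.

```python
from urllib.parse import urlencode

BASE = "https://maps.googleapis.com/maps/api"

def geocode_url(address: str, key: str) -> str:
    """Call 1: address -> coordinates, entrances, building polygon."""
    return f"{BASE}/geocode/json?" + urlencode({"address": address, "key": key})

def elevation_url(lat: float, lng: float, key: str) -> str:
    """Call 2: ground elevation at the building's coordinates."""
    return f"{BASE}/elevation/json?" + urlencode(
        {"locations": f"{lat},{lng}", "key": key})

def static_map_url(lat: float, lng: float, zoom: int, maptype: str,
                   key: str, size: str = "640x640") -> str:
    """Calls 3 and 4: street map (maptype='roadmap') and satellite image
    (maptype='satellite') at the desired zoom level."""
    return f"{BASE}/staticmap?" + urlencode({
        "center": f"{lat},{lng}", "zoom": zoom,
        "size": size, "maptype": maptype, "key": key})
```

Fetching `static_map_url(lat, lng, 18, "roadmap", KEY)` and `static_map_url(lat, lng, 18, "satellite", KEY)` yields the map and aerial/satellite images that feed the Multi-Agent LLM module.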
Our Google Maps Platform Integration can easily be modified to retrieve additional data from the cloud-based mapping service by adding parallel API calls below the Geocoding/Reverse Geocoding API call. For example, if we wish to analyze real-time traffic data, we can simply perform API calls to the Traffic API.
Fig. 2: Diagram of our Google Map Services Integration Tool.
C. Multi-Agent LLM Analysis of Multi-View/Scale Images
The motivation for this module is to create a multi-agent LLM system to analyze the data retrieved through the Google Cloud Platform Services integration. In this paper, we restrict its scope to multi-agent content analysis of multi-view/multi-scale images.
For this analysis, the primary goal is to retrieve and store keywords describing the architectural style, function, landscape, and architectural elements of the building. On the premise that an accurate set of keywords should allow an agent to reconstruct a text-based description of the image without actually seeing the image, we also generate a caption from the keywords as a secondary objective. This process is illustrated in Fig. 3.
From a building's address, place name, postal code, or geographic coordinates, we retrieve multi-view off-nadir images of the building of interest using Google Earth Studio (or reuse the ones previously retrieved in the GBM module). We also retrieve top-down aerial/satellite image(s) of the building at different scales using the Google Maps Platform Integration. For each image, we initiate a GPT-4o/GPT-4o-mini agent and prompt it to analyze the image and retrieve a set of keywords. We then initiate two agents: one to aggregate the keywords from all the images of the building, and one to turn the aggregated keywords into a human-readable caption description.
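The aggregation agent's job can be approximated deterministically, which is useful for sanity-checking its output: merge the per-image keyword sets and keep the keywords proposed for several views. The per-image keyword lists below are hypothetical stand-ins for the GPT-4o/GPT-4o-mini vision outputs, not results from our experiments.

```python
from collections import Counter

def aggregate_keywords(per_image_keywords: list[list[str]],
                       min_votes: int = 2) -> list[str]:
    """Keep keywords proposed for at least `min_votes` images, most frequent first."""
    counts = Counter(kw.lower() for kws in per_image_keywords for kw in kws)
    return [kw for kw, c in counts.most_common() if c >= min_votes]

# Hypothetical per-view outputs from the vision agents:
views = [
    ["gothic revival", "limestone", "bell tower"],
    ["gothic revival", "bell tower", "stained glass"],
    ["limestone", "gothic revival", "rose window"],
]
# aggregate_keywords(views) keeps only keywords seen in >= 2 views,
# filtering out single-view spurious detections.
```

The surviving keyword list is then handed to the captioning agent, which rewrites it as a human-readable description.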
Fig. 3: Diagram of the Multi-Agent LLM processing multi-view/multi-scale images.
D. Metrics
Although metrics such as BLEU and CIDEr are commonly used to evaluate captioning performance, they require supervised datasets with ground-truth captions. However, our images lack ground-truth captions. Therefore, we use the CLIP (Contrastive Language–Image Pre-training) score [18]. A CLIP-trained Transformer is used to embed both the caption and the image into a shared image–language latent space. Given the text embedding of the caption, $\mathbf{t}$, and the image embedding of the corresponding image, $\mathbf{i}$, the CLIP score is given by

$$\mathrm{CLIP}(\mathbf{t},\mathbf{i})=\frac{\mathbf{t}\cdot\mathbf{i}}{\|\mathbf{t}\|\,\|\mathbf{i}\|}.$$
We also use perplexity to roughly assess the LLM's image-to-keyword extraction confidence. Perplexity is a measure of the model's confidence in its response, and is given by

$$\mathrm{PPL}=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log{P(x_{i})}\right),$$
where $\log{P(x_{i})}$ is the log-probability of the token $x_{i}$ as assessed by the model during text-token autoregression, and $N$ is the number of tokens in the response.
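Both metrics reduce to a few lines once the embeddings and per-token log-probabilities are in hand (e.g., via the chat API's `logprobs` option); the sketch below shows the computations themselves, with illustrative inputs rather than values from our experiments.

```python
import math

def clip_score(t: list[float], i: list[float]) -> float:
    """Cosine similarity between text embedding t and image embedding i."""
    dot = sum(a * b for a, b in zip(t, i))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in i))
    return dot / norm

def perplexity(logprobs: list[float]) -> float:
    """Exponential of the negative mean token log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))
```

A perplexity near 1 means the model assigned high probability to every token it emitted; larger values indicate lower confidence in the extracted keywords.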
IV. EXPERIMENTS AND DISCUSSIONS
A. Experiments
We chose s