TY - GEN
T1 - Automatic Data Transformation Using Large Language Model - An Experimental Study on Building Energy Data
AU - Sharma, Ankita
AU - Li, Xuanmao
AU - Guan, Hong
AU - Sun, Guoxin
AU - Zhang, Liang
AU - Wang, Lanjun
AU - Wu, Kesheng
AU - Cao, Lei
AU - Zhu, Erkang
AU - Sim, Alexander
AU - Wu, Teresa
AU - Zou, Jia
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation where our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
AB - Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation where our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
KW - ChatGPT
KW - Text2SQL
KW - data transformation
KW - large language model
KW - smart building
UR - http://www.scopus.com/inward/record.url?scp=85184984231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184984231&partnerID=8YFLogxK
U2 - 10.1109/BigData59044.2023.10386931
DO - 10.1109/BigData59044.2023.10386931
M3 - Conference contribution
AN - SCOPUS:85184984231
T3 - Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023
SP - 1824
EP - 1834
BT - Proceedings - 2023 IEEE International Conference on Big Data, BigData 2023
A2 - He, Jingrui
A2 - Palpanas, Themis
A2 - Hu, Xiaohua
A2 - Cuzzocrea, Alfredo
A2 - Dou, Dejing
A2 - Slezak, Dominik
A2 - Wang, Wei
A2 - Gruca, Aleksandra
A2 - Lin, Jerry Chun-Wei
A2 - Agrawal, Rakesh
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Big Data, BigData 2023
Y2 - 15 December 2023 through 18 December 2023
ER -