TY - GEN
T1 - RDF data storage techniques for efficient SPARQL query processing using distributed computation engines
AU - Hassan, Mahmudul
AU - Bansal, Srividya
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/8/2
Y1 - 2018/8/2
N2 - The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP table, based on different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
AB - The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP table, based on different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
KW - Drill
KW - Hadoop
KW - In-memory processing engine
KW - Information reuse
KW - RDF data storage
KW - SPARQL Querying
KW - Semantic web
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85052300159&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052300159&partnerID=8YFLogxK
U2 - 10.1109/IRI.2018.00056
DO - 10.1109/IRI.2018.00056
M3 - Conference contribution
AN - SCOPUS:85052300159
SN - 9781538626597
T3 - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
SP - 323
EP - 330
BT - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018
Y2 - 7 July 2018 through 9 July 2018
ER -