RDF data storage techniques for efficient SPARQL query processing using distributed computation engines

Mahmudul Hassan; Srividya Bansal

doi:10.1109/IRI.2018.00056

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP tables, selected according to different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
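
As a rough illustration of the storage layouts named in the abstract, the sketch below partitions a handful of triples by predicate (vertical partitioning) and then further splits each predicate table by the declared rdf:type of the object. The second step is only one possible reading of the VPExp idea; the function names, the "untyped" bucket, and the sample data are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed prefixed name for the typing predicate


def vertical_partition(triples):
    """Group (subject, predicate, object) triples into one
    two-column (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def split_by_object_type(triples):
    """Hypothetical VPExp-style split: key each table by the predicate
    and the explicitly declared rdf:type of the object, if any."""
    # If a resource declares several types, this sketch keeps the last one seen.
    declared = {s: o for s, p, o in triples if p == RDF_TYPE}
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[(p, declared.get(o, "untyped"))].append((s, o))
    return dict(tables)


# Made-up sample data, not from the paper's benchmarks.
triples = [
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Person"),
    ("acme", "rdf:type", "Company"),
    ("alice", "worksFor", "acme"),
    ("alice", "knows", "bob"),
]

vp = vertical_partition(triples)
vpexp = split_by_object_type(triples)
```

With a layout like this, a query pattern that constrains both the predicate and the object's type could read a single, smaller table instead of joining a predicate table against the rdf:type table, which is the kind of join reduction the abstract describes.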

Original language	English (US)
Title of host publication	Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	323-330
Number of pages	8
ISBN (Print)	9781538626597
DOIs	https://doi.org/10.1109/IRI.2018.00056
State	Published - Aug 2 2018
Event	19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States Duration: Jul 7 2018 → Jul 9 2018

Publication series

Name	Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018

Other

Other	19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018
Country/Territory	United States
City	Salt Lake City
Period	7/7/18 → 7/9/18

Keywords

Drill
Hadoop
In-memory processing engine
Information reuse
RDF data storage
SPARQL Querying
Semantic web
Spark

ASJC Scopus subject areas

Computer Networks and Communications
Software
Artificial Intelligence
Information Systems and Management
Safety, Risk, Reliability and Quality
Public Administration

Access to Document

10.1109/IRI.2018.00056

Cite this

Hassan, M., & Bansal, S. (2018). RDF data storage techniques for efficient SPARQL query processing using distributed computation engines. In Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018 (pp. 323-330). Article 8424727 (Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IRI.2018.00056

@inproceedings{c0346132dd0c4b7a932db688063a4b94,
title = "RDF data storage techniques for efficient SPARQL query processing using distributed computation engines",
abstract = "The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP tables, selected according to different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.",
keywords = "Drill, Hadoop, In-memory processing engine, Information reuse, RDF data storage, SPARQL Querying, Semantic web, Spark",
author = "Mahmudul Hassan and Srividya Bansal",
note = "Publisher Copyright: {\textcopyright} 2018 IEEE.; 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 ; Conference date: 07-07-2018 Through 09-07-2018",
year = "2018",
month = aug,
day = "2",
doi = "10.1109/IRI.2018.00056",
language = "English (US)",
isbn = "9781538626597",
series = "Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "323--330",
booktitle = "Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018",
}

TY - GEN
T1 - RDF data storage techniques for efficient SPARQL query processing using distributed computation engines
AU - Hassan, Mahmudul
AU - Bansal, Srividya
PY - 2018/8/2
Y1 - 2018/8/2
N2 - The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP tables, selected according to different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
AB - The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store comprising a subset of triples from the VP tables, selected according to different join correlations, to reduce the number of join operations when executing SPARQL queries as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
KW - Drill
KW - Hadoop
KW - In-memory processing engine
KW - Information reuse
KW - RDF data storage
KW - SPARQL Querying
KW - Semantic web
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85052300159&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052300159&partnerID=8YFLogxK
U2 - 10.1109/IRI.2018.00056
DO - 10.1109/IRI.2018.00056
M3 - Conference contribution
AN - SCOPUS:85052300159
SN - 9781538626597
T3 - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
SP - 323
EP - 330
BT - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018
Y2 - 7 July 2018 through 9 July 2018
ER -
