CLAST: Clustering Biological Sequences

Vicente Molieri; Lina Karam; Zoé Lacroix

doi:10.1016/B978-0-12-802508-6.00010-7

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Clustering sequences is important in a variety of applications, including development of nonredundant databases, function prediction, and identifying patterns of gene expression. Currently, clustering methods rely on a prealignment as supplementary information to guide the construction of clusters. This chapter introduces a novel algorithm to cluster nucleotide and peptide sequences. The algorithm is a no-reference approach that utilizes only the sequences as input. We also introduce a novel metric that is used to describe the relationship between biological sequences, and serves as the distance measurement for clustering. Results are presented for real biological sequences, comparing the proposed algorithm to other similar tools available.

Original language	English (US)
Title of host publication	Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology
Subtitle of host publication	Algorithms and Software Tools
Publisher	Elsevier Inc.
Pages	203-220
Number of pages	18
ISBN (Electronic)	9780128026465
ISBN (Print)	9780128025086
DOIs	https://doi.org/10.1016/B978-0-12-802508-6.00010-7
State	Published - Aug 7 2015

Keywords

Biological sequences
Clustering
Databases
Graph cuts
Hashing
Nucleotide
Peptide

ASJC Scopus subject areas

General Computer Science

Cite this

@inbook{366fadd058d94539a0d47376db4c5e59,

title = "CLAST: Clustering Biological Sequences",

abstract = "Clustering sequences is important in a variety of applications, including development of nonredundant databases, function prediction, and identifying patterns of gene expression. Currently, clustering methods rely on a prealignment as supplementary information to guide the construction of clusters. This chapter introduces a novel algorithm to cluster nucleotide and peptide sequences. The algorithm is a no-reference approach that utilizes only the sequences as input. We also introduce a novel metric that is used to describe the relationship between biological sequences, and serves as the distance measurement for clustering. Results are presented for real biological sequences, comparing the proposed algorithm to other similar tools available.",

keywords = "Biological sequences, Clustering, Databases, Graph cuts, Hashing, Nucleotide, Peptide",

author = "Vicente Molieri and Lina Karam and Zo{\'e} Lacroix",

note = "Funding Information: This research was partially supported by the National Science Foundation (Grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, and CNS 0849980, and several REU grants). We thank Naji Mounsef for sharing his hash-based approach for sequence assembly and prototype implemented in MATLAB (Mounsef et al., 2008). We thank Christophe Legendre and Ruben Acu{\~n}a for helping to supervise the undergraduate students Matthew Land, Ben J. Piorkowski, and Christopfer Watson, who put the tool to the test. We also acknowledge Louiqa Raschid and Ben Snyder for their contribution to BIPASS. Publisher Copyright: {\textcopyright} 2015 Elsevier Inc. All rights reserved. Copyright: Copyright 2017 Elsevier B.V., All rights reserved.",

year = "2015",

month = aug,

day = "7",

doi = "10.1016/B978-0-12-802508-6.00010-7",

language = "English (US)",

isbn = "9780128025086",

pages = "203--220",

booktitle = "Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology",

publisher = "Elsevier Inc.",

}

TY - CHAP

T1 - CLAST: Clustering Biological Sequences

AU - Molieri, Vicente

AU - Karam, Lina

AU - Lacroix, Zoé

N1 - Funding Information: This research was partially supported by the National Science Foundation (Grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, and CNS 0849980, and several REU grants). We thank Naji Mounsef for sharing his hash-based approach for sequence assembly and prototype implemented in MATLAB (Mounsef et al., 2008). We thank Christophe Legendre and Ruben Acuña for helping to supervise the undergraduate students Matthew Land, Ben J. Piorkowski, and Christopfer Watson, who put the tool to the test. We also acknowledge Louiqa Raschid and Ben Snyder for their contribution to BIPASS. Publisher Copyright: © 2015 Elsevier Inc. All rights reserved. Copyright: Copyright 2017 Elsevier B.V., All rights reserved.

PY - 2015/8/7

Y1 - 2015/8/7

N2 - Clustering sequences is important in a variety of applications, including development of nonredundant databases, function prediction, and identifying patterns of gene expression. Currently, clustering methods rely on a prealignment as supplementary information to guide the construction of clusters. This chapter introduces a novel algorithm to cluster nucleotide and peptide sequences. The algorithm is a no-reference approach that utilizes only the sequences as input. We also introduce a novel metric that is used to describe the relationship between biological sequences, and serves as the distance measurement for clustering. Results are presented for real biological sequences, comparing the proposed algorithm to other similar tools available.

AB - Clustering sequences is important in a variety of applications, including development of nonredundant databases, function prediction, and identifying patterns of gene expression. Currently, clustering methods rely on a prealignment as supplementary information to guide the construction of clusters. This chapter introduces a novel algorithm to cluster nucleotide and peptide sequences. The algorithm is a no-reference approach that utilizes only the sequences as input. We also introduce a novel metric that is used to describe the relationship between biological sequences, and serves as the distance measurement for clustering. Results are presented for real biological sequences, comparing the proposed algorithm to other similar tools available.

KW - Biological sequences

KW - Clustering

KW - Databases

KW - Graph cuts

KW - Hashing

KW - Nucleotide

KW - Peptide

UR - http://www.scopus.com/inward/record.url?scp=84944559449&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84944559449&partnerID=8YFLogxK

U2 - 10.1016/B978-0-12-802508-6.00010-7

DO - 10.1016/B978-0-12-802508-6.00010-7

M3 - Chapter

AN - SCOPUS:84944559449

SN - 9780128025086

SP - 203

EP - 220

BT - Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology

PB - Elsevier Inc.

ER -
