This book addresses the difficulties experienced by wet‐lab researchers with the statistical analysis of molecular biology‐related data. The authors explain how to use R and Bioconductor for the analysis of experimental data in the field of molecular biology. The content is based upon two university courses for bioinformatics and experimental biology students (Biological Data Analysis with R and High-throughput Data Analysis with R). The material is divided into chapters based upon the experimental methods used in the laboratories.
Key features include:
• Broad appeal: the authors target their material at researchers at several levels, ensuring that the basics are always covered.
• First book to explain how to use R and Bioconductor for the analysis of several types of experimental data in the field of molecular biology.
• Focuses on R and Bioconductor, which are widely used for data analysis. One great benefit of R and Bioconductor is the vast and very active user community and the practice of sharing code. Further, R is the platform of choice for implementing new analysis approaches; therefore, novel methods are often available early to R users.
Cover
Title Page
Foreword
Preface
Acknowledgements
About the Companion Website
CHAPTER 1: Introduction to R statistical environment
Why R?
Interacting with R
Packages and package repositories
Working with data
Basic operations in R
Some basics of graphics in R
Getting help in R
Files for practicing
Study exercises and questions
References
Webliography
CHAPTER 2: Simple sequence analysis
Sequence files
Reading sequence files into R
Obtaining sequences from remote databases
Descriptive statistics of nucleotide sequences
Descriptive statistics of proteins
Aligned sequences
Visualization of genes and transcripts in a professional way
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 3: Annotating gene groups
Enrichment analysis: an overview
Overrepresentation analysis
Gene set enrichment analysis
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 4: Next‐generation sequencing: introduction and genomic applications
High‐throughput sequencing background
Storing data in files
General data analysis workflow
Quality checking and screening read sequences
Handling alignment files and genomic variants
Genomic applications: low‐ and medium‐depth sequencing
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 5: Quantitative transcriptomics: qRT‐PCR
Transcriptome
Understanding delta Ct
Absolute quantification
Relative quantification using the ddCt method
Quality control with melting curve
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 6: Advanced transcriptomics: gene expression microarrays
Microarray analysis: probes and samples
Archiving and publishing microarray data
Data preprocessing
Differential gene expression
Creating normalized expression set from Illumina data
Automated data access from GEO
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 7: Next‐generation sequencing in transcriptomics: RNA‐seq experiments
High‐throughput RNA sequencing background
Preparing count tables
Complex experimental arrangements
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 8: Deciphering the regulome: from ChIP to ChIP‐seq
Chromatin immunoprecipitation
ChIP with tiling microarrays
High‐throughput sequencing of ChIP fragments
Analysis of binding site motifs
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 9: Inferring regulatory and other networks from gene expression data
Gene regulatory networks
Reconstruction of co‐expression networks
Gene regulatory network inference focusing of master regulators
Integrated interpretation of genes with GeneAnswers
Files for practicing
Study exercises and questions
References
Packages
CHAPTER 10: Analysis of biological networks
A gentle introduction to networks
Files for storing network information
Important network metrics in biology
Graph visualization
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 11: Proteomics: mass spectrometry
Mass spectrometry and proteomics: why and how?
File formats for MS data
Identification of peptides in the samples
Quantitative proteomics
Getting protein‐specific annotation
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 12: Measuring protein abundance with ELISA
Enzyme‐linked immunosorbent assays
Concentration calculation with a standard curve
Comparative calculations using concentrations
Files for practicing
Study exercises and questions
References
Packages
CHAPTER 13: Flow cytometry: counting and sorting stained cells
Theoretical aspects of flow cytometry
What about data?
Data preprocessing
Cell population identification
Relating cell populations to external variables
Reporting results
Files for practicing
Study exercises and questions
References
Webliography
Packages
Glossary
Index
End User License Agreement
Chapter 03
Table 3.1 Selected rows from the goannot data frame to visualize its content. The entire table contains tens of thousands of rows.
Table 3.2 The most over-represented biological process ontologies associated with the geneList example dataset from the hgu95av2.db package.
Chapter 05
Table 5.1 Known viral load in reference samples and corresponding Ct values.
Table 5.2 Sample coding in the qpcR package for the ratiocalc() function, where the expression level of a housekeeping reference gene (r) and a target gene (g) is measured in a control (c) and two treatment samples (s) at three different time points.
Chapter 06
Table 6.1 Differentially expressed probes in a microarray experiment with corresponding statistics.
Chapter 07
Table 7.1 Read counts of two example genes from the RNAseqData.HNRNPC.bam.chr14 dataset. Individual samples are represented by columns.
Table 7.2 Differentially expressed genes calculated with edgeR.
Chapter 08
Table 8.1 Peak candidates with the highest coverage from the ChIP‐seq data of Eaton's paper “Conserved nucleosome positioning defines replication origins.” The peak with ID 212 in the first row is likely an artifact, with an excess number of reads mapped there.
Table 8.2 Positional weight matrix of a candidate PU.1 binding site in mouse B‐cells identified by a ChIP‐seq experiment.
Chapter 11
Table 11.1 Naming conventions of peptide fragments after fragmenting a protein at a peptide bond.
Chapter 01
Figure 1.1 The plot() function produces a scatter plot when it is called on a column of a data frame. Unless specified differently, the values appear in their order in the data frame itself.
Figure 1.2 Calling plot() on multiple columns of a data frame results in a correlogram among the columns of the data frame.
Figure 1.3 The frequency distribution of 1000 random numbers, plotted with the hist() function.
Chapter 02
Figure 2.1 Dinucleotide under‐ and over‐representation of the titin gene as calculated using different methods offered by the seqinr package. (a) Rho statistics, (b) zero‐centered rho statistics, (c) Z‐score with the “codon” dinucleotide formation model, and (d) Z‐score with the “base” dinucleotide formation model.
Figure 2.2 The effect of the window size, (a) 1000, (b) 5000, (c) 10,000, and (d) 50,000, on the sliding window calculation of the GC content of the titin gene.
Figure 2.3 Location of amino acids with different physicochemical properties in human rhodopsin protein.
Figure 2.4 Hydropathy of human rhodopsin protein calculated by the sliding window approach. Arrows mark the hydrophobic regions corresponding to transmembrane helices.
Figure 2.5 Genomic and transcriptomic features of the human FURIN gene visualized by the GenomeGraphs package.
Chapter 03
Figure 3.1 The relationship of the over‐represented biological process ontologies visualized as a directed acyclic graph.
Figure 3.2 Gene set enrichment analysis plots of the running enrichment score of simulated ranked gene lists where an annotation is associated with (a) the top, (b) the bottom, or (c) the middle of the gene list. Plot (d) shows the score for an annotation term without association. For practical purposes, only cases (a) and (b) are valuable.
Figure 3.3 Representation of the changes in the running enrichment score of cancer pathways for a gene list. The term “cancer pathways” is associated with genes at the top of the list.
Chapter 04
Figure 4.1 An overview of the key steps of Solexa sequencing.
Figure 4.2 The relationship of quality score and the probability of erroneous base identification in a DNA sequencing experiment.
Figure 4.3 Different versions of encoding quality scores as ASCII characters. The + signs mark the ranges used by different companies to encode scores with different algorithms. For example, a quality score of 9 in Illumina pipeline versions 1.3–1.4, which use Phred+64 encoding, corresponds to value 73, which is encoded as the character “I”. The same quality score of 9 corresponds to value 42 in Sanger and current Illumina Phred+33 encoding, which is the character “*”.
Figure 4.4 Visualization of the genomic region of human
BRCA1
gene with alternative transcripts and the reads from the example dataset.
Chapter 05
Figure 5.1 The principles of polymerase chain reaction (PCR).
Figure 5.2 Change of fluorescent intensity during an RT‐PCR reaction.
Figure 5.3 Change of fluorescent intensities of diluted samples with technical parallels.
Figure 5.4 Grayscale version of the efficiency plot of RT‐PCR measurements. Original measurements are represented by circles. The fitted model and its derivative curves are shown to illustrate efficiency and threshold calculations.
Figure 5.5 Ct calculation with a common threshold from diluted samples with technical parallels.
Figure 5.6 Viral load calculation with RT‐PCR. The known viral load values of the reference samples are represented by circles, the mathematical model by a dashed line, and the unknown samples used in the calculations by diamonds.
Figure 5.7 Efficiency estimation using the ratiocalc() function. Three different methods (a–c) are used to determine the confidence interval of the efficiency.
Figure 5.8 Differential gene expression results in a complex experimental arrangement using three time points in two treatment animals as determined by RT‐PCR.
Figure 5.9 Melting curve analysis of a single sample.
Chapter 06
Figure 6.1 The principles of a simple gene expression microarray experiment.
Figure 6.2 Plots to assist quick visual inspection of microarray data. (a) Raw image of the array; (b) distribution of raw intensities in a single sample; (c) distribution of raw intensities in all samples; (d) log intensities in all samples.
Figure 6.3 Comparison of the log intensities in microarray samples before and after normalization.
Figure 6.4 Principal component analysis of gene expression microarrays. Samples are clearly separated by both genotype (left) and stimulation length (right).
Figure 6.5 Grayscale version of gene expression heatmap from a microarray experiment. Arrays (columns) and differentially expressed probes (rows) are clustered so that it is easy to visually detect the association of sample and gene groups.
Chapter 07
Figure 7.1 Correlation of gene‐wise dispersion as calculated with edgeR and DESeq. The dashed line represents equal values.
Figure 7.2 Fold change of gene expression as a function of read counts. Genes with significant differences between knockout and treatment samples are highlighted with stars.
Figure 7.3 Comparison of fold change values and adjusted p‐values calculated with edgeR and DESeq.
Figure 7.4 Heatmap representation of gene expression in different samples. The clustering of samples (in columns) and genes (in rows) is visualized by dendrograms on the top and left, respectively.
Chapter 08
Figure 8.1 Distribution of immunoprecipitation ratios in the investigated samples. Since the measured values are actually the logarithms of the ratios, a value close to zero means equal measured binding in the treatment and control conditions.
Figure 8.2 Binding peaks of the Isw1 protein in yeast investigated by a two‐color tiling array. The plot shows a segment of chromosome 4. Horizontal bars show regions that could be subject to further investigation.
Figure 8.3 Using a sliding window approach to smooth the immunoprecipitation peaks. The arrows show the most prominent peaks, which probably come from single reporters. The smoothed curve, represented by the dashed line, can be carried further in the analysis.
Figure 8.4 Correlation of immunoprecipitation ratios in three independent biological samples. Correlation values are shown on the lower triangle of the sample matrix, while correlograms appear on the upper triangle.
Figure 8.5 Comparison of the binding peaks of Isw1 protein in three independent biological samples.
Figure 8.6 Coverage plot of ChIP‐seq reads representing nucleosome positions on chromosome 14 of yeast.
Figure 8.7 Gray scale version of a sequence logo representing a candidate PU.1 binding site in mouse B‐cells.
Chapter 09
Figure 9.1 Distribution of different metrics describing co‐expression of genes. (a) The Pearson correlation of all possible edges in the co‐expression network; (b) filtered Pearson correlations including only those edges that are significant; and (c–f) non‐rejection rates with orders 1, 3, 5, and 7.
Figure 9.2 Comparing the proportion of highly correlating edges in co‐expression networks using the Pearson correlation and non‐rejection rates with orders 1, 3, and 5.
Figure 9.3 Graph density as calculated using non‐rejection rates of different orders. The lower the threshold value, the stricter the selection for valid edges, and the lower the density of the remaining network. Observe that the higher the order used for calculating non‐rejection rates, the fewer edges pass any given threshold, resulting in lower‐density networks.
Figure 9.4 The largest connected components of co‐expression networks based on the Pearson correlation coefficient (a) and third‐order non‐rejection rate (b) of gene expression. Only components with at least four genes are shown.
Figure 9.5 First‐order neighbors of a gene in the co‐expression network calculated from the Pearson correlation coefficient (a) and third‐order non‐rejection rate (b) of gene expression (see Figure 9.4).
Figure 9.6 Distribution of the raw expression values (a) and their logarithm (b) of selected transcription factors in a microarray dataset.
Figure 9.7 Master regulators related to the transcriptional network of selected transcription factors. Nodes in the networks are identified either by microarray probe IDs (a) or by the corresponding gene symbols (b).
Figure 9.8 Visual aids for interpreting gene expression data using the GeneAnswers package. Pie chart of enriched KEGG pathways (a), and concept networks connecting dominant pathways and corresponding genes, using p‐values (b) or gene numbers (c) to highlight the most important pathways.
Figure 9.9 Heatmap representation of gene expression data and genes differentially expressed in enriched pathways.
Chapter 10
Figure 10.1 Most important concepts of networks. Nodes represent entities, while connections called “edges” between them represent their relationships. Edges can be directed (arrows) or not (lines). Weights or other values can be associated with edges, which can be represented by edge labels, or the thickness of the edges.
Figure 10.2 Representing the protein interactions of small GTPases by an undirected network.
Figure 10.3 Gene regulatory network of floral development in Arabidopsis thaliana.
Figure 10.4 The Arabidopsis gene regulatory network with activating and repressing gene interactions visualized by arrow color and arrow labels.
Figure 10.5 Random networks of 200 nodes generated using different algorithms.
Figure 10.6 Predicted interactions in the human proteome. The 500 interactions with the highest prediction scores are shown (a). The largest connected component (b) contains half of the proteins and 65% of all predicted interactions.
Figure 10.7 Six different networks used in demonstrating the calculation of network metrics. The three networks on the top are inferred from biological data, while on the bottom there are three random networks generated for modeling purposes.
Figure 10.8 Global efficiency of the biological and model networks in Figure 10.7.
Figure 10.9 Degree distribution of the biological and model networks in Figure 10.7.
Figure 10.10 Vulnerability of nodes in the Arabidopsis gene regulatory network. The darkness of a node correlates with its vulnerability measure, with LFY being the most vulnerable node.
Figure 10.11 Community structure of the human protein interaction network. (a) Nodes with a clustering coefficient greater than 0.6 are colored black. Communities are identified by a fast greedy algorithm (b), random walks (c), and simulated annealing with a spin‐glass model (d).
Figure 10.12 Values associated with vertices, visualized by the size of the nodes.
Figure 10.13 Values associated with vertices, visualized by the line type of the edges.
Figure 10.14 Comparison of different layout algorithms to visualize the human protein interaction network. (a) Random, (b) circular layouts, and layouts optimized by the Kamada–Kawai (c) and Reingold–Tilford (d) algorithms.
Chapter 11
Figure 11.1 Peaks of a single scan in a mass spectrometry measurement.
Figure 11.2 Fragmentation of a protein at a peptide bond.
Figure 11.3 Noise filtering in an MS/MS spectrum. The majority of the peaks in the original spectrum (a) originate from random fluctuations in the detector, not from molecules in the sample. The best peaks (b) represent real data, and they can be used in subsequent calculations.
Figure 11.4 Identification of amino acids from the mass differences of peaks. Consecutive segments represent a “b ladder,” such as the segments of the VYK peptide.
Figure 11.5 The distribution of peptide abundance in six samples from a tandem mass spectrometry experiment. (a) Raw peptide mass and (b) normalization by the sum of the masses. The variance of the distributions can be controlled either on the raw data (c), or the sum normalized data (d).
Figure 11.6 Abundance of selected proteins in six samples from a tandem mass spectrometry experiment.
Figure 11.7 Heatmap showing the similarities between individual samples (columns) and proteins (rows) based on protein abundances originating from a tandem mass spectrometry experiment.
Chapter 12
Figure 12.1 The principles of direct (a) and sandwich (b) enzyme‐linked immunosorbent assays.
Figure 12.2 Well‐to‐well variance of optical density in a typical ELISA measurement. Grayscale version, as the observed colors are usually different shades of orange.
Figure 12.3 Linear model as standard curve in ELISA experiments. Though a linear model approximates the measured optical densities closely (a), the residuals are not random (b), indicating a systematic bias in the analysis. Samples with suspiciously large residuals are marked by sample IDs on the plot.
Figure 12.4 Comparing the performance of linear and four‐parameter logistic models in an ELISA experiment.
Figure 12.5 Changes of IL‐2 protein production of T‐cells measured by ELISA.
Figure 12.6 Comparing the effect of different treatments on the IL‐2 protein production of T‐cells in Furin knockout and wild‐type mice.
Chapter 13
Figure 13.1 Excitation and emission spectra of green fluorescent protein.
Figure 13.2 Excitation and emission spectra of green fluorescent protein (a) and R‐phycoerythrin (b). Hatched boxes show the wavelengths of respective detectors. The gray area shows where the excitation spectrum of GFP “spills” into the detector of R‐PE.
Figure 13.3 Flow cytometry analysis of human peripheral blood cells. The FSC–SSC plot shows the most important white blood cell populations targeted in numerous studies.
Figure 13.4 Marginal events on an FSC‐SSC plot. 3248 detected cells accumulated at the maximal sensitivity of the FSC and the SSC detectors. Those are highlighted by circles on the top and right margin of the plot.
Figure 13.5 A subset of samples from a workflow object visualized using the xyplot() function after compensation.
Figure 13.6 Elliptical gates applied on the FSC–SSC plot select living leukocytes in Cre‐infected blood samples.
Figure 13.7 Polygonal gates applied on the FSC–SSC plot select living leukocytes in Cre‐infected blood samples. Polygon gates make it possible to select cell populations more precisely.
Figure 13.8 The proportion of IL‐2‐expressing cells among Cre‐ and Migr‐infected samples.
Csaba Ortutay
Zsuzsanna Ortutay
Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication data applied for:
ISBN: 9781119165026
Cover image: KEVIN CURTIS/Gettyimages
To our children: Botond, Balázs, Virág, and Luca
Do you need to find out the statistical significance of observations? Are you looking for a suitable statistical test for gene expression analysis? Or are you just struggling with making sense of a set of enzyme‐linked immunosorbent assay (ELISA) experiments? This book, written by a husband‐and‐wife team, is a very important and useful one. It is about data analysis with the widely used R software environment.
When I started my career as a young bioinformatician at the dawn of the field, there were not many books available. For many years, I had a small, indeed very small, library of all the published books in bioinformatics. Now there are so many books that one starts to doubt whether there is a need for all the new ones. However, that does not apply to this book. As probably the first dedicated professor in bioinformatics in the Nordic countries, I was constantly looking for new textbooks to use in our research group as well as for teaching. I just wish that this volume had been available a long time ago. It would have helped on numerous occasions and saved time.
Csaba Ortutay came to my group in Finland as a postdoctoral fellow in 2004 to work for 2 years. In the end, he stayed for many years. During that time, I got to know the couple and their children. Csaba turned out to be a resourceful, meticulous, and innovative scientist who participated actively in the work of the entire group. I once gave him a side project, which turned out to be his major project for many years. I thought that it would take him several months to get the work done; instead, he came back with results in a few weeks. We investigated the immunome, the entirety of genes and proteins essential for the immune system. After I left for my current position, Csaba took care of my position for a while, including the International Master's Degree Programme in Bioinformatics. Csaba and Zsuzsanna complement each other in many ways, which is visible in the contents of the book.
One characteristic of Csaba is evident in this text: he is a terrific teacher. He understands the problems of newcomers, whether they are students or scientists in need of new knowledge and methods. In addition, the couple together brings the text and examples to a practical level, so that those who work primarily in the wet lab rather than the dry lab can easily grasp the essentials of the methods and apply them to their own problems.
Science is increasingly done in big multidisciplinary teams, which is great, as it makes many difficult research areas tractable. The downside is that many tasks are delegated to experts, and this often applies to statistics. As a result, other scientists may find it difficult to test their ideas, since they are not familiar with the principles and practice of statistics. This book provides the necessary information and confidence to try one's own hand at data analysis. By being able to investigate the datasets oneself, it becomes possible to address new questions and maybe even find some completely new correlations and properties. The book covers a wide array of research topics dependent on statistics, ranging from sequence analysis and enrichment analysis to next‐generation sequencing (NGS) data analysis, from gene and protein expression to the networks they form, and from immunoassays to cell sorting. In none of these areas is it possible any more to get studies published without professionally made, detailed, and thorough statistical analysis. Statistics provides the toolbox to tease out the significant findings from large, and often not so large, datasets.
To find answers to those questions, turn the page.
Mauno Vihinen
Professor, Lund University, Sweden
During the time when I was responsible for directing the International Master's Degree Programme in Bioinformatics at the University of Tampere, Finland, I actively monitored the trends of the relevant job markets to see which new subjects would be beneficial to our students. Recently, the rapid spread of high‐throughput analytic methods in molecular biology has initiated new trends in the bioinformatics scene, which earlier focused mostly on sequences, databases, protein structures, and the related algorithms. The term “big data,” already well known in computer science and business analytics, started to spread in the context of genomics, medical informatics, and other molecular biology‐related fields.
Since the amount of data coming from these new experiments is several orders of magnitude larger than what was handled earlier, completely new approaches are often needed: data have to be collected, stored, managed, and analyzed. During the past few years, this field developed dynamically, and several analysis tools and approaches were introduced, especially around NGS and the derived technologies. On the data analysis and visualization front, R and Bioconductor established a solid market share. Research groups and industrial employers started to seek people who have experience with this language, driving the value of R skills up.
R and Bioconductor libraries develop very quickly, and they become more and more complex. It is not a trivial task for self‐learners to build these skills on their own. Though more and more online tutorials and how‐tos are available on different forums and developer sites, they are often obsolete or too narrowly targeted. It is increasingly problematic to piece together a complete analysis from the material available online, especially for someone who is a novice in the world of R or molecular data analysis.
R itself has a relatively flat learning curve, meaning that a lot of effort must be invested before “production‐ready” proficiency is reached. The complexity of molecular biology‐specific algorithms, often unfamiliar to those from outside the field, only adds to these challenges. The net result is that it is increasingly hard for new learners to enter the field at a time when demand is growing dynamically.
Motivated by this situation, I decided to introduce a course series in our program in Finland, teaching our students the basics of R using molecular biology‐related case studies, often based on real‐life laboratory data. Later, the first course was followed by another one, extending the foundations to high‐throughput applications with a strong focus on NGS data.
While these courses were still under development, I got more and more inquiries from wet‐lab researchers working in molecular biology laboratories, among others from my wife and her colleagues, about how to analyze diverse datasets in R, which was becoming more and more popular among them too. This was the point at which we started to think about how to make these courses available to the public. Today, we deliver our R courses to universities in three European countries, and to many individuals via our e‐learning services.
This book offers the content of our courses arranged by molecular biology topic. This structure serves readers who already have some prior experience with the R language as well as those who are new to this segment of biodata analysis.
My recommendation for those who wish to learn R without much prior experience is to start with the chapters covering methods producing smaller amounts of data, such as Simple sequence analysis (Chapter 2), Annotating gene groups (Chapter 3), Quantitative transcriptomics: qRT‐PCR (Chapter 5), and Measuring protein abundance with ELISA (Chapter 12). In these chapters, the readers can practice the most frequently used approaches in R, and they can become ready for handling larger datasets, such as those from microarray‐ or NGS‐based experiments.
All the chapters begin by providing the necessary theoretical background about the experimental method at hand. We aimed at distilling the available large volumes of theory down to the bare essentials that are still needed to interpret the analyzed data correctly. In my opinion, one pitfall of handing the analysis of molecular biology data over to information science experts is that they often neglect the constraints coming from the experimental setups themselves.
Then, detailed practices are explained with code examples, together with help in interpreting the produced results. Ideally, the reader should be able to see the complete workflow of an analysis type, from the raw data to the presentation of the results. In our experience, this structure offers learners the most versatile benefits.
The code examples and the data files used in this book are available from GitHub, together with the scripts used to generate the plots on these pages. Since the libraries and best practices in use are developed continuously, the online material is continuously updated to comply with these changes:
https://github.com/csortu/MDAuR
Finally, I am convinced that the readers of this book will learn skills from these chapters that will help them advance their research careers, equip them with competitive skills on the job market, and, in general, advance the molecular biology field by teaching people about these tools.
Csaba Ortutay, PhD
CEO, HiDucator Ltd.
Adjunct Professor of Bioinformatics, University of Tampere, Finland
This book is the result of scientific discussions over the dinner table. As a wet‐lab researcher, I face several difficulties in planning, carrying out, and, last but not least, analyzing scientifically significant experiments. Like any couple, we discuss everyday happenings during family dinner, such as what is going on in the workplace, which in my case means the lab. I feel lucky that I can get advice when I feel stuck with my analyses, and that I have somewhere to turn when the results just do not look like I thought they should. Are the results I am waiting for really among the data? Several times we had experimental data that I was not sure how to interpret. How do I generate publication‐ready figures from those microarray data? Should I first calculate the average and then fit a curve to my data points, or vice versa? What software should I use to determine whether there is a significant difference between the Ct values originating from a quantitative real‐time polymerase chain reaction (qRT‐PCR) experiment comparing treated and control cells? Csaba's answer is always the same: Why don't you use R?

So, I started to get acquainted with R. It was strange and frightening at the beginning, but luckily I had a very good and friendly teacher, a teacher from the family. I asked and asked and asked, and he answered and answered and answered. I even started to ask questions on behalf of my colleagues, and so the idea of this book was born. Why not help others like me? I collected the different types of experiments going on in our lab, and we went through each analysis process step by step together. I asked the questions that a typical wet‐lab researcher would ask, and Csaba supplemented the list of questions from the computer scientists' and bioinformatics students' points of view. Now, at the end of the answering process and the editing of the text, I feel satisfied and proud. I have learned how to use R for computing statistics, plotting results, and generating figures. It is still not easy for me, but now I have this book to look to for solutions and examples when I am stuck.

I recommend this book to those who want to understand the hows and whys of the data analysis process; to wet‐lab researchers who wish to analyze their experimental data themselves and do not want to wait for data analysts; and to students, whether from the field of molecular biology or bioinformatics, who will use R in their careers. I also recommend this book to computational scientists who get experimental results to analyze but have no clue what the experiment is all about. I recommend this book to you, dear reader. I hope you will enjoy it and get useful tips and solutions for your problems. Feel free to experiment with the provided data; try not only the demonstrated solutions, but also write commands yourself. Make mistakes, since those are what you can learn from the most! Look for new commands and try out what you have just learned on your own data! And most of all, enjoy your journey in the world of data analysis!
Zsuzsanna Ortutay, PhD
Postdoctoral Researcher, University of Tampere, Finland
Before we started to assemble the material of this book, we were sure that the process would require a considerable amount of work. Without the assistance of a large number of friends, colleagues, and helpful people, this journey would have been much harder for us, perhaps even fruitless. We would like to thank everyone who supported our efforts with ideas, by reading our manuscript, or by encouraging us along this road.
While writing, we consulted people who have first‐hand experience of the areas in question, and who were kind enough to offer their comments on our text. Dr Martti Tolvanen from the University of Turku, Finland, helped us with Chapter 3; Ms Milena Doroszko from the University of Turku, with Chapter 5; and Dr Laura Pöyhönen with Chapter 13. Dr Anna Grönholm and Harlan Barker also assisted us with their comments on multiple parts and chapters.
The practical parts of this book could not be complete without appropriate datasets. While it is easy to obtain data for some methods from public databases, it is virtually impossible to get raw data for others. The example datasets for Chapters 5, 12, and 13 originate from Zsuzsanna's laboratory. We are grateful to her supervisors, Adjunct Professor Marko Pesu and Dr Ilkka Junttila from the University of Tampere, Finland, for giving us permission to use the raw data from their published and even unpublished work.
Csaba's former principal investigator from his postdoc years, Professor Mauno Vihinen from Lund University, Sweden, helped us to find and contact our publisher. He also offered his help by writing the Foreword to our book. Thank you, Mauno, for all of your support, without which this book could not have been written.
Last, but not least, we are grateful to editor Mindy Okura‐Marszycki, who offered her helpful comments on practical issues. As we are inexperienced in the world of book publishing, these thoughts were invaluable in assisting our book‐writing effort.
Don’t forget to visit the companion website for this book:
www.wiley.com/go/ortutay/molecular_data_analysis_r
There you will find valuable material designed to enhance your learning, including:
Data
Figures
Scripts
If you work in the field of biodata analysis, or if you are interested in getting a bioinformatics job, you can find a large number of related job advertisements targeting young professionals. One common requirement keeps coming back in those ads: they demand “a high degree of familiarity with R/Bioconductor.” (Here, I am quoting an actual recent ad from Monster.com.)
Besides, when we have to create and analyze large amounts of data during our bio‐researcher careers, sooner or later we realize that simple approaches using spreadsheets (such as the Excel part of MS Office) are no longer flexible enough to fulfill the needs of our projects. In these situations, we start to look for dedicated statistical software tools, and soon we encounter the countless alternatives from which we can choose. The R statistical environment is one of these possibilities.
With the exponential spread of high‐throughput experimental methods, including microarray‐ and next‐generation sequencing (NGS)-based experiments, skills related to the large‐scale analysis of data from biological experiments are becoming ever more valuable. R and Bioconductor offer a free and flexible tool‐set for these types of analyses; therefore, many research groups and companies select them as their data analysis platform.
R is an open‐source software licensed under the GNU General Public License (GPL). This has an advantage that you can install R for free on your desktop computer, regardless of whether you use Windows, Mac OS X, or a Linux distribution.
Introducing all the features of R thoroughly at a general level exceeds the scope and purpose of this book, which focuses on molecular biology‐specific applications. For those who are interested in a deeper introduction to R itself, we suggest the book R for Beginners by Emmanuel Paradis as a reference guide. It is an excellent general guide, which can be found online (Paradis 2005). In this book, we use more biology‐oriented examples to illustrate the most important topics. The other recommended book for this chapter is R in a Nutshell by Joseph Adler (2012).
The first task in analyzing data with R is to install R on the computer. There is a nice discussion on bioinformatics blogs about why people so seldom use the knowledge acquired on short bioinformatics courses. One of the main conclusions is that the greatest challenge is often simply installing the software in question.
There is plenty of information available on the web about how to install R, but the most authentic source is the website of the R project itself. On this page, the official documentation, installers, and other related links from the developers of R themselves are collected. The first step is to navigate to the download section of the page and find the mirror page closest to the location of the user.
However, there are some differences in the installation process depending on the operating system of the computer in use. Windows users should find the Windows installer for their system on the download pages. Be sure to pick the base installer, not the contributed libraries. In the case of a Linux distribution, R can be installed via the package manager. Several Linux distributions provide R (and many R libraries) as part of their repositories. This way, the package manager can take care of the updates. Mac OS X users and Apple fans can find the pkg file containing the R framework, the 64‐bit graphical user interface (GUI) (R.app), and the Tcl/Tk 8.6.0 X11 libraries for installing the R base system on their computer. Brave users of other UNIX systems (i.e., FreeBSD or OpenBSD) can use R, but they have to compile it from source; this is not a beginner topic. On a computer owned by a company, university, or library, the installation of R (just like that of many other programs) most often requires superuser rights.
The interface of R is somewhat different from other software used for statistics, such as SPSS, S‐plus, Prism, or MS Excel (which is not a statistical software tool!). There are neither icons nor sophisticated menus to perform analyses. Instead, commands should be typed in the appropriate place of R called the “command prompt”. It is marked with >. In this book, the commands for typing into the prompt are marked by fixed‐width (monospaced) fonts:
> citation()
After typing in a command (and hitting Enter), the results turn up either under the command or, in case of graphics, in a separate window. If the result of a command is nothing, the string NULL appears as a result. Mistyping or making an error in the parameters of a command leads to an error message with some information about what was wrong.
> c()
NULL
> a * 5
Error: object 'a' not found
From now on, we will omit the > prompt character from the code samples so that you can simply copy and paste the commands. Leaving R is done with the quit() function or its short form q():
quit(save='no')
q()
A command‐line interface is enough for performing the practices. However, some prefer to have a GUI. There are multiple choices depending on the operating system in use. The Windows and Mac versions of R start with a very simple GUI, while the Linux/UNIX versions start only with a command‐line interface. The Java GUI for R is available for any platform capable of running Java, and it sports simple, functional menus to perform the most basic tasks related to an analysis (Helbig, Urbanek, and Fellows 2013).
For a more advanced GUI, one can experiment with RStudio or R Commander (Fox 2005). There are several plugins to integrate R into the best coding production tools, such as Emacs (with the Emacs Speaks Statistics add‐on), Eclipse (by StatET for R), and many others.
Doing data analysis in R means typing in commands and experimenting with parameters suitable for the given set of data. At a later stage, the procedure is often repeated, either on the same data with slight modifications in the course of the analysis, or on different data with the same analysis. For example, the analyzed data are submitted for publication, but the manuscript reviewers request slight modifications to the analysis. This means repeating almost the entire process, but with parameter x set to 0.6 instead of the 0.5 used earlier.
Scripts are used to record the steps of an analysis. Scripts are small text files containing the commands of the analysis one after the other, in the same order as they are issued during the data processing. Traditionally, we use the “.R” extension (instead of .txt) for these text files to mark that they are R script files. Script files are the solution for
archiving an analysis, and
automating tasks that take a long time to run.
Script files can easily be included in an analysis workflow by “sourcing” them (the term is borrowed from other scripting languages) with the source() command. For example, let's have the following script file, my_first_script.R:
a<-rep(5,5)   # a vector containing the number 5 five times
b<-rnorm(5)   # five random numbers from the standard normal distribution
print(a)
print(b)
print(a*b)    # the element-wise product of the two vectors
Scripts can be created using any text editor (e.g., gedit, mcedit, Notepad) but not with word processor software (e.g., MS Word, LibreOffice Writer, or iWork Pages), unless it can save the script as a plain text file rather than a .doc, .docx, .odt, or any other more complex format. R must then be pointed to the location of the saved file:
setwd('/home/path/to/your/files') #on Linux/UNIX
setwd('/Users/User Name/Documents/FOLDER') #on Mac
setwd('c:/path/to/my/directory/') #on Windows
The working directory can be checked using the getwd() command.
Loading the script file in the working directory is simple:
source('my_first_script.R')
If the script is somewhere else, the full path is required:
source('/path/to/my_first_script.R')
When R is started for the first time, it creates two files to register what was done: the history file and the environment file. If R was started from the command line, these files are saved in the directory where R was started. Launching R from an icon results in the history and the environment file being saved to a default place.
The history file is a text file that saves all the commands issued in a session with R, while the environment file holds the data used during the session. It is worth saving these files for further use with the savehistory(file = "/path/to/Rhistory") and save.image(file = "/path/to/RData") commands. When exiting R with the q() command, it asks whether you want to save these to the default places. Choosing this option means that the next time R is started from the same directory, it will remember the past work and data.
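As a minimal sketch (the file names are illustrative), saving a session explicitly and restoring it later might look like this:
savehistory(file = "my_analysis.Rhistory")  # save all commands issued so far
save.image(file = "my_analysis.RData")      # save all data objects in memory
# ... then, in a later R session ...
loadhistory(file = "my_analysis.Rhistory")  # restore the command history
load("my_analysis.RData")                   # restore the saved data objects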
Statistics is a huge field, and many disciplines use it for their specific purposes. All of them have different needs, flavors, and data types designed specifically for those needs. It would be meaningless and hopeless to put everything into a single piece of software. Honestly, the majority of the code would never be used: a bioinformatician rarely needs statistics designed for particle physics, just as a computational chemist rarely reads in data from gene expression microarrays.
To address this problem, the R developers decided to provide only a common framework and some basic functionality as part of the base installation, while subject‐specific elements are organized into bundles of code called “packages.” In reality, the base installation of R is not very useful for molecular data analysis. The good news is that suitable packages can be found for most of the commonly applied analysis types.
R packages are collected into so‐called package repositories on the web. These sites are dedicated to the maintenance and distribution of the packages; the concept is probably familiar to Linux users. R uses its own internal package management system to find, install, and update packages. There are two important package repositories, both of which are used in this book: the Comprehensive R Archive Network (CRAN) and Bioconductor.
CRAN (R Core Team 2015) is a place for general‐purpose packages, but many biology‐related packages can be found here too. One can search for packages related to a topic of interest (left side of the page, Software/Packages/Table of available packages, sorted by name) by keyword. For example, if packages related to biological sequences are required, searching (Ctrl+F) for the keyword “biological sequence” on this page will quickly find them.
Here, we introduce the sequences package (Gatto and Stojnic 2014). Clicking on the name of the package leads to a general information page. The most relevant documents here are the Vignettes (if they are available), providing a quick introduction to the package, and the reference manual that shows an extensive explanation for all the commands and datasets provided by the package.
Installing and managing CRAN packages is best done within R itself. Most GUIs provide some assistance for package management in the “Packages” menu. It is simple to install packages using the install.packages() command. Downloading the packages and their dependencies requires Internet access. The installation process can take a long time if the selected package depends on many other packages.
install.packages("sequences")
On Linux, install.packages() works properly if it is issued in an R session run by root, or if a user‐writable library is specified as the package directory to write to:
install.packages("sequences", lib="/home/mydir/Rpackages/")
The full list of available packages can be checked using available.packages(). This command lists all the packages compatible with the R version and operating system of the computer in use, which often means many thousands of packages.
ap<-available.packages()
row.names(ap)
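For a quick first look, the returned matrix can also be queried directly (a small sketch; the search pattern is arbitrary):
nrow(ap)                               # the total number of available packages
grep("seq", rownames(ap), value=TRUE)  # package names containing "seq"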
Loading a successfully installed package (e.g., the sequences package in the previous example) is done using the library() command (without quotation marks around the package name this time).
library(sequences)
There is another R package repository dedicated mostly to the analysis of high‐throughput data from molecular biology, called “Bioconductor” (Gentleman et al. 2004). It contains more than 1500 packages dedicated to this exciting field of bioinformatics. The packages are divided into three groups:
Software—This section contains the most interesting packages, which assist with different kinds of analyses. This sub‐repository is roughly analogous to CRAN in the sense that the packages here provide the statistical methods and procedures, such as microarray normalization functions or enrichment analysis approaches.
AnnotationData—Here is a collection of very important supporting information concerning genome, microarray platform, and database annotation. These packages are useful mostly as input data for other packages in the Software section.
ExperimentData—Prepared experimental data are available from here for further analysis. It is a good idea to test a new statistical method or analysis approach on data from here first. This ensures that the code in use is compatible with the rest of the Bioconductor framework.
The packages are listed in a logical and hierarchical system, and it is relatively easy to find relevant packages for a certain type of analysis. For example, if mass spectrometry is in the focus of interest, the relevant packages can be found in the Software ‐> Assay Technologies ‐> Mass Spectrometry branch of the hierarchy, while for inferring networks from experimental data, the Software ‐> Bioinformatics ‐> Networks ‐> Network Inference branch should be checked. The vignette and the reference manual appear on the dedicated page of the chosen package in a similar way as in CRAN.
There is another, perhaps even more practical, way to find suitable packages from Bioconductor. There are complete recipes for more popular data analysis tasks in the Workflows section of the Bioconductor page Help menu, which not only shows the needed packages but also demonstrates how to use them.
Bioconductor uses its own package management system that works somewhat differently than the stock R system. It is based on a script titled biocLite.R, which can be sourced directly from the Internet:
source("
http://bioconductor.org/biocLite.R
")
This script contains everything needed for managing Bioconductor packages. For example, to install the affy package (Gautier et al. 2004), the biocLite() command should be called:
source("
http://bioconductor.org/biocLite.R
")
biocLite("affy")
This command processes the dependencies and installs everything that is needed. The annotation and experimental data packages tend to be huge, so a high‐speed Internet connection (or a lot of patience) and a sufficient amount of disk space are needed to install them.
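Since biocLite() accepts a vector of package names, several packages can also be installed in one step; a quick sketch (the package choices are illustrative):
biocLite(c("affy", "limma"))  # install both packages and all their dependencies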
Loading of the installed packages happens in the same way as with CRAN packages:
library(affy)
For a data analysis project, well, data are needed. How to load data into R is a crucial question, and often the second biggest challenge for a newbie bioinformatician. Several R tutorials start explaining this topic by introducing the c(), edit(), and fix() commands. These are commands and functions used to type in numbers and information in a tabular format. They are also the commands that are rarely used in a real‐life project. The cause of this is simple: no one types in the gene expression values of 40,000 gene probes for a few dozen samples.
Most often, data are loaded from files. Files may come from databases, from measurement instruments, or from other software. Often, data tables are assembled in MS Excel. MS Excel and other spreadsheet software can also export data tables as .csv files, which are easy to load into R. Depending on the operating system in use and the exact installation of R, there are multiple possibilities for reading .xls files. The package gdata (Warnes et al. 2015) contains the read.xls() command, which can access the content of both .xls and .xlsx files:
library(gdata)
my.data<-read.xls("data_file.xlsx", sheet=1)
This code reads a table from the first sheet of the .xlsx file into the my.data data frame. It is an excellent tool, but it requires the installation of Perl (a scripting language) on the computer. In Linux/UNIX installations this is not a problem, but in Windows environments it is not easy to solve. A universal solution to this problem is to read data from exported .csv files. This approach works on all platforms, and it does not require the installation of additional packages:
my.data<-read.csv("fdata_file.csv",sep="\t",row.names=1)
The first step is to prepare a data table using Excel or another spreadsheet program. The data are then exported to a .csv file, called a “tabular text file” or “comma‐separated text file” in different programs. It is important to specify a tab as the field separator (sep="\t") in the settings instead of a comma, which is usually the default field separator for .csv files.
To handle the problem of transferring data from MS Excel to R, a new package called readxl has been released recently (Wickham 2015). The ultimate goal of this package is to provide accessibility to data saved in Excel files without further dependencies, and in an operating system‐independent way.
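A minimal sketch of reading the same spreadsheet with readxl (the file name is illustrative):
library(readxl)
my.data<-read_excel("data_file.xlsx", sheet=1)  # no Perl installation needed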
There are many proprietary file formats produced by different instruments. Dedicated packages have been developed to read their content and load it into the proper data structures in R for further analysis. For example, the ReadAffy() command from the affy package is designed to import Affymetrix GeneChip CEL files. Similarly, the read.fasta() command of the seqinr package (Charif and Lobry 2007) or the readFASTA() command of the Biostrings package (Pages et al. 2015) can import FASTA formatted sequence files.
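For instance, reading a FASTA file with seqinr might look like this (the file name is illustrative):
library(seqinr)
seqs<-read.fasta(file="example.fasta")  # a list of sequences
length(seqs)                            # the number of sequences in the file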
This book has a dedicated support webpage. Here, all the R scripts and data are available to do all the practices discussed in the following chapters. As bonus material, the scripts used for generating the figures on these pages are also available from the same place.
Save the file furin_data.csv from the webpage of the book, and open it with a text editor. You can see the rows and columns of the data in the file. The first step now is to set the exact path to the location of the file in the furin.file variable, and to use the read.csv() command to read its content into the my.data variable. Checking the structure of the data is done with the str() command.
furin.file<- '/path/to/your/file/furin_data.csv'
my.data<-read.csv(furin.file,sep="\t")
str(my.data)
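Two further base R commands give a quick first impression of a freshly loaded table:
head(my.data)     # the first six rows of the data frame
summary(my.data)  # basic descriptive statistics for each column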
All scripting languages provide simple ways to perform basic computational operations on data, and R is no different in that sense. Certainly, the most basic things like arithmetic operations work as expected. For example, adding and multiplying numbers works the same way as in math class:
4 + 7
6 * 2
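The other basic arithmetic operators behave just as predictably:
10 / 4    # division
2 ^ 10    # exponentiation
10 %% 3   # modulo (the remainder of a division)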
Of course, R is not the most suitable choice if only a calculator is needed. R is used to store numbers, information, and data, and also to perform different tricks and calculations on them. For those who know one programming language or another, it is clear that variables should be used. For the sake of those who are not familiar with these issues: variables are similar to labeled “shoe‐boxes” containing data items. During an analysis, the data items can be stored in these “shoe‐boxes” instead of being read from a file for each operation. The arrow mark (or assignment operator) is used for loading any data, for example a number, into these variables.
my.data <- 5
Now the number 5 is loaded into the my.data variable. The direction in which the arrow points tells the story. For example, the result of an operation can be stored in a variable, and later on the data inside the variable can be the subject of further operations.
my.data <- 5 + 3
my.other.data <- my.data * 2
Typing the name of the variable will show what is inside it:
>my.data
8
>my.other.data
16
Variables in R can store a great many different kinds of things: numbers, lists of numbers, strings, sequences, data tables, data matrices, entire genomes, or multiple sequence alignments. Several operations have different meanings depending on the kind of data they are applied to. R is smart enough to figure out whether a command has a different version specifically fit for a particular data type.
a <- 5
a + 3
b <- c(5,6,7)
b + 3
In the previous example, there are two very different variables: a and b. Variable a holds a single number (5), while variable b holds a vector of three numbers. When the addition (+) operator is applied to a single number (variable a), it adds 3 to that single number; when it is applied to a vector (b), it adds 3 to every number in the vector. This distinction is crucial, as the result of the first operation is a single value, while the result of the second one is a vector itself.
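This difference is easy to verify by checking the length of the results:
length(a + 3)  # 1 -- a single value
length(b + 3)  # 3 -- still a vector of three numbers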
R has this kind of smart redundancy, which is especially handy with the plot() command. There are several data types that are represented in graphs and figures, very often generated by the plot() command. Specific data types have their specific plots, and R packages are well prepared to draw different plots for them. Using the furin_data.csv file again as an example, different graphs can be generated by plotting one column of the data table (Figure 1.1), or all of them to check their correlation (Figure 1.2):
furin.file<- '/path/to/your/file/furin_data.csv'
my.data<-read.csv(furin.file,sep="\t")
plot(my.data$Naive.KO.1)
plot(my.data)
Figure 1.1 The plot() function produces a scatter plot when it is called on a column of a data frame. Unless specified differently, the values appear in their order in the data frame itself.
Figure 1.2 Calling plot() on multiple columns of a data frame results in a correlogram among the columns of the data frame.
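As a minimal sketch (the axis labels and title are illustrative), the appearance of such a plot can be tuned through additional arguments of plot():
plot(my.data$Naive.KO.1,
     xlab = "Measurement index",   # label of the x axis
     ylab = "Expression value",    # label of the y axis
     main = "Naive KO sample 1",   # plot title
     pch  = 19)                    # use filled circles as plotting symbols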
