This book addresses the difficulties experienced by wet‐lab researchers with the statistical analysis of molecular biology‐related data. The authors explain how to use R and Bioconductor for the analysis of experimental data in the field of molecular biology. The content is based upon two university courses for bioinformatics and experimental biology students (Biological Data Analysis with R and High-throughput Data Analysis with R). The material is divided into chapters based upon the experimental methods used in the laboratories.
Key features include:
• Broad appeal: the authors target their material at researchers at several levels, ensuring that the basics are always covered.
• First book to explain how to use R and Bioconductor for the analysis of several types of experimental data in the field of molecular biology.
• Focuses on R and Bioconductor, which are widely used for data analysis. One great benefit of R and Bioconductor is the vast and very active user community and the practice of sharing code. Further, R is the platform of choice for implementing new analysis approaches; therefore, novel methods are often available early to R users.
Cover
Title Page
Foreword
Preface
Acknowledgements
About the Companion Website
CHAPTER 1: Introduction to R statistical environment
Why R?
Interacting with R
Packages and package repositories
Working with data
Basic operations in R
Some basics of graphics in R
Getting help in R
Files for practicing
Study exercises and questions
References
Webliography
CHAPTER 2: Simple sequence analysis
Sequence files
Reading sequence files into R
Obtaining sequences from remote databases
Descriptive statistics of nucleotide sequences
Descriptive statistics of proteins
Aligned sequences
Visualization of genes and transcripts in a professional way
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 3: Annotating gene groups
Enrichment analysis: an overview
Overrepresentation analysis
Gene set enrichment analysis
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 4: Next‐generation sequencing: introduction and genomic applications
High‐throughput sequencing background
Storing data in files
General data analysis workflow
Quality checking and screening read sequences
Handling alignment files and genomic variants
Genomic applications: low‐ and medium‐depth sequencing
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 5: Quantitative transcriptomics: qRT‐PCR
Transcriptome
Understanding delta Ct
Absolute quantification
Relative quantification using the ddCt method
Quality control with melting curve
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 6: Advanced transcriptomics: gene expression microarrays
Microarray analysis: probes and samples
Archiving and publishing microarray data
Data preprocessing
Differential gene expression
Creating normalized expression set from Illumina data
Automated data access from GEO
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 7: Next‐generation sequencing in transcriptomics: RNA‐seq experiments
High‐throughput RNA sequencing background
Preparing count tables
Complex experimental arrangements
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 8: Deciphering the regulome: from ChIP to ChIP‐seq
Chromatin immunoprecipitation
ChIP with tiling microarrays
High‐throughput sequencing of ChIP fragments
Analysis of binding site motifs
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 9: Inferring regulatory and other networks from gene expression data
Gene regulatory networks
Reconstruction of co‐expression networks
Gene regulatory network inference focusing of master regulators
Integrated interpretation of genes with GeneAnswers
Files for practicing
Study exercises and questions
References
Packages
CHAPTER 10: Analysis of biological networks
A gentle introduction to networks
Files for storing network information
Important network metrics in biology
Graph visualization
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 11: Proteomics: mass spectrometry
Mass spectrometry and proteomics: why and how?
File formats for MS data
Identification of peptides in the samples
Quantitative proteomics
Getting protein‐specific annotation
Files for practicing
Study exercises and questions
References
Webliography
Packages
CHAPTER 12: Measuring protein abundance with ELISA
Enzyme‐linked immunosorbent assays
Concentration calculation with a standard curve
Comparative calculations using concentrations
Files for practicing
Study exercises and questions
References
Packages
CHAPTER 13: Flow cytometry: counting and sorting stained cells
Theoretical aspects of flow cytometry
What about data?
Data preprocessing
Cell population identification
Relating cell populations to external variables
Reporting results
Files for practicing
Study exercises and questions
References
Webliography
Packages
Glossary
Index
End User License Agreement
Chapter 03
Table 3.1 Selected rows from the goannot data frame to visualize its content. The entire table contains tens of thousands of rows.
Table 3.2 The most over-represented biological process ontologies associated with the geneList example dataset from the hgu95av2.db package.
Chapter 05
Table 5.1 Known viral load in reference samples and corresponding Ct values.
Table 5.2 Sample coding in the qpcR package for the ratiocalc() function, where the expression level of a housekeeping reference gene (r) and a target gene (g) is measured in a control (c) and two treatment samples (s) at three different time points.
Chapter 06
Table 6.1 Differentially expressed probes in a microarray experiment with corresponding statistics.
Chapter 07
Table 7.1 Read counts of two example genes from the RNAseqData.HNRNPC.bam.chr14 dataset. Individual samples are represented by columns.
Table 7.2 Differentially expressed genes calculated with edgeR.
Chapter 08
Table 8.1 Peak candidates with the highest coverage from the ChIP‐seq data of Eaton's paper “Conserved nucleosome positioning defines replication origins.” The peak with ID 212 in the first row is likely an artifact, with an excess number of reads mapped there.
Table 8.2 Positional weight matrix of a candidate PU.1 binding site in mouse B‐cells identified by a ChIP‐seq experiment.
Chapter 11
Table 11.1 Naming conventions of peptide fragments after fragmenting a protein at a peptide bond.
Chapter 01
Figure 1.1 The plot() function produces a scatter plot when it is called on a column of a data frame. Unless specified differently, the values appear in their order in the data frame itself.
Figure 1.2 Calling plot() on multiple columns of a data frame results in a correlogram among the columns of the data frame.
Figure 1.3 The frequency distribution of 1000 random numbers, plotted with the hist() function.
Chapter 02
Figure 2.1 Dinucleotide under‐ and over‐representation of the titin gene as calculated using different methods offered by the seqinr package. (a) Rho statistics, (b) zero‐centered rho statistics, (c) Z‐score with the “codon” dinucleotide formation model, and (d) Z‐score with the “base” dinucleotide formation model.
Figure 2.2 The effect of the window size, (a) 1000, (b) 5000, (c) 10,000, and (d) 50,000, on the sliding window calculation of the GC content of the titin gene.
Figure 2.3 Location of amino acids with different physicochemical properties in human rhodopsin protein.
Figure 2.4 Hydropathy of human rhodopsin protein calculated by the sliding window approach. Arrows mark the hydrophobic regions corresponding to transmembrane helices.
Figure 2.5 Genomic and transcriptomic features of the human FURIN gene visualized by the GenomeGraphs package.
Chapter 03
Figure 3.1 The relationship of the over‐represented biological process ontologies visualized as a directed acyclic graph.
Figure 3.2 Gene set enrichment analysis plots of the running enrichment score of simulated ranked gene lists where an annotation is associated with (a) the top, (b) the bottom, or (c) the middle of the gene list. Plot (d) shows the score for an annotation term without association. For practical purposes, only cases (a) and (b) are valuable.
Figure 3.3 Representation of the changes in the running enrichment score of cancer pathways for a gene list. The term “cancer pathways” is associated with genes at the top of the list.
Chapter 04
Figure 4.1 An overview of the key steps of Solexa sequencing.
Figure 4.2 The relationship of quality score and the probability of erroneous base identification in a DNA sequencing experiment.
Figure 4.3 Different versions of encoding quality scores as ASCII characters. The + signs mark the ranges used by different companies to encode scores with different algorithms. For example, a quality score of 9 in Illumina pipeline versions 1.3–1.4, which use Phred+64 encoding, corresponds to value 73, which is encoded as the character “I”. The same quality score of 9 corresponds to value 42 in Sanger and current Illumina Phred+33 encoding, which is the character “*”.
Figure 4.4 Visualization of the genomic region of human
BRCA1
gene with alternative transcripts and the reads from the example dataset.
Chapter 05
Figure 5.1 The principles of polymerase chain reaction (PCR).
Figure 5.2 Change of fluorescent intensity during an RT‐PCR reaction.
Figure 5.3 Change of fluorescent intensities of diluted samples with technical parallels.
Figure 5.4 Grayscale version of the efficiency plot of RT‐PCR measurements. Original measurements are represented by circles. The fitted model and its derivative curves are shown to illustrate efficiency and threshold calculations.
Figure 5.5 Ct calculation with a common threshold from diluted samples with technical parallels.
Figure 5.6 Viral load calculation with RT‐PCR. The known viral load values of the reference samples are represented by circles, the mathematical model by a dashed line, and the unknown samples used in the calculations by diamonds.
Figure 5.7 Efficiency estimation using the ratiocalc() function. Three different methods (a–c) are used to determine the confidence interval of the efficiency.
Figure 5.8 Differential gene expression results in a complex experimental arrangement using three time points in two treatment animals as determined by RT‐PCR.
Figure 5.9 Melting curve analysis of a single sample.
Chapter 06
Figure 6.1 The principles of a simple gene expression microarray experiment.
Figure 6.2 Plots to assist quick visual inspection of microarray data. (a) Raw image of the array; (b) distribution of raw intensities in a single sample; (c) distribution of raw intensities in all samples; (d) log intensities in all samples.
Figure 6.3 Comparison of the log intensities in microarray samples before and after normalization.
Figure 6.4 Principal component analysis of gene expression microarrays. Samples are clearly separated by both genotype (left) and stimulation length (right).
Figure 6.5 Grayscale version of gene expression heatmap from a microarray experiment. Arrays (columns) and differentially expressed probes (rows) are clustered so that it is easy to visually detect the association of sample and gene groups.
Chapter 07
Figure 7.1 Correlation of gene‐wise dispersion as calculated with edgeR and DESeq. The dashed line represents equal values.
Figure 7.2 Fold change of gene expression as a function of read counts. Genes with significant differences between knockout and treatment samples are highlighted with stars.
Figure 7.3 Comparison of fold change values and adjusted p‐values calculated with edgeR and DESeq.
Figure 7.4 Heatmap representation of gene expression in different samples. The clustering of samples (in columns) and genes (in rows) is visualized by dendrograms on the top and left, respectively.
Chapter 08
Figure 8.1 Distribution of immunoprecipitation ratios in the investigated samples. Since the measured values are actually the logarithms of the ratios, a value close to zero means equal measured binding in the treatment and control conditions.
Figure 8.2 Binding peaks of the Isw1 protein in yeast investigated by a two‐color tiling array. The plot shows a segment of chromosome 4. Horizontal bars show regions that could be subject to further investigation.
Figure 8.3 Using a sliding window approach to smooth the immunoprecipitation peaks. The arrows show the most prominent peaks, which probably come from single reporters. The smoothed curve, represented by the dashed line, can be carried further in the analysis.
Figure 8.4 Correlation of immunoprecipitation ratios in three independent biological samples. Correlation values are shown on the lower triangle of the sample matrix, while correlograms appear on the upper triangle.
Figure 8.5 Comparison of the binding peaks of Isw1 protein in three independent biological samples.
Figure 8.6 Coverage plot of ChIP‐seq reads representing nucleosome positions on chromosome 14 of yeast.
Figure 8.7 Gray scale version of a sequence logo representing a candidate PU.1 binding site in mouse B‐cells.
Chapter 09
Figure 9.1 Distribution of different metrics describing co‐expression of genes. (a) The Pearson correlation of all possible edges in the co‐expression network; (b) filtered Pearson correlations including only those edges that are significant; and (c–f) non‐rejection rates with orders 1, 3, 5, and 7.
Figure 9.2 Comparing the proportion of highly correlating edges in co‐expression networks using the Pearson correlation and non‐rejection rates with orders 1, 3, and 5.
Figure 9.3 Graph density as calculated using non‐rejection rates of different orders. The lower the threshold value, the stricter the selection for valid edges, and the lower the density of the remaining network. Observe that the higher the order used for calculating non‐rejection rates, the fewer edges pass any given threshold, resulting in lower‐density networks.
Figure 9.4 The largest connected components of co‐expression networks based on the Pearson correlation coefficient (a) and third‐order non‐rejection rate (b) of gene expression. Only components with at least four genes are shown.
Figure 9.5 First‐order neighbors of a gene in the co‐expression network calculated from the Pearson correlation coefficient (a) and third‐order non‐rejection rate (b) of gene expression (see Figure 9.4).
Figure 9.6 Distribution of the raw expression values (a) and their logarithm (b) of selected transcription factors in a microarray dataset.
Figure 9.7 Master regulators related to the transcriptional network of selected transcription factors. Nodes in the networks are identified either by microarray probe IDs (a) or by the corresponding gene symbols (b).
Figure 9.8 Visual aids for interpreting gene expression data using the GeneAnswers package. Pie chart of enriched KEGG pathways (a), and concept networks connecting dominant pathways and corresponding genes, using p‐values (b) or gene numbers (c) to highlight the most important pathways.
Figure 9.9 Heatmap representation of gene expression data and genes differentially expressed in enriched pathways.
Chapter 10
Figure 10.1 Most important concepts of networks. Nodes represent entities, while connections called “edges” between them represent their relationships. Edges can be directed (arrows) or not (lines). Weights or other values can be associated with edges, which can be represented by edge labels, or the thickness of the edges.
Figure 10.2 Representing the protein interactions of small GTPases by an undirected network.
Figure 10.3 Gene regulatory network of floral development in Arabidopsis thaliana.
Figure 10.4 The Arabidopsis gene regulatory network with activating and repressing gene interactions visualized by arrow color and arrow labels.
Figure 10.5 Random networks of 200 nodes generated using different algorithms.
Figure 10.6 Predicted interactions in the human proteome. The 500 interactions with the highest prediction scores are shown (a). The largest connected component (b) contains half of the proteins and 65% of all predicted interactions.
Figure 10.7 Six different networks used in demonstrating the calculation of network metrics. The three networks on the top are inferred from biological data, while on the bottom there are three random networks generated for modeling purposes.
Figure 10.8 Global efficiency of the biological and model networks in Figure 10.7.
Figure 10.9 Degree distribution of the biological and model networks in Figure 10.7.
Figure 10.10 Vulnerability of nodes in the Arabidopsis gene regulatory network. The darkness of a node correlates with its vulnerability measure, with LFY being the most vulnerable node.
Figure 10.11 Community structure of the human protein interaction network. (a) Nodes with a clustering coefficient greater than 0.6 are colored black. Communities are identified by a fast greedy algorithm (b), random walks (c), and simulated annealing with a spin‐glass model (d).
Figure 10.12 Values associated with vertices, visualized by the size of the nodes.
Figure 10.13 Values associated with vertices, visualized by the line type of the edges.
Figure 10.14 Comparison of different layout algorithms to visualize the human protein interaction network. (a) Random, (b) circular layouts, and layouts optimized by the Kamada–Kawai (c) and Reingold–Tilford (d) algorithms.
Chapter 11
Figure 11.1 Peaks of a single scan in a mass spectrometry measurement.
Figure 11.2 Fragmentation of a protein at a peptide bond.
Figure 11.3 Noise filtering in an MS/MS spectrum. The majority of the peaks in the original spectrum (a) originate from random fluctuations in the detector, not from molecules in the sample. The best peaks (b) represent real data, and they can be used in subsequent calculations.
Figure 11.4 Identification of amino acids from the mass differences of peaks. Consecutive segments represent a “b ladder,” such as the segments of the VYK peptide.
Figure 11.5 The distribution of peptide abundance in six samples from a tandem mass spectrometry experiment. (a) Raw peptide mass and (b) normalization by the sum of the masses. The variance of the distributions can be controlled either on the raw data (c), or the sum normalized data (d).
Figure 11.6 Abundance of selected proteins in six samples from a tandem mass spectrometry experiment.
Figure 11.7 Heatmap showing the similarities between individual samples (columns) and proteins (rows) based on protein abundances originating from a tandem mass spectrometry experiment.
Chapter 12
Figure 12.1 The principles of direct (a) and sandwich (b) enzyme‐linked immunosorbent assays.
Figure 12.2 Well‐to‐well variance of optical density in a typical ELISA measurement. Grayscale version, as the observed colors are usually different shades of orange.
Figure 12.3 Linear model as standard curve in ELISA experiments. Though a linear model approximates the measured optical densities closely (a), the residuals are not random (b), indicating a systematic bias in the analysis. Samples with suspiciously large residuals are marked by sample IDs on the plot.
Figure 12.4 Comparing the performance of linear and four‐parameter logistic models in an ELISA experiment.
Figure 12.5 Changes of IL‐2 protein production of T‐cells measured by ELISA.
Figure 12.6 Comparing the effect of different treatments on the IL‐2 protein production of T‐cells in Furin knockout and wild‐type mice.
Chapter 13
Figure 13.1 Excitation and emission spectra of green fluorescent protein.
Figure 13.2 Excitation and emission spectra of green fluorescent protein (a) and R‐phycoerythrin (b). Hatched boxes show the wavelengths of respective detectors. The gray area shows where the excitation spectrum of GFP “spills” into the detector of R‐PE.
Figure 13.3 Flow cytometry analysis of human peripheral blood cells. The FSC–SSC plot shows the most important white blood cell populations targeted in numerous studies.
Figure 13.4 Marginal events on an FSC‐SSC plot. 3248 detected cells accumulated at the maximal sensitivity of the FSC and the SSC detectors. Those are highlighted by circles on the top and right margin of the plot.
Figure 13.5 A subset of samples from a workflow object visualized using the xyplot() function after compensation.
Figure 13.6 Elliptical gates applied on the FSC–SSC plot select living leukocytes in Cre‐infected blood samples.
Figure 13.7 Polygonal gates applied on the FSC–SSC plot select living leukocytes in Cre‐infected blood samples. Polygon gates make it possible to select cell populations more precisely.
Figure 13.8 The proportion of IL‐2‐expressing cells among Cre‐ and Migr‐infected samples.
Csaba Ortutay
Zsuzsanna Ortutay
Copyright © 2017 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication data applied for:
ISBN: 9781119165026
Cover image: KEVIN CURTIS/Gettyimages
To our children: Botond, Balázs, Virág, and Luca
Do you need to find out the statistical significance of observations? Are you looking for a suitable statistical test for gene expression analysis? Or are you just struggling with making sense of a set of enzyme‐linked immunosorbent assay (ELISA) experiments? This book, written by a husband‐and‐wife team, is a very important and useful one. It is about data analysis with the widely used R software environment.
When I started my career as a young bioinformatician at the dawn of the field, there were not many books available. For many years, I had a small, indeed very small, library of all the published books in bioinformatics. Now there are so many books that one starts to doubt whether there is a need for all the new ones. However, that does not apply to this book. As probably the first dedicated professor in bioinformatics in the Nordic countries, I was constantly looking for new textbooks to use in our research group as well as for teaching. I just wish that this volume had been available a long time ago. It would have helped on numerous occasions and saved time.
Csaba Ortutay came to my group in Finland as a postdoctoral fellow in 2004 to work for 2 years. In the end, he stayed for many years. During that time, I got to know the couple and their children. Csaba turned out to be a resourceful, meticulous, and innovative scientist who participated actively in the work of the entire group. I once gave him a side project, which turned out to be his major project for many years. I thought that it would take him several months to get the work done; instead, he came back with results in a few weeks. We investigated the immunome, the entirety of genes and proteins essential for the immune system. After I left for my current position, Csaba took care of my position for a while, including the International Master's Degree Programme in Bioinformatics. Csaba and Zsuzsanna complement each other in many ways, which is visible in the contents of the book.
One characteristic of Csaba is evident in this text: he is a terrific teacher. He understands the problems of newcomers, whether they are students or scientists in need of new knowledge and methods. In addition, the couple together brings the text and examples to a practical level, so that those who work primarily in the wet lab rather than the dry lab can easily grasp the essentials of the methods and apply them to their own problems.
Science is increasingly done in big multidisciplinary teams, which is great, as it makes many difficult research areas tractable. The downside is that many tasks are delegated to experts, and this often applies to statistics. As a result, other scientists may find it difficult to test their ideas, since they are not familiar with the principles and practice of statistics. This book provides the necessary information and confidence to try one's own hand at data analysis. By being able to investigate the datasets oneself, it becomes possible to address new questions and maybe even find some completely new correlations and properties. The book covers a wide array of research topics dependent on statistics, ranging from sequence analysis and enrichment analysis to next‐generation sequencing (NGS) data analysis, from gene and protein expression to the networks they form, and from immunoassays to cell sorting. In none of these areas is it possible any more to get studies published without professionally made, detailed, and thorough statistical analysis. Statistics provides the toolbox to tease out the significant findings from large, and often not so large, datasets.
To find answers to those questions, turn the page.
Mauno Vihinen
Professor, Lund University, Sweden
During the time when I was responsible for directing the International Master's Degree Programme in Bioinformatics at the University of Tampere, Finland, I actively monitored the trends of the relevant job markets to see which new subjects would be beneficial to our students. Recently, the rapid spread of high‐throughput analytic methods in molecular biology has initiated new trends in the bioinformatics scene, which earlier focused mostly on sequences, databases, protein structures, and the related algorithms. The term “big data,” already well known in computer science and business analytics, started to spread in the context of genomics, medical informatics, and other molecular biology‐related fields.
Since the amount of data coming from these new experiments is several orders of magnitude larger than what was handled earlier, completely new approaches are often needed: data have to be collected, stored, managed, and analyzed. During the past few years, this field developed dynamically, and several analysis tools and approaches were introduced, especially around NGS and the derived technologies. On the data analysis and visualization front, R and Bioconductor established a solid market share. Research groups and industrial employers started to seek people who have experience with this language, driving the value of R skills up.
R and Bioconductor libraries develop very quickly, and they become more and more complex. It is not a trivial task for self‐learners to build these skills on their own. Though more and more online tutorials and how‐tos are available on different forums and developer sites, they are often obsolete or too narrowly targeted. It is increasingly problematic to piece together a complete analysis from the material available online, especially for someone who is a novice in the world of R or molecular data analysis.
R itself has a relatively flat learning curve, meaning that a lot of effort must be invested before “production‐ready” proficiency is reached. The complexity of molecular biology‐specific algorithms, often unfamiliar to those from outside the field, only adds to these challenges. The net result is that it is increasingly hard for new learners to enter the field at a time when demand is growing dynamically.
Motivated by this situation, I decided to introduce a course series in our program in Finland, teaching our students the basics of R using molecular biology‐related case studies, often based on real‐life laboratory data. Later, the first course was followed by another one, extending the foundations to high‐throughput applications with a strong focus on NGS data.
While these courses were still under development, I got more and more inquiries from wet‐lab researchers working in molecular biology laboratories, among others from my wife and her colleagues, about how to analyze diverse datasets in R, which was becoming more and more popular among them too. This was the point at which we started to think about how to make these courses available to the public. Today, we deliver our R courses to universities in three European countries, and to many individuals via our e‐learning services.
This book offers the content of our courses arranged by molecular biology topic. This structure serves readers who already have some prior experience with the R language as well as those who are new to this segment of biodata analysis.
My recommendation for those who wish to learn R without much prior experience is to start with the chapters covering methods producing smaller amounts of data, such as Simple sequence analysis (Chapter 2), Annotating gene groups (Chapter 3), Quantitative transcriptomics: qRT‐PCR (Chapter 5), and Measuring protein abundance with ELISA (Chapter 12). In these chapters, the readers can practice the most frequently used approaches in R, and they can become ready for handling larger datasets, such as those from microarray‐ or NGS‐based experiments.
All the chapters begin by providing the necessary theoretical background about the experimental method at hand. We aimed at distilling the available large volumes of theory down to the bare essentials that are still needed to interpret the analyzed data correctly. In my opinion, one pitfall of handing the analysis of molecular biology data over to information science experts is that they often neglect the constraints coming from the experimental setups themselves.
Then, detailed practices are explained with code examples, together with help in interpreting the produced results. Ideally, the reader should be able to see the complete workflow of an analysis type, from the raw data to the presentation of the results. In our experience, this structure offers learners the most versatile benefits.
The code examples and the data files used in this book are available from GitHub, together with the scripts used to generate the plots on these pages. Since the libraries and best practices in use are developed continuously, the online material is continuously updated to comply with these changes:
https://github.com/csortu/MDAuR
Finally, I am convinced that the readers of this book will learn skills from these chapters that will help them advance their research careers, equip them with competitive skills on the job market, and, in general, advance the molecular biology field by teaching people about these tools.
Csaba Ortutay, PhD
CEO, HiDucator Ltd.
Adjunct Professor of Bioinformatics, University of Tampere, Finland
This book is the result of scientific discussions over the dinner table. As a wet‐lab researcher, I face several difficulties in planning, carrying out, and, last but not least, analyzing scientifically significant experiments. Like any couple, we discuss everyday happenings during family dinner, such as what is going on in the workplace, which in my case means the lab. I feel lucky that I can get advice when I feel stuck with my analyses, and that I have somewhere to turn when the results just do not look like I thought they should. Are the results I am waiting for really among the data? Several times we had experimental data that I was not sure how to interpret. How do I generate publication‐ready figures from those microarray data? Should I first calculate the average and then fit a curve to my data points, or vice versa? What software should I use to determine whether there is a significant difference between the Ct values originating from a quantitative real‐time polymerase chain reaction (qRT‐PCR) experiment comparing treated and control cells? Csaba's answer is always the same: Why don't you use R?

So, I started to get acquainted with R. It was strange and frightening at the beginning, but luckily I had a very good and friendly teacher, a teacher from the family. I asked and asked and asked, and he answered and answered and answered. I even started to ask questions on behalf of my colleagues, and so the idea of this book was born. Why not help others like me? I collected the different types of experiments going on in our lab, and we went through each analysis process step by step together. I asked the questions that a typical wet‐lab researcher would ask, and Csaba supplemented the list of questions from the computer scientists' and bioinformatics students' points of view. Now, at the end of the answering process and the editing of the text, I feel satisfied and proud. I have learned how to use R for computing statistics, plotting results, and generating figures. It is still not easy for me, but now I have this book to look to for solutions and examples when I am stuck.

I recommend this book to those who want to understand the hows and whys of the data analysis process; to wet‐lab researchers who wish to analyze their experimental data themselves and do not want to wait for data analysts; and to students, whether from the field of molecular biology or bioinformatics, who will use R in their careers. I also recommend this book to computational scientists who get experimental results to analyze but have no clue what the experiment is all about. I recommend this book to you, dear reader. I hope you will enjoy it and get useful tips and solutions for your problems. Feel free to experiment with the provided data; try not only the demonstrated solutions, but also write commands yourself. Make mistakes, since those are what you can learn from the most! Look for new commands and try out what you have just learned on your own data! And most of all, enjoy your journey in the world of data analysis!
Zsuzsanna Ortutay, PhD
Postdoctoral Researcher, University of Tampere, Finland
Before we started to assemble the material of this book, we were sure that the process would require a considerable amount of work. Without the assistance of a large number of friends, colleagues, and helpful people, this journey would have been much harder for us, perhaps even fruitless. We would like to thank everyone who supported our efforts with ideas, by reading our manuscript, or by encouraging us along this road.
While writing, we consulted people who have first‐hand experience of the areas in question, and who were kind enough to offer their comments on our text. Dr Martti Tolvanen from the University of Turku, Finland, helped us with Chapter 3; Ms Milena Doroszko from the University of Turku, with Chapter 5; and Dr Laura Pöyhönen with Chapter 13. Dr Anna Grönholm and Harlan Barker also assisted us with their comments on multiple parts and chapters.
The practical parts of this book could not be complete without appropriate datasets. While it is easy to obtain data for some methods from public databases, it is virtually impossible to get raw data for others. The example datasets for Chapters 5, 12, and 13 originate from Zsuzsanna's laboratory. We are grateful to her supervisors, Adjunct Professor Marko Pesu and Dr Ilkka Junttila from the University of Tampere, Finland, for giving us permission to use the raw data from their published and even unpublished work.
Csaba's former principal investigator from his postdoc years, Professor Mauno Vihinen from Lund University, Sweden, helped us to find and contact our publisher. He also offered his help by writing the Foreword to our book. Thank you, Mauno, for all of your support, without which this book could not have been written.
Last, but not least, we are grateful to editor Mindy Okura‐Marszycki, who offered her helpful comments on practical issues. As we are inexperienced in the world of book publishing, these thoughts were invaluable in assisting our book‐writing effort.
Don’t forget to visit the companion website for this book:
www.wiley.com/go/ortutay/molecular_data_analysis_r
There you will find valuable material designed to enhance your learning, including:
Data
Figures
Scripts
If you work in the field of biodata analysis, or if you are interested in getting a bioinformatics job, you can find a large number of related job advertisements targeting young professionals. One common requirement keeps coming back in those ads: they demand “a high degree of familiarity with R/Bioconductor.” (Here, I am quoting an actual recent ad from Monster.com.)
Besides, when we have to create and analyze large amounts of data during our bio‐researcher careers, sooner or later we realize that simple approaches using spreadsheets (such as the Excel part of MS Office) are no longer flexible enough to fulfill the needs of our projects. In these situations, we start to look for dedicated statistical software tools, and soon we encounter the countless alternatives from which we can choose. The R statistical environment is one of these possibilities.
With the exponential spread of high‐throughput experimental methods, including microarray‐ and next‐generation sequencing (NGS)-based experiments, skills related to the large‐scale analysis of data from biological experiments are becoming ever more valuable. R and Bioconductor offer a free and flexible tool‐set for these types of analyses; therefore, many research groups and companies select them as their data analysis platform.
R is an open‐source software licensed under the GNU General Public License (GPL). This has an advantage that you can install R for free on your desktop computer, regardless of whether you use Windows, Mac OS X, or a Linux distribution.
Introducing all the features of R thoroughly at a general level exceeds the scope and purpose of this book, which focuses on molecular biology‐specific applications. For those who are interested in a deeper introduction to R itself, we suggest the book R for Beginners by Emmanuel Paradis as a reference guide. It is an excellent general guide, which can be found online (Paradis 2005). In this book, we use more biology‐oriented examples to illustrate the most important topics. The other recommended book for this chapter is R in a Nutshell by Joseph Adler (2012).
The first task in analyzing data with R is to install R on the computer. There is a nice discussion on bioinformatics blogs about why people so seldom use the knowledge acquired on short bioinformatics courses. One of the main conclusions is that the greatest challenge is often simply installing the software in question.
There is plenty of information available on the web about how to install R, but the most authentic source is the website of the R project itself. On this page, the official documentation, installers, and other related links from the developers of R themselves are collected. The first step is to navigate to the download section of the page and find the mirror page closest to the location of the user.
However, there are some differences in the installation process depending on the operating system of the computer in use. Windows users should find the Windows installer for their system on the download pages. Be sure to pick the base installer, not the contributed libraries. In the case of a Linux distribution, R can be installed via the package manager. Several Linux distributions provide R (and many R libraries) as part of their repositories. This way, the package manager can take care of the updates. Mac OS X users and Apple fans can find the pkg file containing the R framework, the 64‐bit graphical user interface (GUI) (R.app), and the Tcl/Tk 8.6.0 X11 libraries for installing the R base system on their computer. Brave users of other UNIX systems (i.e., FreeBSD or OpenBSD) can use R, but they have to compile it from source; this is not a beginner topic. On a computer owned by a company, university, or library, the installation of R (just like that of many other programs) most often requires superuser rights.
The interface of R is somewhat different from other software used for statistics, such as SPSS, S‐plus, Prism, or MS Excel (which is not a statistical software tool!). There are neither icons nor sophisticated menus to perform analyses. Instead, commands should be typed in the appropriate place of R called the “command prompt”. It is marked with >. In this book, the commands for typing into the prompt are marked by fixed‐width (monospaced) fonts:
> citation()
After typing in a command (and hitting Enter), the results turn up either under the command or, in case of graphics, in a separate window. If the result of a command is nothing, the string NULL appears as a result. Mistyping or making an error in the parameters of a command leads to an error message with some information about what was wrong.
> c()
NULL
> a * 5
Error: object 'a' not found
From now on, we will omit the > prompt character from the code samples so that you can simply copy and paste the commands. Leaving R is done with the quit() function or its short form q():
quit(save='no')
q()
A command‐line interface is enough for performing the practices. However, some prefer to have a GUI. There are multiple choices depending on the operating system in use. The Windows and Mac versions of R start with a very simple GUI, while the Linux/UNIX versions start only with a command‐line interface. The Java GUI for R is available for any platform capable of running Java, and it sports simple, functional menus to perform the most basic tasks related to an analysis (Helbig, Urbanek, and Fellows 2013).
For a more advanced GUI, one can experiment with RStudio or R Commander (Fox 2005). There are several plugins to integrate R into the best coding production tools, such as Emacs (with the Emacs Speaks Statistics add‐on), Eclipse (by StatET for R), and many others.
Doing data analysis in R means typing in commands and experimenting with parameters suitable for the given set of data. At a later stage, the procedure is often repeated, either on the same data with slight modifications in the course of the analysis, or on different data with the same analysis. For example, the analyzed data are submitted for publication, but the manuscript reviewers request slight modifications to the analysis. This means repeating almost the entire process, but with parameter x set to 0.6 instead of the 0.5 used earlier.
Scripts are used to record the steps of an analysis. Scripts are small text files containing the commands of the analysis one after the other, in the same order as they are issued during the data processing. Traditionally, we use the “.R” extension (instead of .txt) for these text files to mark that they are R script files. Script files are the solution for
archiving an analysis, and
automating tasks that take a long time to run.
Script files can easily be included in an analysis workflow by “sourcing” them (the term is borrowed from other scripting languages) with the source() command. For example, let's have the following script file, my_first_script.R:
a<-rep(5,5)   # a vector containing the number 5 five times
b<-rnorm(5)   # five random numbers from the standard normal distribution
print(a)
print(b)
print(a*b)    # the element-wise product of the two vectors
Scripts can be created using any text editor (e.g., gedit, mcedit, Notepad) but not with word processor software (e.g., MS Word, LibreOffice Writer, or iWork Pages), unless it can save the script as a plain text file rather than a .doc, .docx, .odt, or any other more complex format. R must then be pointed to the location of the saved file:
setwd('/home/path/to/your/files') #on Linux/UNIX
setwd('/Users/User Name/Documents/FOLDER') #on Mac
setwd('c:/path/to/my/directory/') #on Windows
The working directory can be checked using the getwd() command.
Loading the script file in the working directory is simple:
source('my_first_script.R')
If the script is somewhere else, the full path is required:
source('/path/to/my_first_script.R')
When R is started for the first time, it creates two files to register what was done: the history file and the environment file. If R was started from the command line, these files are saved in the directory where R was started. Launching R from an icon results in the history and the environment file being saved to a default place.
The history file is a text file that saves all the commands issued in a session with R, while the environment file holds the data used during the session. It is worth saving these files for further use with the savehistory(file = "/path/to/Rhistory") and save.image(file = "/path/to/RData") commands. When exiting R with the q() command, it asks whether you want to save these to the default places. Choosing this option means that the next time R is started from the same directory, it will remember the past work and data.
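As a minimal sketch (the file names are illustrative), saving a session explicitly and restoring it later might look like this:
savehistory(file = "my_analysis.Rhistory")  # save all commands issued so far
save.image(file = "my_analysis.RData")      # save all data objects in memory
# ... then, in a later R session ...
loadhistory(file = "my_analysis.Rhistory")  # restore the command history
load("my_analysis.RData")                   # restore the saved data objects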
Statistics is a huge field, and many disciplines use it for their specific purposes. All of them have different needs, flavors, and data types designed specifically for those needs. It would be meaningless and hopeless to put everything into a single piece of software. Honestly, the majority of the code would never be used: a bioinformatician rarely needs statistics designed for particle physics, just as a computational chemist rarely reads in data from gene expression microarrays.
To address this problem, the R developers decided to provide only a common framework and some basic functionality as part of the base installation, while subject‐specific elements are organized into bundles of code called “packages.” In reality, the base installation of R is not very useful for molecular data analysis. The good news is that suitable packages can be found for most of the commonly applied analysis types.
R packages are collected into so‐called package repositories on the web. These sites are dedicated to the maintenance and distribution of the packages; the concept is probably familiar to Linux users. R uses its own internal package management system to find, install, and update packages. There are two important package repositories, both of which are used in this book: the Comprehensive R Archive Network (CRAN) and Bioconductor.
CRAN (R Core Team 2015) is a place for general‐purpose packages, but many biology‐related packages can be found here too. One can search for packages related to a topic of interest (left side of the page, Software/Packages/Table of available packages, sorted by name) by keyword. For example, if packages related to biological sequences are required, searching (Ctrl+F) for the keyword “biological sequence” on this page will quickly find them.
Here, we introduce the sequences package (Gatto and Stojnic 2014). Clicking on the name of the package leads to a general information page. The most relevant documents here are the Vignettes (if they are available), providing a quick introduction to the package, and the reference manual that shows an extensive explanation for all the commands and datasets provided by the package.
Installing and managing CRAN packages is best done within R itself. Most GUIs provide some assistance for package management in the “Packages” menu. It is simple to install packages using the install.packages() command. Downloading the packages and their dependencies requires Internet access. The installation process can take a long time if the selected package depends on many other packages.
install.packages("sequences")
On Linux, install.packages() works properly if it is issued in an R session run by root, or if a user‐writable library is specified as the package directory to write to:
install.packages("sequences", lib="/home/mydir/Rpackages/")
The full list of available packages can be checked using available.packages(). This command lists all the packages compatible with the R version and operating system of the computer in use, which often means many thousands of packages.
ap<-available.packages()
row.names(ap)
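For a quick first look, the returned matrix can also be queried directly (a small sketch; the search pattern is arbitrary):
nrow(ap)                               # the total number of available packages
grep("seq", rownames(ap), value=TRUE)  # package names containing "seq"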
Loading a successfully installed package (e.g., the sequences package in the previous example) is done using the library() command (without quotation marks around the package name this time).
library(sequences)
There is another R package repository dedicated mostly to the analysis of high‐throughput data from molecular biology, called “Bioconductor” (Gentleman et al. 2004). It contains more than 1500 packages dedicated to this exciting field of bioinformatics. The packages are divided into three groups:
Software—This section contains the most interesting packages, which assist with different kinds of analyses. This sub‐repository is roughly analogous to CRAN in the sense that the packages here provide the statistical methods and procedures, such as microarray normalization functions or enrichment analysis approaches.
AnnotationData—Here is a collection of very important supporting information concerning genome, microarray platform, and database annotation. These packages are useful mostly as input data for other packages in the Software section.
ExperimentData—Prepared experimental data are available from here for further analysis. It is a good idea to test a new statistical method or analysis approach on data from here first. This ensures that the code in use is compatible with the rest of the Bioconductor framework.
The packages are listed in a logical and hierarchical system, and it is relatively easy to find relevant packages for a certain type of analysis. For example, if mass spectrometry is in the focus of interest, the relevant packages can be found in the Software ‐> Assay Technologies ‐> Mass Spectrometry branch of the hierarchy, while for inferring networks from experimental data, the Software ‐> Bioinformatics ‐> Networks ‐> Network Inference branch should be checked. The vignette and the reference manual appear on the dedicated page of the chosen package in a similar way as in CRAN.
There is another, perhaps even more practical, way to find suitable packages from Bioconductor. There are complete recipes for more popular data analysis tasks in the Workflows section of the Bioconductor page Help menu, which not only shows the needed packages but also demonstrates how to use them.
Bioconductor uses its own package management system that works somewhat differently than the stock R system. It is based on a script titled biocLite.R, which can be sourced directly from the Internet:
source("
http://bioconductor.org/biocLite.R
")
This script contains everything needed for managing Bioconductor packages. For example, to install the affy package (Gautier et al. 2004), the biocLite() command should be called:
source("
http://bioconductor.org/biocLite.R
")
biocLite("affy")
This command processes the dependencies and installs everything that is needed. The annotation and experimental data packages tend to be huge, so a high‐speed Internet connection (or a lot of patience) and a sufficient amount of disk space are needed to install them.
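Since biocLite() accepts a vector of package names, several packages can also be installed in one step; a quick sketch (the package choices are illustrative):
biocLite(c("affy", "limma"))  # install both packages and all their dependencies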
Loading of the installed packages happens in the same way as with CRAN packages:
library(affy)
For a data analysis project, well, data are needed. How to load data into R is a crucial question, and often the second biggest challenge for a newbie bioinformatician. Several R tutorials start explaining this topic by introducing the c(), edit(), and fix() commands. These are commands and functions used to type in numbers and information in a tabular format. They are also the commands that are rarely used in a real‐life project. The cause of this is simple: no one types in the gene expression values of 40,000 gene probes for a few dozen samples.
Most often, data are loaded from files. Files may come from databases, from measurement instruments, or from other software. Often, data tables are assembled in MS Excel. MS Excel and other spreadsheet software can also export data tables as .csv files, which are easy to load into R. Depending on the operating system in use and the exact installation of R, there are multiple possibilities for reading .xls files. The package gdata (Warnes et al. 2015) contains the read.xls() command, which can access the content of both .xls and .xlsx files:
library(gdata)
my.data<-read.xls("data_file.xlsx", sheet=1)
This code reads a table from the first sheet of the .xlsx file into the my.data data frame. It is an excellent tool, but it requires the installation of Perl (a scripting language) on the computer. In Linux/UNIX installations this is not a problem, but in Windows environments it is not easy to solve. A universal solution to this problem is to read data from exported .csv files. This approach works on all platforms, and it does not require the installation of additional packages:
my.data<-read.csv("fdata_file.csv",sep="\t",row.names=1)
The first step is to prepare a data table using Excel or another spreadsheet program. The data are then exported to a .csv file, called a “tabular text file” or “comma‐separated text file” in different programs. It is important to specify a tab as the field separator (sep="\t") in the settings instead of a comma, which is usually the default field separator for .csv files.
To handle the problem of transferring data from MS Excel to R, a new package called readxl has been released recently (Wickham 2015). The ultimate goal of this package is to provide accessibility to data saved in Excel files without further dependencies, and in an operating system‐independent way.
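A minimal sketch of reading the same spreadsheet with readxl (the file name is illustrative):
library(readxl)
my.data<-read_excel("data_file.xlsx", sheet=1)  # no Perl installation needed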
There are many proprietary file formats produced by different instruments. Dedicated packages have been developed to read their content and load it into the proper data structures in R for further analysis. For example, the ReadAffy() command from the affy package is designed to import Affymetrix GeneChip CEL files. Similarly, the read.fasta() command of the seqinr package (Charif and Lobry 2007) or the readFASTA() command of the Biostrings package (Pages et al. 2015) can import FASTA formatted sequence files.
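For instance, reading a FASTA file with seqinr might look like this (the file name is illustrative):
library(seqinr)
seqs<-read.fasta(file="example.fasta")  # a list of sequences
length(seqs)                            # the number of sequences in the file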
This book has a dedicated support webpage. Here, all the R scripts and data are available to do all the practices discussed in the following chapters. As bonus material, the scripts used for generating the figures on these pages are also available from the same place.
Save the file furin_data.csv from the webpage of the book, and open it with a text editor. You can see the rows and columns of the data in the file. The first step now is to set the exact path to the location of the file in the furin.file variable, and to use the read.csv() command to read its content into the my.data variable. Checking the structure of the data is done with the str() command.
furin.file<- '/path/to/your/file/furin_data.csv'
my.data<-read.csv(furin.file,sep="\t")
str(my.data)
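Two further base R commands give a quick first impression of a freshly loaded table:
head(my.data)     # the first six rows of the data frame
summary(my.data)  # basic descriptive statistics for each column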
All scripting languages provide simple ways to perform basic computational operations on data, and R is no different in that sense. Certainly, the most basic things like arithmetic operations work as expected. For example, adding and multiplying numbers works the same way as in math class:
4 + 7
6 * 2
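The other basic arithmetic operators behave just as predictably:
10 / 4    # division
2 ^ 10    # exponentiation
10 %% 3   # modulo (the remainder of a division)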
Of course, R is not the most suitable choice if only a calculator is needed. R is used to store numbers, information, and data, and also to perform different tricks and calculations on them. For those who know one programming language or another, it is clear that variables should be used. For the sake of those who are not familiar with these issues: variables are similar to labeled “shoe‐boxes” containing data items. During an analysis, the data items can be stored in these “shoe‐boxes” instead of being read from a file for each operation. The arrow mark (or assignment operator) is used for loading any data, for example a number, into these variables.
my.data <- 5
Now the number 5 is loaded into the my.data variable. The direction in which the arrow points tells the story. For example, the result of an operation can be stored in a variable, and later on the data inside the variable can be the subject of further operations.
my.data <- 5 + 3
my.other.data <- my.data * 2
Typing the name of the variable will show what is inside it:
>my.data
8
>my.other.data
16
Variables in R can store a great many different kinds of things: numbers, lists of numbers, strings, sequences, data tables, data matrices, entire genomes, or multiple sequence alignments. Several operations have different meanings depending on the kind of data they are applied to. R is smart enough to figure out whether a command has a different version specifically fit for a particular data type.
a <- 5
a + 3
b <- c(5,6,7)
b + 3
In the previous example, there are two very different variables: a and b. Variable a holds a single number (5), while variable b holds a vector of three numbers. When the addition (+) operator is applied to a single number (variable a), it adds 3 to that single number; when it is applied to a vector (b), it adds 3 to every number in the vector. This distinction is crucial, as the result of the first operation is a single value, while the result of the second one is a vector itself.
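This difference is easy to verify by checking the length of the results:
length(a + 3)  # 1 -- a single value
length(b + 3)  # 3 -- still a vector of three numbers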
R has this kind of smart redundancy, which is especially handy with the plot() command. There are several data types that are represented in graphs and figures, very often generated by the plot() command. Specific data types have their specific plots, and R packages are well prepared to draw different plots for them. Using the furin_data.csv file again as an example, different graphs can be generated by plotting one column of the data table (Figure 1.1), or all of them to check their correlation (Figure 1.2):
furin.file<- '/path/to/your/file/furin_data.csv'
my.data<-read.csv(furin.file,sep="\t")
plot(my.data$Naive.KO.1)
plot(my.data)
Figure 1.1 The plot() function produces a scatter plot when it is called on a column of a data frame. Unless specified differently, the values appear in their order in the data frame itself.
Figure 1.2 Calling plot() on multiple columns of a data frame results in a correlogram among the columns of the data frame.
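As a minimal sketch (the axis labels and title are illustrative), the appearance of such a plot can be tuned through additional arguments of plot():
plot(my.data$Naive.KO.1,
     xlab = "Measurement index",   # label of the x axis
     ylab = "Expression value",    # label of the y axis
     main = "Naive KO sample 1",   # plot title
     pch  = 19)                    # use filled circles as plotting symbols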
