R Programming for Mass Spectrometry

Randall K. Julian
Description

A practical guide to reproducible and high impact mass spectrometry data analysis

R Programming for Mass Spectrometry teaches a rigorous and detailed approach to analyzing mass spectrometry data using the R programming language. It emphasizes reproducible research practices and transparent data workflows and is designed for analytical chemists, biostatisticians, and data scientists working with mass spectrometry.

Readers will find specific algorithms and reproducible examples that address common challenges in mass spectrometry alongside example code and outputs. Each chapter provides practical guidance on statistical summaries, spectral search, chromatographic data processing, and machine learning for mass spectrometry.

Key topics include:

  • Comprehensive data analysis using the Tidyverse in combination with Bioconductor, a widely used software project for the analysis of biological data
  • Processing chromatographic peaks, peak detection, and quality control in mass spectrometry data
  • Applying machine learning techniques, using Tidymodels for supervised and unsupervised learning, as well as for feature engineering and selection, providing modern approaches to data-driven insights
  • Methods for producing reproducible, publication-ready reports and web pages using RMarkdown

R Programming for Mass Spectrometry is an indispensable guide for researchers, instructors, and students. It provides modern tools and methodologies for comprehensive data analysis. With a companion website that includes code and example datasets, it serves as both a practical guide and a valuable resource for promoting reproducible research in mass spectrometry.


Table of Contents

Cover

Table of Contents

Title Page

Copyright

Dedication

Foreword

Preface

Acknowledgments

About the Companion Website

Chapter 1: Data Analysis with R

1.1 Introduction

1.2 Modern R Programming

1.3 Bioconductor

1.4 Reproducible Data Analysis

1.5 Summary

Chapter 2: Introduction to Mass Spectrometry Data Analysis

2.1 An Example of Mass Spectrometry Data Analysis

2.2 Using the Tidyverse in Mass Spectrometry

2.3 Dynamic Reports with RMarkdown

2.4 Summary

Chapter 3: Wrangling Mass Spectrometry Data

3.1 Introduction

3.2 Accessing Mass Spectrometry Data

3.3 Types of Mass Spectrometry Data

3.4 Result Data

3.5 Example of Wrangling Data: Identification Data

3.6 Wrangling Multiple Data Sources

3.7 Summary

Chapter 4: Exploratory Data Analysis

4.1 Introduction

4.2 Exploring Tabular Data

4.3 Exploring Raw Mass Spectrometry Data

4.4 Chromatograms and Other Chemical Separations

4.5 Summary

Chapter 5: Data Analysis of Mass Spectra

5.1 Introduction

5.2 Molecular Weight Calculations

5.3 Statistical Analysis of Spectra

5.4 Summary

Chapter 6: Analysis of Chromatographic Data from Mass Spectrometers

6.1 Introduction

6.2 Chromatographic Peak Basics

6.3 Fundamentals of Peak Detection

6.4 Frequency Analysis

6.5 Quantification

6.6 Quality Control

6.7 Summary

Chapter 7: Machine Learning in Mass Spectrometry

7.1 Introduction

7.2 Tidymodels

7.3 Feature Conditioning, Engineering, and Selection

7.4 Unsupervised Learning

7.5 Using Unsupervised Methods with Mass Spectra

7.6 Supervised Learning

7.7 Explaining Machine Learning Models

7.8 Summary

References

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1: MS2 spectral quality summary.

Chapter 3

Table 3.1: Overall investigation description.

Table 3.2: Instrument method data.

Table 3.3: Some basic XPath expression syntax.

Table 3.4: MS experimental result file types.

Table 3.5: Vendor raw data formats.

Table 3.6: Open data formats for raw mass spectrometry data.

Chapter 5

Table 5.1: Calculation of monoisotopic mass for

Table 5.2: Common positive ion adducts and their m/z values

Table 5.3: Calculation of adducts to codeine: C18H21NO3

Chapter 7

Table 7.1: Cosine similarity search results

Table 7.2: Euclidean distance search results

Table 7.3: Tanimoto coefficient search results

Guide

Cover

Table of Contents

Title Page

Copyright

Dedication

Foreword

Preface

Acknowledgments

About the Companion Website

Begin Reading

References

Index

End User License Agreement


R Programming for Mass Spectrometry

Effective and Reproducible Data Analysis

Randall K. Julian

Indigo BioAutomation, Inc.

Carmel, IN, USA

Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750–8400, fax (978) 750–4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748–6011, fax (201) 748–6008, or online at http://www.wiley.com/go/permission.

Trademarks

Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty

While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762–2974, outside the United States at (317) 572–3993 or fax (317) 572–4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data has been applied for.

Print ISBN: 9781119872351

ePDF ISBN: 9781119872368

ePub ISBN: 9781119872399

oBook ISBN: 9781119872405

Cover Design: Wiley

Cover Image: © Valery Rybakow/Shutterstock

To Lauren:

You made it possible for me to start this book. Thank you for all your help while I took time away from you to finish it.

Foreword

This is not only a book about R coding; rather, it is an indicator of where we are in the journey from an analog real world to one ruled by digital data. We stand at an inflection point on the science/technology pathway. Behind us lie the golden ages of empirical discovery science, more attractive as they recede into the distant past…perhaps to the steppes of Central Asia, where the night skies whispered their thousand questions and imaginative answers – supported by primitive observations – were provided in response. Now, many mysteries have been removed and replaced by high-quality data. This is not an unalloyed good, and there must be some regret, even among those who do not march under the “Stop the Technology Madness” banner.

Author Randy Julian’s own life in science has covered the before and after of this inflection timepoint, between the sparse-data era and the current rich-data hegemony. He started his PhD in Chemistry in 1990 with measurements of reactive collisions of polyatomic ions impacting monolayer surfaces in a vacuum, as shown by the simple bond-formation reactions reported in Surface reactions and surface-induced dissociation of polyatomic ions at self-assembled organic monolayer surfaces [1].

Randy Julian was working in a basement lab where for decades the time taken for data to emerge from instruments was so slow that it was possible to think about it as it arrived, to adjust the planned experiment, and even to draft an outline of a planned scientific publication. A perfect resonance existed between the instrumentation and the experimenter. The dramatic nature of the changes then underway is seen in work from Randy Julian’s second year at Purdue, when he coded a multiparticle numerical simulation of the trajectories of realistic numbers of ions under the actual electrical fields of a mass spectrometer and matched it to the experimental mass spectrum, including peak shapes. Then, in a tour de force experiment described in Large Scale Simulation of Mass Spectra Recorded with a Quadrupole Ion Trap Mass Spectrometer [2], likely one of the earliest studies in analytical chemistry to use supercomputers, he simulated the individual ion dynamics of a system of thousands of ions in an electric field and subjected to collisions by harnessing computers ranging from the Connection Machine at Los Alamos National Laboratory to the Cray to solve the classical equations of motion in sub-microsecond increments using parallel and vector processing.

Three years later, as a fresh employee at Eli Lilly & Co. in Indianapolis, Dr. Julian persuaded management to provide him with a factory of mass spectrometers so that he could examine millions of chemical constituents generated by Lilly’s massive collection of natural product extracts. Many current drugs come from such sources, making it a worthwhile challenge. It took over two years to record the mass spectra of all these samples. Dr. Julian’s subsequent career has been along similar lines, only faster and wider in scope. His company (Indigo BioAutomation, Inc.) currently processes and validates some 300 million mass spectrometry/liquid chromatography clinical tests a month. They perform data quality assurance for every major diagnostics laboratory in the United States – it is likely that your physician has received data analyzed by Indigo. Had Randy lived in California instead of Indiana, a career like his would have been emblazoned on T-shirts and funded rock concerts. What does one say of a scientist whose command of experimental data is such that the difference, at six sigma, between Gaussian and Lorentzian curves can form the basis for solid medical decisions?

Randy Julian’s career has been centered on analytical chemistry, a topic with wide societal applications and now often referred to as measurement science to avoid the use of offensive terms. The subject has two components: the instrumentation used to make the measurement and the digital processing used to maximize the quality of the information output. Data science provides access to the information produced by instrumentation acting on chemicals. Dr. Julian’s work and this volume focus on the latter, but a major impact on the design and utilization of novel instrumentation is likely too. Instruments like mass spectrometers record spectra on an entire world of chemicals, in atomic, ionic, or molecular form, as pure materials or organized into biopolymers, or in the form of neurons or biofluids, or whole organisms. The rapid growth in data science and technology, which is central to this book, has allowed the extraction of detailed information, often biomedical in nature, from this data. This development transformed the way science is done and how it is reported. The author himself played a significant role in persuading the editors of leading journals to require that the published data on which published conclusions were to be based be archived and widely available. This has allowed quality checks on data to be performed post facto to ensure the validity of the conclusions. Not surprisingly, a “crisis of reproducibility” is being experienced as some important studies fail to withstand the withering examination now possible.

Unarguably, the landscape of science and technology has undergone highly significant changes in the past three decades, but what fueled this daemon? Perhaps the beast is omnivorous, requiring physics and microdevices and systems of knowledge and algorithms to be combined and harnessed to the instrumentation that provides the analytical chemical data. The traditional tools of academia – books and lectures – were successful in driving this transformation. Books like Diefenderfer’s ground-breaking Principles of Electronic Instrumentation addressed the marriage of physics and electrical engineering to analytical measurements. A required graduate course in the Analytical Chemistry program at Purdue University has also made a remarkable contribution to the combination of data handling and chemical measurement. Now in its fifth decade, CHM 621 was initiated and taught for many years by Prof. Fred Lytle, and it has served as the foundation upon which hundreds of PhD students built their experiments and data interpretations. Randy Julian was enrolled in this course in 1990, and during the next two years, he taught a course on how to write programs based on the principles taught in Fred’s course. The course has also had a wide impact over the years through samizdat lecture notes. After Randy left Eli Lilly & Co. and started Indigo BioAutomation, Fred retired from Purdue to join Randy at Indigo. In this book, Randy has updated his Purdue programming course notes and combined them with his experience over the past three decades. Meanwhile, Fred Lytle has also been reworking his lecture notes and will soon complete his book. Over the years, they have made a formidable team.

This book will help create big-data scientists, but it is likely also to stimulate readers to improve the next generation of analytical instruments. Data science has already allowed mass spectrometry to tame the ‘wildness’ of biology and produce reliable, actionable information. Randy Julian’s graduate work concerned a single bond-forming reaction occurring on a nearly perfectly characterized surface, and the readout consisted of intensity measurements in two channels of mass using essentially time-invariant signals. The data processing tools in this book accommodate millions of analytes from millions of patients. As Brison Shira, a current PhD student, commented, “Randy once applied MS to single analytes, and he now applies it to populations of patients.” There is power in this book.

R. Graham Cooks

West Lafayette, June 2024

Preface

This book will teach you how to analyze data generated by mass spectrometers using R [3]. The modern mass spectrometer is a marvel of science and engineering. What was once an imposing instrumental method of analysis with limited application has now become a workhorse in research and industry. At the same time, the boundaries of measurement capability are rapidly expanding with each new generation of analyzer, detector, and ionization method. At the outer limits are instruments that still act like temperamental thoroughbreds, which, on any given day, deliver extraordinary results or confusing noise. Workhorse instruments, on the other hand, often operate in a factory-like mode, producing data that is changing how we discover and develop drugs, diagnose and treat diseases, and understand our drinking water, food, the oceans, and the atmosphere. Mass spectrometers are used for measurements in such large numbers that ensuring problems with data analysis do not corrupt results is critical. Well below the limits of performance, mass spectrometers can generate such huge volumes of complex data that the analysis is beyond simple statistics and enters the domain of data science. For nearly all of the uses of mass spectrometry, there is a need for more advanced and more reproducible data analysis than can be done in spreadsheets.

The Main Goal of this Book

The main goal of this book is to show how to analyze mass spectrometry data effectively and reproducibly using the R programming language. Any mass spectrometrist can learn to go beyond spreadsheets and build data analysis solutions using R in a reasonable amount of time. My approach will be like climbing a ladder. Through the lens of mass spectrometry, I will start by introducing native features of the R language. On the next rung are the packages that simplify data storage and retrieval, data manipulation, statistics, and visualization. The next step uses packages originally created for molecular biology tasks that also work with data from mass spectrometers. Further up the ladder are mass spectrometry-specific packages for manipulating and analyzing the data these instruments generate. Beyond that, the ladder goes on, but this book will end on the machine learning rung, far from the top.

Because the intended audience for this book is relatively broad, different sections will be of more value to some readers than others, so hopefully, familiar parts can be skipped. The example code is intended to show techniques and methods for analyzing mass spectrometry data that are effective and reproducible. Within the example code, I hope you will find solutions to common problems that repeatedly appear in the analysis of mass spectrometry data. A word of warning: this is a code-heavy book, and the code is meant to be read. If some of the syntax is unfamiliar, please refer to one of the many excellent books on R data analysis. Along the way, I will provide pointers on where to find more information outside the scope of this book, and I hope that some of the references will provide additional reading in areas of interest.

What You Will Learn

You will learn to analyze mass spectrometry data using R in a way that is widely accepted and supported by the data science community. In addition, you will learn to use various packages beyond the main R program to organize data, programs, and reports. Using examples from mass spectrometry research, you will learn how to understand your data, wrangle it into easy-to-manage structures, perform exploratory data analysis, visualize, and then analyze it to produce reproducible findings. You will also learn how to integrate description and discussion with data and code so you can build web pages and manuscripts about your analysis that other researchers can reproduce.

Conventions

This book contains three types of text: regular text, examples of R code, and output from code. In the body of the book, references to elements of the R programming language will appear in a monospaced text typeface. Code that you can type in and execute will be in a gray box and look like this:

a <- 1 + 1

R implicitly calls print() when a single variable name is given, but sometimes I will call print() explicitly:

a

## [1] 2

Console output is shown prefixed with two hash symbols, ##. These symbols distinguish program output from other text in this book and allow examples to be copied into R, where lines beginning with ## are treated as comments. If the output is a series, there will be a leading number like [1] indicating the starting index of the series printed on that line.
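For example (a minimal illustration; where the lines wrap depends on your console width), printing a longer vector shows the starting index of each output line in brackets:

100:130

## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128 129 130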

Getting Started

To get started, download the latest version of R for your machine. R is an open-source project and is free. Installation packages are available for macOS, Windows, and Linux. The R project webpage is https://www.r-project.org, and you can find installation packages there.

I worked on the examples for this book using the RStudio integrated development environment [4]. It is also free and runs on macOS, Windows, and Linux. R can be used with its own user interface or from the command line, and it can also be used from other environments and editors. The examples should work with whichever you are comfortable with.

Once you have R installed, there are several add-on packages that you will need to run the examples. RStudio makes it easy to install and update packages from the R project repository called CRAN. In some chapters, I will give directions for the installation of select packages, but the following are the basic packages needed for most of the examples in this book.

list.of.packages <- c("tidyverse", "tidymodels", "rmarkdown")
install.packages(list.of.packages)

For mass spectrometry-specific packages, I will mostly rely on the Bioconductor project repository [5]. Unlike CRAN packages, Bioconductor packages are built around a core set of packages, and the entire collection is designed to be as interoperable as possible. In addition, the Bioconductor project is versioned as a whole and operates on its own release schedule to help improve interoperability and consistency. The BiocManager package [6, 7] is a specialized package management system used to install packages from Bioconductor. To install Bioconductor packages, install its package manager first:

install.packages("BiocManager")

Several packages are used for reading raw mass spectrometry data. mzR [8–13] is used for fast, low-level reading of open-format XML files. MSnbase [14, 15] and Spectra [16] are both higher-level packages that can use mzR, MsBackendMgf [17], MsBackendMsp [18], and other backends (format-specific file interfaces) to read data. Most packages used in this book are from either Bioconductor or CRAN; however, in Section 5.2.4, I will show how to install packages from other repositories, specifically from the source code repository system called GitHub.

To install Bioconductor packages, you just use the install() function from the BiocManager package:

bioc.packages <- c("MSnbase", "Spectra", "mzR", "MsBackendMgf", "MsBackendMsp")
BiocManager::install(bioc.packages, update = FALSE)

After you have these packages installed and can run RStudio, you are ready to start.
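As a quick check that the mass spectrometry packages work (a minimal sketch; the mzML file path here is hypothetical), the Spectra package can read an open-format file through its mzR-based backend:

# Read a (hypothetical) mzML file using the mzR-based backend
library(Spectra)
sp <- Spectra("data/example.mzML", source = MsBackendMzR())

# The number of spectra read from the file
length(sp)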

About the Code and Examples in this Book

The code and information in this textbook have been carefully reviewed and tested to the best of the author’s ability. However, as with any programming resource, errors or inaccuracies may occur. The author and publisher make no warranties or representations regarding the accuracy, completeness, or suitability of the code and information presented.

Readers who use or implement any code from this book do so at their own risk. Neither the author nor the publisher shall be held liable for any damages or consequences arising from the use of the information or code contained herein.

It is recommended that readers thoroughly test and validate any code before using it in production environments.

The examples, citations, and references to external work or products in this book are used for instructional purposes only and do not constitute an endorsement by the author or publisher. Such references are provided solely to illustrate concepts and techniques discussed in the text.

Acknowledgments

I have many people to thank for being able to share this book with you. First, I would like to thank everyone at Indigo BioAutomation. I am fortunate to work on such an incredible team. I want to thank Prof. Fred Lytle, who mentored me as a graduate student and became an even more significant influence and friend after retiring from Purdue and coming to Indigo. I’d also like to thank Rick Higgs, who got me started with R in the 1990s and introduced me to machine learning before it was a buzzword. None of this work would have been possible without my PhD adviser, Prof. R. Graham Cooks, who, besides teaching me mass spectrometry, allowed me to teach a programming class to chemistry graduate students before finishing my thesis and then arranged for me to teach a graduate-level data science course at Purdue. I am forever thankful for my experiences at Purdue. Thank you to Russ Grant, Nigel Clarke, Brian Rappold, Patrick Mathias, and Shannon Haymond, with whom I’ve worked and taught and who are now dear friends. You’ve all greatly impacted my development as a scientist and a person. I would especially like to thank Stephen Master, who provided valuable comments and suggestions on an early draft of the book. I’d also like to thank the Harrold family: Dave, Chris, and Amber, for running the Mass Spectrometry & Advances in the Clinical Laboratory (MSACL) conference and giving me the opportunity to teach short courses and help expand the data science program. So many people have helped in my development that I cannot thank them all here, but no one helped me more than my parents, who supported my early addiction to programming computers. Thanks, Dad, for teaching me about computer hardware, and Mom, for putting up with me staying up all night with my brother Mike writing computer games.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/julianrprogramming

The website includes:

Figures

Code

Data

Chapter 1: Data Analysis with R

This chapter will give an overview of R, the base R libraries, the Tidyverse packages, the Bioconductor project, and RMarkdown. I will also describe R scripting and the RStudio integrated development environment (IDE). If you are familiar with these topics, feel free to skip this introduction. The goal is for you to have a working R development environment, understand the basic ideas behind the tidyverse and Bioconductor projects, and be able to use libraries and packages from both the Comprehensive R Archive Network (CRAN) and Bioconductor.

1.1 Introduction

The R programming language [19] is an open-source project inspired by both the S language [20] and Scheme [21]. Over the decades since its initial development, the data science community has embraced R to an extraordinary level. While you can use almost any programming language for data science, R was one of the first freely accessible languages to make statistics its primary focus. Statistics is one of those subjects in which expert guidance is practically a necessity. For a nonstatistician, having highly reliable statistical functions improves the quality of analysis, especially compared to writing statistical algorithms from scratch. R is an interpreted language, and a community of dedicated experts continually updates it. Some of the best computational statisticians in the world actively support the statistical functions available in R. On top of these incredible contributions, the applied statistical community has created a fantastic array of add-in packages to handle specific analysis requirements. The core components of R and its vast library of packages allow for a wide range of statistical and visual analyses.

So why learn a programming language like R instead of just using a spreadsheet program like Excel? That’s a good question, and it has a good answer. Excel has become very powerful over the years but has significant drawbacks for demanding data analysis tasks. First, each cell in a spreadsheet can be any data type; you can’t tell what it is by looking. A cell might look like a date, but it might also be a string. Or it could have a formula that produces the content. The formula likely references other cells and is often created by cutting and pasting. Performing calculations this way makes all but the most trivial spreadsheets challenging to test and debug. Despite these limitations, almost all of us use spreadsheets for some tasks, and we have all experienced errors when working with them. This lack of robustness keeps most people working in data science away from spreadsheets. The one thing spreadsheets seem particularly good at is creating and editing text files (usually saved and loaded as comma-separated value or “CSV” files), but even here, trouble is just waiting to strike. CSV files often have a header that gives the names of the columns. When loaded into a spreadsheet, this row becomes just another row in the sheet. When a spreadsheet has no header row in the data, a text file created from it will also have no header; because the program displays its own column labels at the top of the sheet, any application-specific column names exist only if they appear as text in the first data row. If someone reads the resulting text file assuming that a header is present when it’s not, the first row of numeric data will be consumed as the header, and all of the data will then be loaded as if the read function skipped the first row. While it sounds trivial, mishandling header rows in spreadsheets has done tremendous damage to data analysis over the years. If you use a spreadsheet to help edit data, be careful in later analysis steps.

Another famous problem with spreadsheets is that some information will be interpreted by programs like Excel as a date when it is actually a string that merely looks like a date. Excel will quietly change your data without warning, and if you don’t catch it, some of the values may be corrupted by the string-to-date conversion when you save your file. You can see a concrete example of this error: load a file that contains chemical abstract service (CAS) registry numbers. If you load the CAS number 6538-02-9 into Excel, for example, it will convert it into the date 2-9-6538, and if you then convert it to a number, you will get 1694036 (this is from an actual Microsoft support case from 2017, which I reproduced at the time of writing). People doing data science use spreadsheets all the time, but you have to be very careful and look out for at least these two big problems.
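One defense when such files come back into R (a sketch; the file and column names here are hypothetical) is to pin identifier columns to the character type so nothing is silently reinterpreted:

# Force the CAS registry number column to be read as a string,
# not guessed as a date or number (hypothetical file and column)
cas <- read.csv("cas_numbers.csv", colClasses = c(cas_rn = "character"))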

You can perform data analysis in any computer programming language. While I will not cover them, Python and Julia are first-rate languages and good choices for any data analysis project. Python, in particular, has been the go-to language for the exploding machine-learning community. Like R, Python is an interpreted language with excellent community support. Many data analysts learn R and Python and switch between them depending on the project. The main difference is that statistical analysis is the central focus of R, whereas Python is a general programming language with good statistical libraries. Julia is different. Its community motto is: “Walk like Python; Run like C.” Julia is faster than Python and R in most cases, depending on the libraries you use. I encourage everyone working in data analysis to become familiar with Python and R. It will also pay to be aware of Julia. All three languages will run as automated scripts, and all three have development environments for writing more complex programs. Recently, there has been a trend toward using a notebook environment for programming, especially for Python with its almost addictive Jupyter Notebook system. Notebook environments allow mixing code with text by putting each in different types of cells. Opening a notebook and typing natural language in some cells and code in others is a very agile way to work with code and data. However, working in a notebook can sometimes produce a mindset that you are not actually developing a program but just a document with some code mixed in. That mindset can lead to a lot of cut-and-paste programming and other practices that make for messy and hard-to-reproduce analysis. It’s not a defect of the notebook concept, but something to guard against when using notebooks. Some people will start in a notebook environment, and if the program becomes complex, they will switch to an IDE. The method of mixing natural language text and code is so powerful that the approach can be used directly in the RStudio IDE for R. With RStudio, you don’t have to choose between working in an IDE or a notebook since both practices are supported.

R supports mixing natural language and code using the knitr package to implement literate programs [22], introduced below. One of my main objectives here is to show analysts how to improve the reproducibility of mass spectrometry data analysis. I will return to using R combined with knitr and RMarkdown to create literate programs throughout the book.
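To give a flavor of what is coming (a minimal sketch; the title and values are invented), an RMarkdown document interleaves prose with executable R chunks:

---
title: "Example Report"
output: html_document
---

The mean exact mass of three example metabolites is computed below.

```{r}
mz <- c(104.0473, 169.0851, 300.1725)
mean(mz)
```

When the document is rendered, the chunk runs and its output appears in the report next to the text.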

1.2 Modern R Programming

This section will teach you how to use R as a scripting language for batch processing and from within the IDE RStudio. Further, you will learn about the base packages of R and the modern approaches to data management and analysis introduced by the tidyverse collection of packages, including the plotting system provided by the ggplot2 package.

1.2.1 R as a Scripting Language

As described earlier, R belongs to the family of interpreted languages. In UNIX-type systems, languages like Perl, Ruby, and Python, as well as shell scripts, can be run as scripts by the OS. Any R program can be typed into a text editor and run from the command line as a script.

Take this trivial program:

# This program should be saved in a file called "hello.R"
print("Hello, R")

To run this example and have the output displayed in the console, you can use the Rscript program:

Rscript hello.R

The output to the console will be:

[1] "Hello, R"

When you want to run an R program as part of a noninteractive, automated process, you can use batch mode. Running in batch mode allows you to pass arguments to the program and have the output go to a file rather than the console. Starting the R interpreter with the options CMD BATCH puts the program into batch mode. The R interpreter will assume that the working directory is the current directory, which you may need to change depending on how your system runs automated scripts.

# leading './' is for the macOS, change this for your OS

R CMD BATCH ./hello.R

This will send all of the output of the program to a file called hello.Rout. In this case, the output is:

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"

Copyright (C) 2023 The R Foundation for Statistical Computing

Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.

Type 'contributors()' for more information and

'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.

> print("Hello, R")

[1] "Hello, R"

>

> proc.time()

   user system elapsed

  0.130 0.037 0.150
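Scripts run this way can also receive command-line arguments through the commandArgs() function. A minimal sketch (the script name and arguments are hypothetical):

# args.R -- print the arguments passed to the script
args <- commandArgs(trailingOnly = TRUE)
print(args)

Running Rscript args.R input.csv 42 prints:

[1] "input.csv" "42"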

While running R programs as scripts from the command line is helpful, it is much more typical to write and run programs in an IDE. IDEs have been available for languages like C/C++ and other compiled languages for decades. Various R installation packages also come with an IDE called the R GUI. It is sufficiently powerful to allow anyone to get started, though it lacks many of the convenient features of RStudio. As data science has matured, additional tools have become available. In this book, I will focus on using R via RStudio, a powerful and popular IDE for R.

1.2.2 RStudio

Software development and engineering tools have matured over the years. With the arrival of high-speed hardware, there has been a revival of interpreted languages like R. Interpreted languages allow development environments extra flexibility by using real-time interpretation to assist the programmer while writing a program. Since R is intended primarily as a statistical analysis language, a good IDE makes it easy to write and test code, see plots, and examine data. RStudio extends the concept of the IDE by integrating with the powerful report-generating packages used for reproducible research. RStudio is free (as in both open source and beer), and the RStudio team has shown itself to be a dedicated contributor to R, supporting some of the most important packages, including the tidyverse and many others.

There are versions of RStudio for Windows, Mac, and many Linux distributions, and since it is open source, you can build it yourself from the source code if there is no binary distribution for your OS. To get started, go to the RStudio website (rstudio.com) and select the download for your machine. Each binary has an installer with instructions. Once you have RStudio installed, you can run it, and you should see something like Figure 1.1.

Figure 1.1 RStudio startup interface.

1.2.3 Base R

Much of the power of R comes from the large collection of base libraries developed by the R community. Currently, there are 14 base packages and 15 recommended packages [23]. These packages allow users to perform various statistical analyses and data visualization. The various distributions of R incorporate the base and recommended packages. Over time, as new packages are developed, they are usually shared through CRAN, a repository created to make them available to the R community. The recommended packages come from CRAN and are installed with most distributions of R. Together, the base and recommended packages have become known as Base R, which is usually sufficient for most statistical analysis and data plotting tasks.

Base R provides the mechanisms for essential data manipulation on several fundamental data types. Beyond scalar variables, base R allows you to manipulate vectors, sequences, matrices, lists, and strings. Probably the most significant data type provided in base R is the data.frame. A data.frame is a rectangular table where each column is assigned a data type and can have a name. Each row can also be named, even if the name is simply a row number. What makes data frames (and the newer data types derived from them) powerful is that data can be manipulated and selected with conditional statements. The syntax of data frame operations can be slightly confusing, but learning it allows you to work with data in R in ways that are much easier than in most programming languages.
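As a small illustration (the values are invented), a data.frame can be built directly from vectors, with each named column holding a single type:

# Each column is a vector with one data type and a name
peaks <- data.frame(
  mz        = c(104.0473, 169.0851, 300.1725),
  intensity = c(1200, 450, 87),
  adduct    = c("[M+H]+", "[M+H]+", "[M+H]+")
)
str(peaks)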

1.2.4 Basics of Data Frames

To demonstrate how to use the data.frame, I’ve extracted a part of the Human Metabolite Database [24–28] into a CSV file. CSV files are simple text files that usually contain the column names in the first row. The base R function to read a CSV file is conveniently named read.csv().

hmdb_df <- read.csv(file.path("data", "hmdb_urine_metabolites.csv"))
str(hmdb_df, width = 72, strict.width = "cut")

## 'data.frame': 4692 obs. of 6 variables:

## $ accession: chr "HMDB0000001" "HMDB0000002" "HMDB0000005" "HMDB000"..

## $ name : chr "1-Methylhistidine" "1,3-Diaminopropane" "2-Ketobu"..

## $ formula : chr "C7H11N3O2" "C3H10N2" "C4H6O3" "C4H8O3" ...

## $ exact_mw : num 169.1 74.1 102 104 300.2 ...

## $ smiles : chr "CN1C=NC(C[C@H](N)C(O)=O)=C1" "NCCCN" "CCC(=O)C(O)"..

## $ status : chr "quantified" "quantified" "quantified" "quantified"..

The str() function shows the structure of any R object, and in this case, it shows that hmdb_df is a data.frame with 4692 rows (observations) and 6 variables (columns). The columns have both names and types. Here the column names are given next to the $ symbol, followed by the data type for that column. The str() function also shows a sample of the data in each column.

Once data is in a data.frame, there are several ways to access specific elements, depending on your needs. For example, you can access data by row, column, or specify both.

Figure 1.2 shows the syntax to access elements of a data.frame.

Figure 1.2 Accessing elements of a data.frame.

A data.frame is a collection of rows and columns. Each column has a specific data type and can have a name. Each row can also have a name that can be used to access particular observations. Columns and rows can also be accessed by their index value, which is an integer number. In R, it’s important to remember that indexing always starts at 1 rather than 0, as in many other languages.

Accessing elements of a data.frame uses the square bracket notation: df[row,column]. In Figure 1.2, you can see that selecting the element in the first row and the third column (the value of the chemical formula in the first row) is simply hmdb_df[1,3]. Besides an index value, the third column has the name formula. R can use the $ symbol to access a column by name: hmdb_df$accession returns a vector of values from the first column, and hmdb_df$formula returns a vector of values from the third column. The elements of the returned vector can be accessed with the same [] notation, so hmdb_df$formula[1] returns the first element in the formula column. Access by name is helpful when you know which column you want, and access by index is helpful when using numeric loops to step through columns. To access all the data in the first column, you just leave the row element empty: hmdb_df[,1]. Leaving the row or column value blank returns all the elements, so this statement returns all the rows from column 1.

In Figure 1.2, when single integers are used for rows or columns, what is returned is a vector of the column’s data type with no name.

head(hmdb_df[, 1])

## [1] "HMDB0000001" "HMDB0000002" "HMDB0000005" "HMDB0000008" "HMDB0000010"

## [6] "HMDB0000011"

class(hmdb_df[, 1])

## [1] "character"

The extract operator $ also returns a vector for the name given:

head(hmdb_df$accession)

## [1] "HMDB0000001" "HMDB0000002" "HMDB0000005" "HMDB0000008" "HMDB0000010"

## [6] "HMDB0000011"

The sequence operator : can also be used in the row and column position to specify a range of rows or columns to be returned. One important thing to notice is that when using the : operator, the class returned is a data.frame rather than a vector.

head(hmdb_df[1:4, ])

## accession name formula exact_mw

## 1 HMDB0000001 1-Methylhistidine C7H11N3O2 169.0851

## 2 HMDB0000002 1,3-Diaminopropane C3H10N2 74.0844

## 3 HMDB0000005 2-Ketobutyric acid C4H6O3 102.0317

## 4 HMDB0000008 2-Hydroxybutyric acid C4H8O3 104.0473

## smiles status

## 1 CN1C=NC(C[C@H](N)C(O)=O)=C1 quantified

## 2 NCCCN quantified

## 3 CCC(=O)C(O)=O quantified

## 4 CC[C@H](O)C(O)=O quantified

class(hmdb_df[1:4, ])

## [1] "data.frame"

Another way to return a subset of a data.frame as a data.frame is to combine the [] operator with a string name of the column:

head(hmdb_df["exact_mw"])

## exact_mw

## 1 169.0851

## 2 74.0844

## 3 102.0317

## 4 104.0473

## 5 300.1725

## 6 104.0473

class(hmdb_df["exact_mw"])

## [1] "data.frame"

One of the most powerful aspects of the [] operator in R is that a boolean vector can be used in place of the sequence generated by the : operator. Using boolean vectors instead of numeric sequences allows subsetting based on conditional statements:

hmdb_df[hmdb_df["formula"] == "C4H8O3", ]

## accession name formula exact_mw smiles

## 4 HMDB0000008 2-Hydroxybutyric acid C4H8O3 104.0473 CC[C@H](O)C(O)=O

## 6 HMDB0000011 3-Hydroxybutyric acid C4H8O3 104.0473 C[C@@H](O)CC(O)=O

## 14 HMDB0000023 (S)-3-Hydroxyisobutyric acid C4H8O3 104.0473 C[C@@H](CO)C(O)=O

## 188 HMDB0000336 (R)-3-Hydroxyisobutyric acid C4H8O3 104.0473 C[C@H](CO)C(O)=O

## 228 HMDB0000442 (S)-3-Hydroxybutyric acid C4H8O3 104.0473 C[C@H](O)CC(O)=O

## 358 HMDB0000710 4-Hydroxybutyric acid C4H8O3 104.0473 OCCCC(O)=O

## 367 HMDB0000729 alpha-Hydroxyisobutyric acid C4H8O3 104.0473 CC(C)(O)C(O)=O

## status

## 4 quantified

## 6 quantified

## 14 quantified

## 188 quantified

## 228 quantified

## 358 quantified

## 367 quantified

Notice that the row component of the [row,column] statement is a boolean vector with a value of TRUE for every row in which the formula column value is equal to the string “C4H8O3.” This time, it is the column specification that is left empty, so the statement returns all columns of the matching rows. Since the row specification was not a single value but a vector, a new data.frame is returned. Using conditional statements to subset a data.frame is very powerful, but complex filtering can require convoluted conditional statements, which are hard to debug, as the example below shows. The next section introduces a more modern method for subsetting based on filtering.
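For instance (an illustration using the same data), several conditions can be combined with the & operator, and the expression quickly becomes hard to read:

# All quantified metabolites with an exact mass between 100 and 110
hmdb_df[hmdb_df$exact_mw > 100 & hmdb_df$exact_mw < 110 &
        hmdb_df$status == "quantified", ]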

There is much more to subsetting than the basics I’ve shown here. You can find details in the “Subsetting” chapter in Wickham’s excellent text: Advanced R [29].

1.2.5 The Tidyverse

A relatively new approach to managing data in R introduced a notable advance in R programming called “Tidy Data.” Programs that follow tidy data principles manage and access data robustly, similar to modern database techniques.

The tidyverse is a collection of R packages intended to make it easier to write readable R code and support reproducible research. It was released in 2016 and is described in R for Data Science [30], which is available free at https://r4ds.had.co.nz/. The tidyverse team continually updates the packages, and the project enjoys broad support from the R community.

To use the tidyverse, simply load it using the library() function:

library(tidyverse)

Loading the tidyverse metapackage adds the packages ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.

Notice that dplyr masks two functions from the stats package: filter() and lag(). The filter() function in dplyr is central to how the tidyverse performs data subsetting. The lag() function in dplyr creates a new vector offset from a given vector by one element. When functions with the same name are loaded from different packages, you can specify the one you want by naming the package and using the :: operator. The name filter() is probably one of the most frequently reused function names among R packages useful for mass spectrometry, because the verb filter has many different meanings in this domain. In some cases, you want to filter rows in a table. Other meanings include filtering a range of m/z values or retention times. It can also mean applying a signal processing filter to a dataset. It is often necessary to access a specific version of filter() by putting the package name first, like stats::filter(). As more libraries are used, it becomes more important to specify the package name.
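As a quick illustration of the :: prefix (a sketch using the data loaded earlier), the two functions do entirely different things:

# stats::filter() applies a convolution filter to a numeric series
stats::filter(c(1, 3, 5, 7, 9), rep(1, 3) / 3)

# dplyr::filter() subsets the rows of a table by a condition
dplyr::filter(hmdb_df, formula == "C4H8O3")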

1.2.6 Tidy Data

The main idea behind tidy data is that in a table, each column represents a variable, and each row represents an observation. You can create objects in which every element is of a different type, but that data is difficult to deal with. In the tidyverse, if you need to represent observations with a variable that is duplicated across many observations, you could add a new variable (column), but it might be a sign that you should create a new table. Rather than duplicating observations, two tables can be constructed and then related by a unique, shared variable, as sketched below. In database management, this approach is called normalization, which aims to remove duplicate entries by using multiple tables. There are situations where you don’t want highly normalized data in multiple tables. For example, systems like data warehouses or data lakes use a single table with a high degree of duplication to simplify filtering, grouping, and aggregation by avoiding joining multiple tables. The tidyverse team has explicitly declared that it is an opinionated project with clearly stated ideas about what constitutes correct data management. The functions in the dplyr package, which provides the tools to manage data, are designed with normalized, tidy data in mind. However, many application-specific packages in R do not follow the tidy data principles. In this book, I will focus on tidying the data returned by many application-specific functions, so your data is easier to understand and manipulate. A focus on tidy data will also have the effect of making your data more compatible with other projects, like Tidymodels, that have grown up around the tidyverse ecosystem.
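A minimal sketch of this idea (the table and column names are invented): batch information is stored once in its own table and related to the samples by a shared key, rather than repeated on every row:

# Two related tables joined on the shared batch_id variable
samples <- tibble(sample_id = c("S1", "S2", "S3"),
                  batch_id  = c("B1", "B1", "B2"))
batches <- tibble(batch_id   = c("B1", "B2"),
                  instrument = c("QTOF-1", "QQQ-2"))
left_join(samples, batches, by = "batch_id")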

1.2.7 The tibble: An Improved data.frame

The data.frame is an incredibly useful data structure and is one reason data analysis in R is superior to using spreadsheets. However, when analyzing complex data, the data access and manipulation syntax can sometimes become hard to read. One of the goals of the tidyverse approach is to use functions to perform data organization and manipulation at a higher level, rather than having to resort to low-level base R syntax. Central to the tidyverse metapackage is the dplyr package (pronounced dee-ply-er), which implements tidy manipulation functions. The dplyr package relies on a modernized version of the data.frame class called tibble, from the tibble package. A tibble is a subclass of the data.frame class, which means tibbles are data.frames, and anything that works with a data.frame also works with a tibble. However, a tibble is a simplified version of the data.frame class, which, for example, doesn’t use row names, among other internal data representation changes. The tibble class overloads many of the data.frame functions. The result is that the default behavior of a tibble is, in some cases, quite different from the data.frame parent class.

Data can be read directly into a tibble from a file, or you can create a tibble from an existing data.frame.

hmdb <- as_tibble(hmdb_df)
print(hmdb)

## # A tibble: 4,692 x 6

## accession name formula exact_mw smiles status

## <chr> <chr> <chr> <dbl> <chr> <chr>

## 1 HMDB0000001 1-Methylhistidine C7H11N3O2 169. CN1C=NC(C[C@H](N~ quant~

## 2 HMDB0000002 1,3-Diaminopropane C3H10N2 74.1 NCCCN quant~

## 3 HMDB0000005 2-Ketobutyric acid C4H6O3 102. CCC(=O)C(O)=O quant~

## 4 HMDB0000008 2-Hydroxybutyric acid C4H8O3 104. CC[C@H](O)C(O)=O quant~

## 5 HMDB0000010 2-Methoxyestrone C19H24O3 300. [H][C@@]12CCC(=O~ quant~

## 6 HMDB0000011 3-Hydroxybutyric acid C4H8O3 104. C[C@@H](O)CC(O)=O quant~

## 7 HMDB0000012 Deoxyuridine C9H12N2O5 228. OC[C@H]1O[C@H](C~ quant~

## 8 HMDB0000014 Deoxycytidine C9H13N3O4 227. NC1=NC(=O)N(C=C1~ quant~

## 9 HMDB0000015 Cortexolone C21H30O4 346. [H][C@@]12CC[C@]~ quant~

## 10 HMDB0000017 4-Pyridoxic acid C8H9NO4 183. CC1=NC=C(CO)C(C(~ quant~

## # i 4,682 more rows

The first observable difference between a tibble and a data.frame is the output of print(). In addition to limiting the default output to 10 rows, print() gives extra information about the shape and the column types. In a tibble, variables (columns) still have names, but rows (observations) do not.

The dplyr package provides all the subsetting and manipulation functions needed to work with a tibble. Idiomatic tidyverse programming using tibbles generally avoids the [] selection operator, but like many idioms, this convention is often ignored, and many programs move between tibbles and data.frames without strict adherence to tidyverse conventions. This blending of styles allows flexibility, as tibble objects are compatible with base R data frame operations, but it can sometimes lead to confusion or unexpected behavior when functions treat tibble objects differently from traditional data frames.
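One concrete example of this difference, using the objects created above: selecting a single column with [ drops a data.frame to a plain vector, while a tibble stays a tibble:

class(hmdb_df[, "exact_mw"])

## [1] "numeric"

class(hmdb[, "exact_mw"])

## [1] "tbl_df" "tbl" "data.frame"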

To select the first column as a vector, the pull() function is used with the column number:

head(pull(hmdb, 1))

## [1] "HMDB0000001" "HMDB0000002" "HMDB0000005" "HMDB0000008" "HMDB0000010"

## [6] "HMDB0000011"

The same output can be obtained using the variable name:

head(pull(hmdb, accession))

To extract the first column as a tibble, the dplyr::select() function is used:

dplyr::select(hmdb, exact_mw)

## # A tibble: 4,692 x 1

## exact_mw

## <dbl>

## 1 169.

## 2 74.1

## 3 102.