Biostatistics with Python - Darko Medin - E-Book

Description

This book leverages the author’s decade-long experience in biostatistics and data science to simplify the practical use of biostatistics with Python. The chapters show you how to clean and describe your data effectively, setting a solid foundation for accurate analysis and proficiency in biostatistical inference to help you draw meaningful conclusions from your data through hypothesis testing and effect size analysis.
The book walks you through predictive modeling to harness the power of Python to create robust predictive analytics that can drive your research and professional projects forward. You'll explore clinical biostatistics, learn how to design studies, conduct survival analysis, and synthesize evidence from multiple studies with meta-analysis – skills that are crucial for making informed decisions based on comprehensive data reviews. The concluding chapters will enhance your ability to analyze biological variables, enabling you to perform detailed and accurate data analysis for biological research. This book's unique blend of biostatistics and Python helps you find practical solutions that make complex concepts easy to grasp and apply.
By the end of this biostatistics book, you’ll have moved from theoretical knowledge to practical experience, allowing you to perform biostatistical analysis confidently and accurately.

Available formats: EPUB, MOBI

Page count: 388

Publication year: 2024




Biostatistics with Python

Apply Python for biostatistics with hands-on biomedical and biotechnology projects

Darko Medin

Biostatistics with Python

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The author acknowledges the use of cutting-edge AI, such as ChatGPT, with the sole aim of enhancing the language and clarity within the book, thereby ensuring a smooth reading experience for readers. It’s important to note that the content itself has been crafted by the author and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Associate Group Product Manager: Niranjan Naikwadi

Publishing Product Managers: Sanjana Gupta and Yasir Khan

Book Project Manager: Shambhavi Mishra

Senior Editor: Tiksha Lad

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Proofreader: Tiksha Lad

Indexer: Hemangini Bari

Production Designer: Prashant Ghare

Senior DevRel Marketing Executive: Vinishka Kalra

First published: November 2024

Production reference: 1221124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83763-096-7

www.packtpub.com

Contributors

About the author

Darko Medin is a researcher and a biostatistician who graduated from the Faculty of Mathematics and Natural Sciences, Experimental Biology and Biotechnology, University of Montenegro. Darko is an expert biostatistician, especially in the fields of research and development in the biotech and pharma industries. He is a Python-based data scientist with more than 10 years of experience in the areas of clinical biostatistics and biomedical research. As a biologist and data scientist, he has worked with many research companies and academic institutions around the world and is an experienced machine learning and AI developer.

About the reviewers

Meghal Gandhi is currently a software engineer and machine learning researcher at Charles R. Drew University of Medicine and Science based in Los Angeles. He holds a master’s in computer science from California State University, Fullerton. While working in AI in healthcare, he built machine learning and deep learning models to predict the risk of getting diseases based on medical records. His research work has been published in prestigious medical journals and conferences. Prior to this, he worked as a software engineer on telecommunication and performance engineering projects at AT&T.

Russell Reeve, PhD, is VP and global head of biostatistics at Syneos Health and was formerly VP and global head of biostatistics at IQVIA. He provides advice on clinical trial design and analysis, is responsible for delivery, and leads machine learning initiatives. At IQVIA, Dr. Reeve led the development of three successful SaaS products, including a machine learning solution for subgroup identification and synthetic patient generation. Dr. Reeve has provided trial design advice for hundreds of trials, including for adaptive and platform trials, and has over 30 peer-reviewed papers. He has presented over 50 conference presentations on his research. Dr. Reeve received his doctorate in statistics from Virginia Polytechnic Institute and State University.

Table of Contents

Preface

Part 1: Introduction to Biostatistics and Getting Started with Python

1

Introduction to Biostatistics

Why do we need biostatistics in life sciences?

Biostatistics in human life sciences

Biostatistics for biology

Biostatistics in epidemiology and public health

Biostatistics in medicine and biomedical research

Biostatistics in zoology and botany

Biostatistics in ecology

Biostatistics in pharmaceutical research and design

Biostatistics in bioinformatics and genetics

Formulating the scientific questions in life sciences and research

How to formulate scientific questions related to diabetes

How to formulate scientific questions related to cardiovascular disease

How to formulate scientific questions in biology

How computation can help answer different questions in life sciences

Biostatistics and Python

Answers for Chapter 1

Summary

2

Getting Started with Python for Biostatistics

Launching Jupyter Notebook and navigating its interfaces

Using Jupyter Notebook to write Python code and a brief introduction to programming

Launching the Spyder IDE and using its interfaces

Selecting and running code in Spyder

Installing packages in Python

Loading data in Python – how to load the Iris dataset

Exploring the Iris dataset

Exploring the data and the associated variable names

Summary

3

Exercise 1 – Cleaning and Describing Data Using Python

Technical requirements

Data types

Terms and metrics in EDA

Loading the Exercise 1 data using Python

Cleaning missing values and invalid data

Finding NaN values and invalid data types and addressing them

Identifying the wrong species name

Performing descriptive statistics analysis in Python

Continuous and discrete distributions

Visualizing the Iris data

Summary

4

Part 1 Exemplar Project – Load, Clean, and Describe Diabetes Data in Python

Loading and examining the Diabetes dataset

Validating and describing the Diabetes dataset

A more detailed grouping for descriptive statistics

Creating the data visualizations and table outputs

Exploring the HDL levels across different groups (N and Y classes)

Another type of visualization – Seaborn scatter plot

Data visualization using boxplots

Summary

Part 2: Introduction to Python for Biostatistics – Methodology and Examples

5

Introduction to Python for Biostatistics

Libraries for biostatistics hypothesis tests in Python

The underlying principles of p-values

Performing tests in Python

Libraries for predictive biostatistics in Python

Choosing which method to use for answering different scientific or research questions

Summary

6

Biostatistical Inference Using Hypothesis Tests and Effect Sizes

Technical requirements

Performing Student’s t-test in Python and interpreting the effect sizes

How does the t-test work?

Performing Wilcoxon signed-rank test in Python

Performing chi-squared tests in Python

Analyzing associations among multiple variables – correlations in Python

Analyzing multiple groups in Python – ANOVA and Kruskal–Wallis test

Summary

7

Predictive Biostatistics Using Python

Learning predictive biostatistics and their uses in different areas of life science

Dependent and independent variables

Linear regression for biostatistics in Python

Logistic regression in Python

Multiple linear and logistic regressions using Python

Summary

8

Part 2 Exercise – T-Test, ANOVA, and Linear and Logistic Regression

Implementing different versions of Student’s t-test

Applying post-hoc tests using ANOVA

Performing and visualizing linear regression in Python

Performing and visualizing logistic regression in Python

Summary

9

Biostatistical Inference and Predictive Analytics Using Cardiovascular Study Data

Technical requirements

The Cleveland dataset

Loading and examining the cardiovascular data in Python

Hypothesis tests applied to evaluate mean differences

The main research questions

Linear regression for cardiovascular predictive analysis

Using logistic regression to derive odds ratios for categorical variables

Summary

Part 3: Clinical Study Design, Analysis, and Synthesizing Evidence

10

Clinical Study Design

Understanding clinical studies and their relationship with biostatistics

Clinical study design and research questions

Learning about the principles of clinical trials

Reporting in clinical trials

Phase I clinical trials

Phase II clinical trials

Phase III clinical trials

Phase IV clinical trials

Calculating sample size for clinical studies

Defining the protocols for clinical studies

Summary

11

Survival Analysis in Biomedical Research

Understanding survival analysis and how it is used in biomedical research

Creating Kaplan-Meier curves in Python

Implementing Cox (proportional hazards) regression in Python

Summary

12

Meta-Analysis – Synthesizing Evidence from Multiple Studies

Understanding meta-analysis and synthesizing evidence from multiple studies

Meta-analysis method structure

Understanding random effects meta-analysis and fixed effects in meta-analysis

Meta-analysis estimators

Exploring and learning meta-regression and which packages to use for its implementation in Python

Learning how to interpret meta-analysis

Interpreting forest plots

How to interpret publication bias analysis

How to interpret the sensitivity analysis

Assessing the quality of the studies

Making final conclusions in a meta-analysis

Summary

13

Survival Predictive Analysis and Meta-Analysis Practice

Understanding survival and meta-analysis data

Meta-analysis and survival data

Implementing the DerSimonian and Laird inverse variance method and investigating heterogeneity in meta-analysis

Plotting the forest plots and funnel plots for meta-analysis

The subgroup analysis

Mastering meta-regression

Summary

14

Part 3 Exemplar Project – Meta-Analysis of Survival Data in Clinical Research

About the project and the dataset

Implementing DerSimonian and Laird inverse variance method in Python

Making forest plots for oncology meta-analysis

Making funnel plots – publication bias analysis

Implementing the Mantel-Haenszel estimator in a Meta-analysis

Summary

Part 4: Biological and Statistical Variables and Frameworks, and a Final Practical Project from the Field of Biology

15

Understanding Biological Variables

Understanding biological variables and experiments

Practical examples of defining biological variables and associating them with statistics

Confounders and latent variables in biology research

Validating biological data

Summary

16

Data Analysis Frameworks and Performance for Life Sciences Research

Creating biology study designs

Understanding the statistical frameworks

Learning the Frequentist framework statistics

Learning the Bayesian framework statistics

Choosing a Statistical framework

Latent variables and Causal inference

Counterfactual design

Randomized controlled trials

Sensitivity and how to interpret biological data analysis

Trustworthiness

Magnitude perspective of the results and Biological context

The overall scientific value of the result

Novelty value of a result

Summary

17

Part 4 Exercise – Performing Statistics for Biology Studies in Python

Understanding data dimensionality and resolving data complexity

Learning how to identify latent factors

Summary

Index

Other Books You May Enjoy

Preface

Unlike other books aimed at biostatisticians, this book is written for all those who wish to use the power of Python for biostatistics in their respective fields. Python has become one of the most widely used programming languages, adopted across the biotech industry, the medical and pharma research industries, and academia, largely because it integrates well with artificial intelligence and machine learning. For this reason, implementing biostatistics with Python is one of the most important fields of research today. This book capitalizes on Python’s strengths and scalability to augment and improve the researcher’s toolbox, helping anyone in the life sciences and biostatistics fields. Resources that serve this audience remain rare.

This book provides a comprehensive guide to combining Python programming with biostatistics for applications in life sciences, biotech, and AI-driven fields. It offers real-world projects and examples from oncology, cardiology, biology, and biotech, making learning practical and relevant. The book integrates the biological, data science, and statistical domains with coding, catering to both novices and experienced programmers. Python’s scalability and efficiency make it an invaluable tool for biotech, clinical, life sciences, and bioinformatics professionals, enabling the automation of data processing and analysis tasks while significantly reducing time and effort.

You will gain the modern programming skills necessary to perform complex statistical analyses and connect your Python expertise to cutting-edge fields such as artificial intelligence, machine learning, and digital product creation. The book equips you with domain-specific biostatistics knowledge tailored to life sciences and biotech, eliminating the need to learn additional programming languages. It also empowers AI developers, software engineers, and digital product creators to evaluate AI models, test results, and deliver insightful analyses within the Python ecosystem.

By bridging Python programming and biostatistics, Biostatistics with Python offers a well-structured approach to mastering essential statistical concepts, unlocking powerful applications across numerous scientific and technological domains.

Who this book is for

This book is designed for everyone in the fields of life sciences, biodata science, biotech, and Python programming. Here are the main audience groups that may be interested in this book:

Biologists with an interest in using Python capabilities: Biology researchers who require a robust statistical programming language and are looking to integrate biology, data science, and statistics to analyze experimental data with Python

Python programmers entering life sciences: Software developers, engineers, data scientists, and analysts who want to use Python for biostatistics, as well as academics and researchers in computational fields

Python-based data analysts interested in biostatistics: Analysts using Python who wish to specialize in biostatistics and life sciences

Doctors and medical researchers: Medical professionals involved in clinical research, cardiology, and oncology who need to perform complex analyses, study disease patterns, and evaluate treatment efficacy in Python

Data scientists in biotech: Individuals engaged in drug target discovery and drug development who utilize statistical methods to design clinical trials, analyze pharma data, and optimize biostatistics that could be integrated with machine learning and AI in the future

AI and machine learning specialists in life sciences: Professionals from the AI and machine learning sectors in life sciences research who use biostatistical approaches to evaluate the effectiveness of AI/machine learning products in Python

Bioinformaticians with an interest in biostatistics: Experts handling bioinformatics data who need biostatistical methods to interpret complex datasets and derive meaningful biological insights in Python

Computational biologists with an interest in biostatistics: Computational biologists who require Python proficiency in biostatistics to deal with complex datasets and use efficient, scalable, and reproducible methods for data analysis in Python

Hobbyists and enthusiasts: Anyone with a passion for Python programming and biology who is seeking to expand their knowledge and apply Python to biostatistical concepts and projects

What this book covers

Chapter 1, Introduction to Biostatistics, introduces the field of biostatistics and its use cases. You will learn why biostatistics is important for biomedicine, clinical trials, biology, biotechnology, and life sciences. You will also understand why it’s important to use computational programming languages such as Python to process biological and biomedical data and try to answer the research questions we may have.

This chapter lays the theoretical foundation needed to proceed with the use of biostatistics in life sciences fields and understand how Python can be used for biostatistical analysis within the biotech and life sciences research fields. You will need this understanding to proceed with the hands-on projects in the following chapters.

Chapter 2, Getting Started with Python for Biostatistics, helps you install Python and get started with tools such as the Spyder IDE and Jupyter Notebook. You will learn how to install Python and its IDEs using the open source Anaconda distribution. Finally, you will learn how to navigate the interfaces of these IDEs.

Chapter 3, Exercise 1 – Cleaning and Describing Data Using Python, will help you learn more about the basics of data science, including data types, how to load data in Python, and more. Practical exercises for loading the famous Iris dataset, cleaning data, and describing data are among the topics in this chapter. The chapter prepares you for the next exercise on diabetes data. It introduces the concept of Exploratory Data Analysis (EDA), which will be used in many other chapters of the book.

Chapter 4, Part 1 Exemplar Project – Load, Clean, and Describe Diabetes Data in Python, is where you will apply what you learned in Chapter 3. The dataset is the Pima Indians Diabetes dataset. EDA, cleaning, and data visualizations as output are the practical goals of this chapter. The chapter also covers theoretical aspects associated with the dataset, such as the theoretical foundation of diabetes mellitus biomarkers.

Chapter 5, Introduction to Python for Biostatistics, covers the libraries used for specific biostatistical methods and how those methods work. You will learn about the libraries for hypothesis tests, effect size analysis, predictive analysis, and more. Toward the end of the chapter, you will learn how to select specific hypothesis tests and biostatistical implementations for different research questions. The main goal of the chapter is to introduce you to the Python framework for biostatistical analysis.

Chapter 6, Biostatistical Inference Using Hypothesis Tests and Effect Sizes, is on biostatistical inference. It covers how to apply hypothesis tests such as Student’s t-test, the Wilcoxon test, and the chi-squared test, and how to find associations between variables using correlation analysis. It also explores how to analyze multiple groups in Python using ANOVA and the Kruskal–Wallis test.

Chapter 7, Predictive Biostatistics Using Python, looks at predictive biostatistics and its uses in different areas of biology, biomedicine, and other life sciences fields.

You will learn about different types of variables in relation to predictive analysis, such as dependent variables, independent variables, and latent variables. You will learn how to implement linear regression and logistic regression in Python. Finally, you will learn how to create and interpret multivariable regression models.

Chapter 8, Part 2 Exercise – T-Test, ANOVA, and Linear and Logistic Regression, is mostly about practical exercises in Python hypothesis testing and predictive analysis. You will learn how to implement Student’s t-test for comparing two groups and Analysis of Variance (ANOVA) to compare multiple groups in biological data. In this chapter, you will also learn how to practically define, create, and implement linear and logistic regression models using Python. At the end of each analysis, you will also learn how to create a publication-ready and intuitive data visualization using Python’s data visualization libraries.

Chapter 9, Biostatistical Inference and Predictive Analytics Using Cardiovascular Study Data, contains an exemplar project based on the Cleveland Heart Disease dataset. The main focus of this chapter is the practical implementation of biostatistical inference and predictive analytics in cardiology. The chapter includes both biological and statistical aspects of a practical cardiology project in the field of biostatistics. Hypothesis tests and linear and logistic regression are this time applied to a cardiovascular dataset, including cardiovascular disease modeling.

Chapter 10, Clinical Study Design, looks at study design, one of the most important aspects of any biostatistics project. The main topic of this chapter is understanding clinical studies from the design perspective. You will understand the principles of observational studies, including cohort and case-control studies, as well as the different designs of clinical trials. Furthermore, you will learn how to calculate sample size for study planning and design. Finally, you will learn how to define protocol documentation for clinical studies.

Chapter 11, Survival Analysis in Biomedical Research, will see you start by loading and understanding an oncology dataset (Veterans Oncology dataset) using scikit-learn. Then, you will learn how to use survival analysis and Kaplan-Meier (KM) curves to visualize and analyze survival in different groups of oncology patients. You will learn how to implement Cox Proportional Hazards regression models to perform survival analysis inference and identify the appropriate oncology survival models in the data.

Chapter 12, Meta-Analysis – Synthesizing Evidence from Multiple Studies, shows you how to synthesize evidence from multiple studies or analyses. This chapter lays the theoretical foundation to help you understand how to use meta-analysis to synthesize evidence from multiple studies and create overall estimates of treatment effects in biostatistics. You will learn the differences between random and fixed-effects meta-analysis models and when to use them. You will learn how to reason about and interpret forest and funnel plots, which are often the main focus of meta-analysis interpretation and data visualization.

Chapter 13, Survival Predictive Analysis and Meta-Analysis Practice, is about the practical implementation of meta-analysis code in Python. You will be using the PythonMeta package and the DerSimonian & Laird inverse variance method. You will learn about Overall Survival (OS), Progression-Free Survival (PFS), Disease-Free Survival (DFS), and Recurrence-Free Survival (RFS) metrics in oncology meta-analysis. Finally, the main outcome of the chapter is being able to practically implement meta-analysis and visualize and interpret results using Python.

Chapter 14, Part 3 Exemplar Project – Meta-Analysis of Survival Data in Clinical Research, starts with a Non-Small Cell Lung Cancer (NSCLC) dataset and Tyrosine Kinase Inhibitors (TKIs), a treatment used to target a specific molecule associated with this cancer. The project involves performing a real-world meta-analysis with data from real studies. This exemplar project is a simulation of a real-world oncology meta-analysis, all done using the powerful Python programming language.

Chapter 15, Understanding Biological Variables, looks at simplifying the complexity of biological systems by focusing on the observation and analysis of key variables. It explores latent variables and provides detailed guidance on selecting significant variables for biodata analysis. You will learn how to connect biological questions with observable variables, ensuring the meaningful interpretation of data. The chapter concludes with techniques for validating the biological relevance of data, reinforcing the connection between theory and practical application.

Chapter 16, Data Analysis Frameworks and Performance for Life Sciences Research, focuses on distinguishing between statistical data analysis frameworks. We discuss the frequentist and Bayesian frameworks, their differences, when to use them, and how to apply them to different research problems. You will also learn how to choose the correct statistical framework for your analysis. Finally, you will learn how to connect an experiment design with the statistical aspects of the analysis and perform in-depth interpretation of the results based on the statistical framework you choose.

Chapter 17, Part 4 Exercise – Performing Statistics for Biology Studies in Python, contains a state-of-the-art biology research exemplar project for you to use Python programming and advanced statistical approaches. You start with the mice proteomics dataset, in which we explore the biological aspects of neuroscience and proteins associated with different conditions. The approaches included are Principal Component Analysis (PCA), Random Forest (RF) for feature selection, and Structural Equation Modeling (SEM). By the end of the chapter, you will know how to create protein association SEM models and how to relate biological domain knowledge with latent variables to perform protein SEM analysis and statistically test biological pathways understood using the theoretical domains of molecular biology.

To get the most out of this book

To get the most out of this book, you don’t need any previous knowledge about biostatistics or Python. What’s most important is to install Python, the Spyder IDE, and Jupyter Notebook using Anaconda Navigator. Other ways of installation may also work but will probably be more complex.

Software/hardware covered in the book | Operating system requirements
Python 3.8 or above | Windows, macOS, or Linux
Spyder IDE | Windows, macOS, or Linux
Anaconda Distribution (contains all of the above) | Windows, macOS, or Linux
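Before starting the exercises, it can help to confirm that the installed interpreter meets the Python 3.8 minimum listed above. The following is a minimal sketch using only the standard library; the helper name python_is_supported and the constant MIN_PYTHON are hypothetical names introduced here for illustration and are not part of the book's code.

```python
import sys

# Minimum version taken from the requirements listed above.
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for the book's examples"
    print(f"Python {current}: {status}")
```

Running this in Spyder or in a Jupyter Notebook cell reports the version of the interpreter that the IDE is actually using, which may differ from other Python installations on the same computer.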

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Remember, anyone with an interest in biotech, biomedical research, or other life sciences areas, and in using Python for biostatistics, can learn the topics in this book.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Biostatistics-with-Python. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “We will be looking at CLASS=='Y' versus CLASS=='N'”.

A block of code is set as follows:

import pandas as pd

# Load the Iris dataset from a CSV file
data = pd.read_csv(r'C:\Users\MEDIN\Desktop\Iris.csv')

The # sign is used to comment code; such lines of code will not be run by Python. The paths in the book may be replaced by the paths on your computer. In this example, C:\Users\MEDIN\Desktop may be replaced by \Users\path\Desktop or another path where files of interest for the book may be located.
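A hard-coded path such as the one in the example stops working when the code is run on another computer. As a minimal sketch, assuming pandas is installed and the file is named Iris.csv as in the example, the path can instead be built with Python's standard pathlib module. The helper load_csv is a hypothetical name introduced here for illustration, and the Desktop location in the commented call is an assumption about where the file was saved.

```python
from pathlib import Path

import pandas as pd


def load_csv(folder, filename="Iris.csv"):
    """Read a CSV file from `folder` into a DataFrame, failing clearly if it is missing."""
    csv_path = Path(folder) / filename
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"No file found at {csv_path}; adjust the folder to match your computer."
        )
    return pd.read_csv(csv_path)


# Example call (assumption: Iris.csv sits in the current user's Desktop folder):
# data = load_csv(Path.home() / "Desktop")
```

Because Path.home() resolves to the current user's home folder on Windows, macOS, and Linux, the same call works without editing the user name in the path.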

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Biostatistics with Python, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-83763-096-7

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1: Introduction to Biostatistics and Getting Started with Python

In Part 1 of the book, you will be introduced to biostatistics and Python. This part is on learning about biostatistics use cases in biomedical, biotech, and pharma research. Furthermore, you will learn how to install Python using Anaconda and use it along with Spyder IDE and Jupyter Notebook. Finally, you will start using Python with a hands-on exemplar mini-project using diabetes data.

This part has the following chapters:

Chapter 1, Introduction to Biostatistics

Chapter 2, Getting Started with Python for Biostatistics

Chapter 3, Exercise 1 – Cleaning and Describing Data Using Python

Chapter 4, Part 1 Exemplar Project – Load, Clean, and Describe Diabetes Data in Python

1

Introduction to Biostatistics

Welcome to the world of biostatistics. This book will guide you through the principles of biostatistics with practical examples. You will work through a portfolio of exemplar projects with real-world data and learn how to use one of the most widely used programming languages for data analysis today: Python.

Biostatistics is one of the most important scientific disciplines today: it enables research, is the foundation of most life sciences, and is a growing factor in many industries, from pharmaceuticals to medicine and biology. This chapter explains why biostatistics is important for biomedicine, clinical trials, biology, and other life science areas.

In this chapter, we’re going to cover the following main topics:

Understanding the need for biostatistics in life sciences
Formulating the scientific questions in life sciences and research
How statistics and computation can help answer different questions in life sciences

At the end of this chapter, you will have a better understanding of the principles that make biostatistics the foundation of life science and what the advantages of using Python exemplar projects for biostatistics are.

Why do we need biostatistics in life sciences?

Life sciences are some of the most important fields of science today. Throughout the disciplines of biology, biomedicine, and pharmaceutical sciences in pharmaceutical and biotech companies, biostatistics plays a key role. We use it to analyze the data from experiments, improve study designs, interpret the results of studies, and make decisions within all these areas of life science. Biostatistics is applicable in all of these areas, and more, because it allows us to understand the underlying processes that you may be investigating.

While biostatisticians are essential in many areas, from biology and medicine to public health, understanding biostatistics is critical for other professionals in these areas, too.

If you are performing an experiment, conducting a study, or are interested in life science analytics, you will need to analyze the data to make conclusions or get insights from it. Biology and biomedical professionals will encounter biostatistics in most areas of their careers.

When reading almost any life science research publication, you will need to know how to read biostatistics to understand the results. This is essential for biologists and biomedical professionals who want to stay current with the latest research, and for professionals in the pharmaceutical industry who work on discovering biomarkers or therapies for patients.

Biostatistics enables us to understand and analyze the data or results we get from experiments, research, or observations. This is one of the reasons why the biostatistical field is important not only for biostatisticians, but also for doctors, biologists, epidemiologists, public health decision-makers, bioinformaticians, health data scientists, and other professionals from most life science branches.

The next subsection will help you understand the specific areas of life science where biostatistics is used. This is very important, as every life science area is different and requires a different approach to resolve its research problems.

Biostatistics in human life sciences

Biostatistics is essential in many human life sciences. Epidemiologists heavily rely on different types of data to infer their insights. Understanding statistical concepts is essential to understanding population-level biological events and helps both doctors and public health professionals in their work. One such example is the past SARS-CoV-2 pandemic. You must have heard about concepts such as reproductive number (R) or SARS-CoV-2 cumulative incidence of infections, mortality, lethality, excess deaths, and other similar terms. All these concepts are derived using biostatistical concepts and formulas.
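To make a few of these terms concrete, here is a minimal sketch of how mortality, lethality (often called the case fatality ratio), and excess deaths can be computed. All counts are hypothetical and chosen only to keep the arithmetic easy to follow; real surveillance data needs careful case definitions and time windows.

```python
# Hypothetical counts for one region and one time period (illustration only).
population = 1_000_000        # people living in the region
confirmed_cases = 50_000      # confirmed infections in the period
deaths_from_disease = 500     # deaths attributed to the disease
observed_deaths = 9_200       # deaths from all causes in the period
expected_deaths = 8_500       # all-cause deaths expected from previous years

# Mortality: disease deaths relative to the whole population.
mortality_per_100k = deaths_from_disease / population * 100_000

# Lethality (case fatality): disease deaths relative to confirmed cases only.
case_fatality = deaths_from_disease / confirmed_cases

# Excess deaths: observed all-cause deaths minus the expected baseline.
excess_deaths = observed_deaths - expected_deaths

print(f"Mortality: {mortality_per_100k:.1f} per 100,000")
print(f"Case fatality: {case_fatality:.1%}")
print(f"Excess deaths: {excess_deaths}")
```

Note that mortality and lethality use the same numerator but different denominators, which is why the two values can differ by orders of magnitude.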

Epidemiology is predominantly used in biomedical science areas by public health professionals to make decisions for disease response and to keep the population as safe as possible.

Figure 1.1 – Areas of human life science where biostatistics is used

Medical doctors need biostatistics not only for their everyday work but also for publishing their academic work, which generally uses statistics to summarize and analyze study data. They also need it to understand novel discoveries in their profession by interpreting the results of newly published studies.

Biomedical research is one of the areas that is heavily reliant on biostatistics, and knowing biostatistics is important not only for statisticians but also for biomedical researchers. Even with access to expert biostatisticians, it is helpful to understand biostatistical thinking and analysis methodology to help with discussions on study design, analysis, and interpretation.

Pharmaceuticals, research, and the development of medications are among the largest industries today. The majority of advanced research in these areas is vital for the biomedical industry.

Biostatistics for biology

While statistics itself is used in many different areas of science, its application in biology has evolved in a specific way due to the nature of different biological domains. Statistics cannot be effectively applied without knowing the basic principles of these biological disciplines.

The following figure shows biostatistics applications in different areas of biology:

Figure 1.2 – Areas of biology where biostatistics is used

Bioinformatics relies on different statistical methods and algorithms combined with computational tools to process and analyze large amounts of biological data, such as RNA (ribonucleic acid) sequencing data (transcriptomics), DNA (deoxyribonucleic acid) data (genomics), and many other data types. Bioinformatics is specifically focused on genetics and molecular biology but implements methods such as biostatistics and machine learning.

Ecological studies are an example of a field where biostatistics is one of the main drivers of research. Analyzing plant and animal populations, trends, dynamics, and relations between organisms and their environments would not be possible without biostatistics. Next, we will discuss biostatistics applications in different fields in more detail.

Biostatistics in epidemiology and public health

Epidemiologists and public health professionals answer some of the most important public health questions but also make decisions in different communities. They investigate diseases and events in smaller groups of people, cities, and countries, or even worldwide phenomena, such as pandemics. All this would not be possible without the use of biostatistics facilitating the process of analyzing the biomedical and population data.

Epidemiologists often create different statistical models to try to relate infectious outbreaks to causes and then prevent future infections and isolate the infection source. One such example is studying types of food ingested by infected individuals and identifying a potential bacterial or viral food source, or a location, such as a hotel or restaurant, as a source. Biostatistical models are often used in identifying the sources of infectious agents, which will be discussed in more detail in later chapters.
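As a simple illustration of the food-source idea, the sketch below compares attack rates (the proportion of people who became ill) between guests who ate a particular dish and guests who did not. The counts are hypothetical, and the risk ratio shown here is only one of several measures an epidemiologist might use.

```python
def attack_rate(ill, total):
    """Proportion of people in a group who became ill."""
    return ill / total

# Hypothetical outbreak at an event: guests split by whether they ate dish X.
ate_ill, ate_total = 30, 40            # ate the dish
not_ate_ill, not_ate_total = 6, 60     # did not eat the dish

ar_exposed = attack_rate(ate_ill, ate_total)
ar_unexposed = attack_rate(not_ate_ill, not_ate_total)

# Risk ratio: how many times more likely illness was among those who ate the dish.
risk_ratio = ar_exposed / ar_unexposed

print(f"Attack rate (ate dish): {ar_exposed:.0%}")
print(f"Attack rate (did not eat dish): {ar_unexposed:.0%}")
print(f"Risk ratio: {risk_ratio:.1f}")
```

A risk ratio well above 1 points to the dish as a candidate source, but it does not prove causation on its own; in practice it would be reported with a confidence interval and checked against other foods and locations.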

A few biostatistical concepts used in epidemiology and public health are as follows:

Prevalence
Cumulative incidence
Identifying causes for infectious outbreaks
Characteristics of microorganisms causing outbreaks in a population
Epidemiological monitoring of populations
Decision-making based on biostatistics
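The first two concepts in this list are easy to confuse, so here is a minimal sketch of the difference, assuming a hypothetical cohort followed for one year. Prevalence counts existing cases at a point in time, while cumulative incidence counts new cases among people who were disease-free at the start.

```python
# Hypothetical cohort followed for one year (illustration only).
population = 2_000        # people in the cohort at the start
existing_cases = 100      # people who already had the disease at the start
new_cases = 38            # people who developed the disease during follow-up

# Point prevalence: share of the whole cohort with the disease at the start.
prevalence = existing_cases / population

# Cumulative incidence: new cases among those at risk (disease-free at start).
at_risk = population - existing_cases
cumulative_incidence = new_cases / at_risk

print(f"Prevalence at start: {prevalence:.1%}")
print(f"One-year cumulative incidence: {cumulative_incidence:.1%}")
```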

Biostatistics in medicine and biomedical research

Medicine and biomedical research are very active sciences, as they directly or indirectly impact almost everyone’s life. These two disciplines rely heavily on the use of biostatistics, which is essential not only for medical doctors but also for biomedical researchers.

Medical doctors’ understanding of the probability of different diseases or outcomes is highly dependent on understanding the statistical concepts and how these apply to groups of patients.

Here are some of the most important concepts used in biomedical research:

Understanding the incidence of safety events for treatments
Making conclusions about the relations between symptoms and diseases
Creating biomedical studies
Analyzing biomedical data
Interpreting novel research
Becoming specialized in biomedical data analysis

Biostatistics in zoology and botany

A significant portion of the research in biological disciplines such as zoology and botany depends on quantifying different aspects of organisms: their behavior, their life cycles, their relations with environmental factors, and many others.

Some examples of areas in zoology and botany that apply biostatistical methods are as follows:

Animal behaviors
Plant growth
Relations between animals and their environment
Relations between plants and their environment
Biochemical composition of different tissues in animals
Biochemical composition of different tissues in plants
Identifying feeding patterns in animals

Biostatistics in ecology

Ecology is one of the life science disciplines significantly based on biostatistics. Understanding the population’s diversity and the relationships between organisms, as well as the relationships between organisms and their environments, is facilitated using different biostatistical methods.

Some important areas of the use of biostatistics in ecology are as follows:

Relationships between animals and their environment
Relationships between plants and their environment
Studying biochemical and molecular aspects in zoology and botany
Studying relations between humans, ecology, and environmental protection

Biostatistics in pharmaceutical research and design

The pharmaceutical industry is one of the main drivers of research and innovation today. Biostatistical analyses enable pharmaceutical companies to design and conduct studies and to make decisions based on the resulting insights. In fact, almost any high-quality research project in the pharmaceutical industry consults biostatisticians to make sure that the design is statistically sound, that it can answer the research questions needed to drive the development of assets, and that it conforms with regulatory requirements. Biostatistics is also key to analyzing adverse events from the data collected during a study, which is essential for any pharmaceutical product. All medications are required to have a list of adverse effects, and this is something that can be seen in everyday life. Calculating incidence rates is one of the ways to assess those adverse effects.
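As a rough illustration of an incidence rate for adverse events, the sketch below divides the number of events by the total follow-up time. The participant records are hypothetical, and real safety analyses also account for repeated events, censoring, and differences between treatment arms.

```python
# Hypothetical study records: (years of follow-up, number of adverse events).
participants = [
    (1.0, 0),
    (0.5, 1),
    (2.0, 0),
    (1.5, 2),
    (1.0, 0),
]

person_years = sum(years for years, _ in participants)
total_events = sum(events for _, events in participants)

# Incidence rate: events per unit of person-time at risk.
incidence_rate = total_events / person_years
rate_per_100_person_years = incidence_rate * 100

print(f"Total follow-up: {person_years} person-years")
print(f"Adverse events: {total_events}")
print(f"Incidence rate: {rate_per_100_person_years:.0f} per 100 person-years")
```

Using person-time in the denominator lets us compare participants who were followed for different lengths of time.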

Biostatistics is used to assess the efficacy of different therapies and, as such, is a key element in selecting the candidate drugs for diseases such as diabetes or cancer, which are then further evaluated in clinical trials using different biostatistical methods.

Calculating required sample sizes for pharmaceutical studies is a common task of biostatisticians within the pharmaceutical industry, but this is also intertwined with trial design and endpoint selection.
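To give a feel for what such a calculation involves, here is a minimal sketch for comparing the means of two equally sized groups, using the common normal-approximation formula n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 per group. The function name and the example values (a 0.5-unit difference with a standard deviation of 1.0) are assumptions made for illustration; dedicated software typically uses the t-distribution and gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    delta: smallest difference in means we want to detect
    sigma: assumed common standard deviation of the outcome
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return ceil(n)                                  # round up to whole participants

# Hypothetical endpoint: detect a 0.5-unit difference when the SD is 1.0.
n_per_group = sample_size_per_group(delta=0.5, sigma=1.0)
print(f"Required sample size: {n_per_group} per group")
```

Halving the detectable difference roughly quadruples the required sample size, which is one reason why sample size is so closely tied to endpoint selection.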

Here is a summary of the uses of biostatistics in pharmaceutical R&D:

Creating R&D studies
Evaluating drug safety
Selecting drug candidates through biostatistical screening
Designing clinical trials
Evaluating results
Research publications
Meta-analyses of therapy effects
Regulatory submission

Biostatistics in bioinformatics and genetics

Molecular biology is one of the biological branches that is very specific in terms of using statistical analyses. From structural biology to analyzing gene expression, biostatistics plays one of the most important roles in bioinformatics. Statistics also forms the basis of many areas of genetics, such as inheritance genetics and population genetics. Here are some of the areas of bioinformatics and genetics where biostatistics plays a pivotal role:

Differential gene expression
Structural biology
Mutation biology
DNA analytics
Mendelian inheritance
Mendelian randomization studies
Population genetics
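Taking Mendelian inheritance from this list as an example, a classic statistical check is whether observed offspring counts are compatible with an expected 3:1 ratio of dominant to recessive phenotypes. The sketch below uses a chi-square goodness-of-fit test with hypothetical counts. It relies only on the standard library, using the fact that, for one degree of freedom, the chi-square p-value equals erfc(sqrt(chi2 / 2)).

```python
from math import erfc, sqrt

# Hypothetical offspring counts from a monohybrid cross (illustration only).
observed = {"dominant": 290, "recessive": 110}
expected_ratio = {"dominant": 3, "recessive": 1}   # Mendelian 3:1 expectation

total = sum(observed.values())
ratio_sum = sum(expected_ratio.values())

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for phenotype, count in observed.items():
    expected = total * expected_ratio[phenotype] / ratio_sum
    chi2 += (count - expected) ** 2 / expected

# Two categories give 1 degree of freedom; this identity holds only for df = 1.
p_value = erfc(sqrt(chi2 / 2))

print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value: {p_value:.3f}")
```

A large p-value here means the counts do not contradict the 3:1 expectation; it does not prove that the trait follows Mendelian inheritance.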

Formulating the scientific questions in life sciences and research

To be able to perform statistical analyses in life science and research, you will first need to learn how to address scientific questions in these areas. Scientific questions are a way to define what we are trying to understand or what goal we want to achieve. In this section, you will learn by example how to formulate scientific questions in various fields where biostatistics is applied, such as biomedical research, before any statistical analysis is made. One of the first questions to answer is, “What is the goal of a statistical analysis?” This goal is closely related to different life science aspects, therapies, biological processes, or genetic characteristics, which are covered in more detail in this section.

Once scientific questions are formulated, they are used to build scientific hypotheses. The main characteristic of any hypothesis is that it can be tested and that there is an alternative (opposite) hypothesis to it. The baseline assumption, called the null hypothesis, is that there is no effect or relationship. We then test the alternate scenario, that there is an effect or relationship, against this baseline. We call the null hypothesis H0 and the alternate hypothesis Ha.

How to formulate scientific questions related to diabetes

The effect of different lifestyles on the outcomes of type 2 diabetes mellitus has been debated for decades.

Let’s pose a couple of scientific questions about diabetes. We will use the letter Q for scientific questions:

Q1. Is body weight related to type 2 diabetes mellitus?
Q2. Are there other risk factors for type 2 diabetes mellitus among those studied?
Q3. Which of the lifestyle factors is the most important risk factor in type 2 diabetes mellitus?

Now, let’s formulate these questions even better. We will mark formulations using the letter F:

F1. Null hypothesis (H0): Body weight is not related to type 2 diabetes mellitus.

Alternate hypothesis (Ha): Body weight is related to type 2 diabetes mellitus.

F2. Null hypothesis (H0): There are no other risk factors for type 2 diabetes mellitus among those studied.

Alternate hypothesis (Ha): There are other risk factors for type 2 diabetes mellitus among those studied.

F3. This question will not have a null hypothesis as it is already assumed there are risk factors in the questions. So, the goal of answering this question is to compare the risk factors and identify the most important one. This would be an observational scientific question.

So, why do we usually formulate the null hypothesis as a negation of what’s being tested? Well, we want to know the following: Can I show evidence that contradicts that baseline negative assumption? If I can, then I can reject the null hypothesis. If there isn’t enough evidence against the null hypothesis, I can only say that I cannot reject it. Avoid the mistake of treating a lack of evidence against the null hypothesis as evidence that the null hypothesis is true.
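The reject or fail-to-reject logic can be sketched with formulation F1. The example below is not the method used later in this book. It is an exact permutation test, chosen here because it needs only the standard library. The body weights (in kg) are hypothetical. The idea is that, if H0 is true, group labels are interchangeable, so we check how often a random relabeling produces a difference in means at least as large as the observed one.

```python
from itertools import combinations

# Hypothetical body weights in kg (illustration only).
diabetic = [88, 92, 95, 101, 85, 97, 90, 99]
non_diabetic = [72, 80, 78, 85, 70, 76, 82, 74]

def mean(values):
    return sum(values) / len(values)

observed_diff = abs(mean(diabetic) - mean(non_diabetic))

pooled = diabetic + non_diabetic
n_a = len(diabetic)
n_b = len(pooled) - n_a
total_sum = sum(pooled)

extreme = 0     # relabelings with a difference at least as large as observed
n_splits = 0    # all possible ways to relabel the pooled values
for indices in combinations(range(len(pooled)), n_a):
    sum_a = sum(pooled[i] for i in indices)
    sum_b = total_sum - sum_a
    diff = abs(sum_a / n_a - sum_b / n_b)
    if diff >= observed_diff - 1e-12:   # small tolerance for float comparison
        extreme += 1
    n_splits += 1

p_value = extreme / n_splits
alpha = 0.05    # conventional significance level, an assumption of this sketch
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"Observed difference in means: {observed_diff:.2f} kg")
print(f"p-value: {p_value:.5f}")
print(f"Decision: {decision}")
```

If the p-value had been above the chosen threshold, the conclusion would be "fail to reject H0", not "H0 is true".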

How to formulate scientific questions related to cardiovascular disease

Is ST-segment elevation (a raised ST segment on the electrocardiogram, the portion between the QRS complex and the T wave) closely related to heart disease? With this, we move to the following questions:

Q4. Do cigarettes increase the risk of cardiovascular diseases?
Q5. Is an ECG closely related to cardiovascular disease?
Q6. Are there any other risk factors for cardiovascular disease among the studied parameters?

Let us make a more structured formulation as follows:

F4. Null hypothesis (H0): Cigarettes do not increase the risk of cardiovascular diseases.

Alternate hypothesis (Ha): Cigarettes increase the risk of cardiovascular diseases.

F5. Null hypothesis (H0): ECG is not closely related to cardiovascular disease.

Alternate hypothesis (Ha): ECG is closely related to cardiovascular disease.

F6. Practice yourself!

How to formulate scientific questions in biology

Here are a few examples for formulating questions in biology:

Q7. Which genes are highly suppressed in lung cancer?
Q8. How similar are the genomes of mice and humans?
Q9. What are the differences in plants and minerals collected from localities A and B (Ca, Mg, K)?
Q10. Does water temperature affect plankton?

Practice formulating these questions as hypotheses or concrete study questions!

You may find the answers at the end of Chapter 1.

How computation can help answer different questions in life sciences

It is generally believed that biostatistics is mostly about numbers and graphs. The reality is quite different. Biostatistics is also about understanding life science problems and finding ways to resolve those using statistical methods. There are six main problem-solving skills in biostatistics:

Helping life science professionals resolve research problems in these domains through the use of data
Helping life science professionals interpret the results of their research
Making sure the published research is both statistically and biologically valid
Helping R&D professionals make decisions in the projects
Revealing objective truths about different phenomena through the use of data
Explaining the abstract features of mathematics and biology in an intuitive and easy-to-understand way

One of the most important impacts of biostatistics is transitioning from statistical knowledge to actual problem solutions in life sciences. This will be discussed in more detail in the rest of this chapter.

Biostatistics is needed to derive insights from life science experiments and convert measurements and observations to life science solutions.

Professionals in life science and biostatisticians, working together, design different types of experiments, measurements, and observations. All these can be written or stored as data. Data is a source of information from those experiments, measurements, and observations.

Data can originate from observations, too. One example of observation is the diagnosis by a dermatologist or the identification of species by biologists.

Biostatisticians are there to help make sure this data is valid and make it meaningful. Further, data should be organized and structured, often presented in the form of tables to be prepared for further analysis and interpretation.

To make the data useful, we must understand all the details about the data and how these are related to domains where biostatistics is applied. One of the most important aspects of biostatistics is the context around the data. This context can significantly affect the results and is one of the reasons why biostatisticians are more specialized in life science domains than general statisticians.

One of the main goals of biostatistics is to take all available inputs in the form of data and process them in such a way as to produce meaningful insights, answers, and conclusions and provide information to make decisions in life science.

Here is the biostatistics workflow:

Figure 1.3 – Biostatistics workflow

There are two main types of data: numerical and categorical. An example of numerical data is the measurement of the hemoglobin level in blood, recorded as numerical values in grams per liter (g/L). An example of categorical data is a doctor’s diagnosis of a patient recorded on a form: “Yes” for a positive diagnosis or “No” for a negative diagnosis. These types of data can be further divided into subcategories, which will be discussed in detail in the next chapters.
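The two data types call for different summaries, as the following minimal sketch shows. The patient records and field names are hypothetical. Numerical values are summarized with a mean and standard deviation, while categorical values are summarized with counts.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical patient records stored as a small table (illustration only).
records = [
    {"hemoglobin_g_per_l": 132, "diagnosis": "No"},
    {"hemoglobin_g_per_l": 118, "diagnosis": "Yes"},
    {"hemoglobin_g_per_l": 145, "diagnosis": "No"},
    {"hemoglobin_g_per_l": 101, "diagnosis": "Yes"},
    {"hemoglobin_g_per_l": 139, "diagnosis": "No"},
    {"hemoglobin_g_per_l": 124, "diagnosis": "No"},
]

# Numerical variable: summarize with mean and standard deviation.
hemoglobin = [r["hemoglobin_g_per_l"] for r in records]
mean_hb = mean(hemoglobin)
sd_hb = stdev(hemoglobin)

# Categorical variable: summarize with counts per category.
diagnosis_counts = Counter(r["diagnosis"] for r in records)

print(f"Hemoglobin: mean {mean_hb:.1f} g/L, SD {sd_hb:.1f} g/L")
print(f"Diagnosis counts: {dict(diagnosis_counts)}")
```

Averaging the "Yes" and "No" labels would make no sense, which is why identifying the data type is the first step before choosing any statistical method.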

Understanding data sources is essential for biostatistics. Biostatistics is focused on statistical models but also on domain knowledge and, as such, has evolved as a separate branch of both statistics and life sciences.

This book will provide many different examples that will show you how to use biostatistics specifically for different domains, such as diabetes research, cardiology, and biostatistical studies. Further, in this chapter, we will discuss how the Python programming language can facilitate the implementation of biostatistical methods.

Biostatistics and Python

Most biostatistical analyses today are implemented in some form of software or a programming language. I chose Python as the programming language for this book for several reasons. Python is one of the most widely used languages for data science and biostatistics, and it is one of the most sought-after skills in most areas that involve analytics. Libraries such as Biopython and SciPy are among the more than 100,000 libraries that make Python so versatile, meaning that almost any biostatistical analysis can be performed using this programming language. Python is also open source, meaning it is transparent and free for anyone to use.

The following figure is an example of using Python for biostatistics:

Figure 1.4 – Biostatistics and Python

Python’s integration with advanced machine learning and bioinformatics algorithms gives a biostatistician a whole new spectrum of approaches and provides some of the most advanced frameworks currently available for using biostatistical algorithms.

Finally, the most important part: learning Python through a portfolio of practical projects gives you, as a reader, two important benefits. First, being able to use one of the most sought-after programming languages can be beneficial for your career. Second, more than 10 practical projects using biostatistics and Python give you a significant portfolio if you plan to use biostatistics or advance your career with it.

Answers for Chapter 1

(A stands for answer)

A6. Null hypothesis (H0): There are no other risk factors for cardiovascular disease among the studied parameters.

Alternate hypothesis (Ha): Other risk factors are present among the studied parameters.

A7. This question would have no concrete hypothesis. Instead, the overall goal of the study is to identify the genes that are highly suppressed in lung cancer tissues.

A8. We can re-formulate this question into three potential options based on different levels of similarity:
The mouse-human genome similarity is low (0-50%).
The mouse-human genome similarity is medium (50-90%).
The mouse-human genome similarity is high (>90%).

A9. To answer this research question, we can formulate it as follows:

Are Ca, Mg, and K concentrations higher in locality A compared to locality B?

A10. Null hypothesis (H0): Water temperature does not affect plankton.

Alternate hypothesis (Ha): Water temperature affects plankton.