Basic Applied Bioinformatics - Chandra Sekhar Mukhopadhyay - E-Book

Basic Applied Bioinformatics E-Book

Chandra Sekhar Mukhopadhyay

Description

An accessible guide that introduces students in all areas of life sciences to bioinformatics

Basic Applied Bioinformatics provides practical guidance in bioinformatics and helps students optimize parameters for data analysis and then draw accurate conclusions from the results. In addition to parameter optimization, the text also familiarizes students with relevant terminology. Basic Applied Bioinformatics is written as an accessible guide for graduate students studying bioinformatics, biotechnology, and other related sub-disciplines of the life sciences.

This accessible text outlines the basics of bioinformatics, including downloading molecular sequences (nucleotide and protein) from databases; BLAST analyses; primer design and quality checking; multiple sequence alignment (global and local, using freely available software); phylogenetic tree construction (using the UPGMA, NJ, MP, ME and FM algorithms and the MEGA7 suite); prediction of protein structures; genome annotation; RNA-Seq data analyses and identification of differentially expressed genes; and similar advanced bioinformatics analyses. The authors, Chandra Sekhar Mukhopadhyay, Ratan Kumar Choudhary, and Mir Asif Iquebal, are noted experts in the field and have come together to provide updated information on bioinformatics.

Salient features of this book include:

  • Accessible and updated information on bioinformatics tools
  • A practical step-by-step approach to molecular-data analyses
  • Information pertinent to the study of a variety of disciplines, including biotechnology, zoology, bioinformatics and other related fields
  • Worked examples, glossary terms, problems and solutions

Basic Applied Bioinformatics gives students studying bioinformatics, agricultural biotechnology, animal biotechnology, medical biotechnology, microbial biotechnology, and zoology an updated introduction to the growing field of bioinformatics.


Number of pages: 580

Year of publication: 2017


Table of Contents

Cover

Title Page

Preface

Acknowledgments

List of Abbreviations

SECTION I: Molecular Sequences and Structures

CHAPTER 1: Retrieval of Sequence(s) from the NCBI Nucleotide Database

1.1 INTRODUCTION

1.2 COMPONENTS OF THE NCBI NUCLEOTIDE DATABASE

1.3 OBJECTIVES

1.4 PROCEDURE

1.5 SOME USEFUL NUCLEOTIDE SEQUENCE DATABASES OF NCBI

1.6 QUESTIONS

CHAPTER 2: Retrieval of Protein Sequence from UniProtKB

2.1 INTRODUCTION

2.2 OBJECTIVE

2.3 PROCEDURE

2.4 QUESTIONS

CHAPTER 3: Downloading Protein Structure

3.1 INTRODUCTION

3.2 OBJECTIVE

3.3 PROCEDURE

3.4 QUESTIONS

CHAPTER 4: Visualizing Protein Structure

4.1 INTRODUCTION

4.2 OBJECTIVE

4.3 PROCEDURE

4.4 QUESTIONS

CHAPTER 5: Sequence Format Conversion

5.1 INTRODUCTION

5.2 OBJECTIVE

5.3 PROCEDURE

5.4 QUESTIONS

5.5 BRIEF DESCRIPTION OF SOME OF THE IMPORTANT MOLECULAR SEQUENCE FORMATS

CHAPTER 6: Nucleotide Sequence Analysis Using Sequence Manipulation Suite (SMS)

6.1 INTRODUCTION

6.2 OBJECTIVE

6.3 PROCEDURE

6.4 FORMAT CONVERSION

6.5 SEQUENCE ANALYSIS

6.6 SEQUENCE FIGURES

6.7 RANDOM SEQUENCES

6.8 MISCELLANEOUS

6.9 QUESTIONS

CHAPTER 7: Detection of Restriction Enzyme Sites

7.1 INTRODUCTION

7.2 OBJECTIVE

7.3 PROCEDURE (USING NEBCUTTER)

7.4 QUESTIONS

SECTION II: Sequence Alignment

CHAPTER 8: Dot Plot Analysis

8.1 INTRODUCTION

8.2 OBJECTIVE

8.3 PROCEDURE

8.4 PARAMETERS OF DOT PLOT ANALYSIS

8.5 INTERPRETATION

8.6 QUESTIONS

CHAPTER 9: Needleman–Wunsch Algorithm (Global Alignment)

9.1 INTRODUCTION

9.2 OBJECTIVE

9.3 PROCEDURE

9.4 QUESTIONS

CHAPTER 10: Smith–Waterman Algorithm (Local Alignment)

10.1 INTRODUCTION

10.2 OBJECTIVE

10.3 PROCEDURE

10.4 QUESTIONS

CHAPTER 11: Sequence Alignment Using Online Tools

11.1 INTRODUCTION

11.2 OBJECTIVE

11.3 PROCEDURE

11.4 INTERPRETATION OF RESULTS

11.5 COLOR SCHEME FOR AMINO ACID RESIDUES

11.6 QUESTIONS

SECTION III: Basic Local Alignment Search Tools

CHAPTER 12: Basic Local Alignment Search Tool for Nucleotide (BLASTn)

12.1 INTRODUCTION

12.2 OBJECTIVE

12.3 PROCEDURE

12.4 QUESTIONS

CHAPTER 13: Basic Local Alignment Search Tool for Amino Acid Sequences (BLASTp)

13.1 INTRODUCTION

13.2 OBJECTIVE

13.3 PROCEDURE

13.4 QUESTIONS

CHAPTER 14: BLASTx

14.1 INTRODUCTION

14.2 OBJECTIVE

14.3 PROCEDURE

14.4 INTERPRETATION OF BLASTx RESULTS

14.5 QUESTIONS

CHAPTER 15: tBLASTn

15.1 INTRODUCTION

15.2 OBJECTIVE

15.3 PROCEDURE

15.4 ALGORITHM PARAMETERS

15.5 INTERPRETATION OF tBLASTn RESULTS

15.6 QUESTIONS

CHAPTER 16: tBLASTx

16.1 INTRODUCTION

16.2 OBJECTIVE

16.3 PROCEDURE

16.4 ALGORITHM PARAMETERS

16.5 INTERPRETATION OF tBLASTx RESULTS

16.6 QUESTIONS

SECTION IV: Primer Designing and Quality Checking

CHAPTER 17: Primer Designing – Basics

17.1 INTRODUCTION

17.2 OTHER IMPORTANT FEATURES FOR DESIGNING “GOOD” PRIMERS

17.3 QUESTIONS

CHAPTER 18: Designing PCR Primers Using the Primer3 Online Tool

18.1 INTRODUCTION

18.2 OBJECTIVE

18.3 PROCEDURE

18.4 OUTPUT

18.5 SELECTION OF THE BEST PRIMER‐PAIRS BY COMPARATIVE EVALUATION OF THE DESIGNED PRIMERS

18.6 QUESTIONS

CHAPTER 19: Quality Checking of the Designed Primers

19.1 INTRODUCTION

19.2 OBJECTIVE

19.3 PROCEDURE

19.4 IDT UNAFOLD – CHECKING THE SECONDARY STRUCTURE FORMATION OF THE AMPLICON

19.5 PRIMER‐BLAST – TO DETECT POSSIBLE SPURIOUS AMPLIFICATION

19.6 QUESTIONS

CHAPTER 20: Primer Designing for SYBR Green Chemistry of qPCR

20.1 INTRODUCTION

20.2 QUESTIONS

SECTION V: Molecular Phylogenetics

CHAPTER 21: Construction of Phylogenetic Tree: Unweighted‐Pair Group Method with Arithmetic Mean (UPGMA)

21.1 INTRODUCTION

21.2 ASSUMPTIONS

21.3 OBJECTIVE

21.4 PROCEDURE

21.5 INTERPRETATION OF UPGMA TREE

21.6 QUESTIONS

CHAPTER 22: Construction of Phylogenetic Tree: Fitch Margoliash (FM) Algorithm

22.1 INTRODUCTION

22.2 OBJECTIVE

22.3 PROCEDURE

22.4 INTERPRETATION OF THE FM TREE

22.5 QUESTIONS

CHAPTER 23: Construction of Phylogenetic Tree: Neighbor‐Joining Method

23.1 INTRODUCTION

23.2 OBJECTIVE

23.3 PROCEDURE

23.4 INTERPRETATION OF NJ TREE

23.5 QUESTIONS

CHAPTER 24: Construction of Phylogenetic Tree: Maximum Parsimony Method

24.1 INTRODUCTION

24.2 OBJECTIVE

24.3 PROCEDURE

24.4 INTERPRETATION OF MP TREE

24.5 QUESTIONS

CHAPTER 25: Construction of Phylogenetic Tree: Minimum Evolution Method

25.1 INTRODUCTION

25.2 OBJECTIVE

25.3 PROCEDURE

25.4 INTERPRETATION OF THE ME TREE

25.5 QUESTIONS

CHAPTER 26: Construction of Phylogenetic Tree Using MEGA7

26.1 INTRODUCTION

26.2 OBJECTIVE

26.3 PROCEDURE

26.4 INTERPRETATION OF PHYLOGENETIC TREE

26.5 QUESTIONS

CHAPTER 27: Interpretation of Phylogenetic Trees

27.1 INTRODUCTION

27.2 UNDERSTANDING PHYLOGENETIC TREES

27.3 REPRESENTATION OF PHYLOGENETIC TREES

27.4 METHODS FOR CONSTRUCTING EVOLUTIONARY TREES FROM INFERENCES

27.5 INFERRING PHYLOGENETIC TREES

27.6 QUESTIONS

SECTION VI: Protein Structure Prediction

CHAPTER 28: Prediction of Secondary Structure of Protein

28.1 INTRODUCTION

28.2 OBJECTIVE

28.3 SECONDARY STRUCTURE PREDICTION USING ONLINE TOOL PSIPRED

28.4 SECONDARY STRUCTURE PREDICTION USING THE ONLINE CDM TOOL

28.5 QUESTIONS

CHAPTER 29: Prediction of Tertiary Structure of Protein: Sequence Homology

29.1 INTRODUCTION

29.2 OBJECTIVE

29.3 PROCEDURE (SWISS‐MODEL PROGRAM)

29.4 OUTPUT

29.5 VISUALIZING THE PREDICTED STRUCTURE

29.6 INTERPRETATION OF RESULTS

29.7 QUESTIONS

CHAPTER 30: Protein Structure Prediction Using Threading Method

30.1 INTRODUCTION

30.2 OBJECTIVE

30.3 PROCEDURE

30.4 RESULTS AND INTERPRETATION

30.5 QUESTIONS

CHAPTER 31: Prediction of Tertiary Structure of Protein: Ab Initio Approach

31.1 INTRODUCTION

31.2 OBJECTIVE

31.3 PROCEDURE (RAPTORX)

31.4 JOB STATUS

31.5 OUTPUT AND INTERPRETATION OF RESULTS

31.6 QUESTIONS

CHAPTER 32: Validation of Predicted Tertiary Structure of Protein

32.1 INTRODUCTION

32.2 OBJECTIVE

32.3 PROCEDURE (WHAT IF TOOL FOR VALIDATING THE 3D STRUCTURE PREDICTION RESULTS)

32.4 INTERPRETATION OF RESULTS OF WHAT IF

32.5 MOLPROBITY TOOL FOR RAMACHANDRAN PLOT

32.6 INTERPRETATION OF RAMACHANDRAN PLOT ANALYSIS

32.7 QUESTIONS

SECTION VII: Molecular Docking and Binding Site Prediction

CHAPTER 33: Prediction of Transcription Binding Sites

33.1 INTRODUCTION

33.2 OBJECTIVE

33.3 TRANSFAC

33.4 BINDING SITES SEARCHING USING THE MATCH TOOL

33.5 QUESTIONS

CHAPTER 34: Prediction of Translation Initiation Sites

34.1 INTRODUCTION

34.2 OBJECTIVE

34.3 PROCEDURE

34.4 QUESTIONS

CHAPTER 35: Molecular Docking

35.1 INTRODUCTION

35.2 OBJECTIVE

35.3 PROCEDURE

35.4 RESULT AND INTERPRETATION

35.5 QUESTIONS

SECTION VIII: Genome Annotation

CHAPTER 36: Genome Annotation in Prokaryotes

36.1 INTRODUCTION

36.2 OBJECTIVE

36.3 PROCEDURE

36.4 INTERPRETATION OF GENEMARK OUTPUT

36.5 QUESTIONS

CHAPTER 37: Genome Annotation in Eukaryotes

37.1 INTRODUCTION

37.2 OBJECTIVE

37.3 PROCEDURE

37.4 INTERPRETATION OF GENSCAN OUTPUT

37.5 QUESTIONS

SECTION IX: Advanced Biocomputational Analyses

CHAPTER 38: Concepts of Real‐Time PCR Data Analysis

38.1 INTRODUCTION

38.2 GETTING STARTED WITH RT‐qPCR

38.3 PCR FLUORESCENCE CHEMISTRY

38.4 RT‐qPCR DATA ANALYSIS: GENE EXPRESSION ANALYSIS

38.5 QUESTIONS

CHAPTER 39: Overview of Microarray Data Analysis

39.1 CONCEPT

39.2 GETTING STARTED WITH MICROARRAY

39.3 MICROARRAY DATA ANALYSIS: GENE EXPRESSION ANALYSIS

39.4 STEPS INVOLVED IN MICROARRAY DATA ANALYSIS

39.5 FUNCTIONAL INFORMATION USING GENE NETWORKS AND PATHWAYS

39.6 LIVESTOCK RESEARCH THAT INVOLVED MICROARRAY ANALYSIS (SOME EXAMPLES)

39.7 APPLICATIONS OF MICROARRAY

39.8 QUESTIONS

CHAPTER 40: Single Nucleotide Polymorphism (SNP) Mining Tools

40.1 INTRODUCTION

40.2 OBJECTIVE

40.3 PROCEDURE

40.4 INTERPRETATION OF RESULTS

40.5 QUESTIONS

CHAPTER 41: In Silico Mining of Simple Sequence Repeats (SSR) Markers

41.1 INTRODUCTION

41.2 OBJECTIVE

41.3 MISA (MICROSATELLITE IDENTIFICATION TOOL)

41.4 RESULT

41.5 QUESTIONS

CHAPTER 42: Basics of RNA‐Seq Data Analysis

42.1 INTRODUCTION

42.2 AIM OF AN RNA‐SEQ EXPERIMENT

42.3 FAST SEQUENCE ALIGNMENT STRATEGIES

42.4 QUESTIONS

CHAPTER 43: Functional Annotation of Common Differentially Expressed Genes

43.1 INTRODUCTION

43.2 FUNCTIONAL ANNOTATION

43.3 QUESTIONS

CHAPTER 44: Identification of Differentially Expressed Genes (DEGs)

44.1 SECTION I. QUALITY FILTERING OF DATA USING PRINSEQ

44.2 SECTION II. IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES – I (USING CUFFLINKS)

44.3 SECTION III. IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES – II (USING RSEM‐DE PACKAGES EBSEQ, DESEQ2 AND EDGER)

44.4 USE OF DE PACKAGES FOR IDENTIFYING THE DIFFERENTIALLY EXPRESSED GENES

44.5 QUESTIONS

CHAPTER 45: Estimating MicroRNA Expression Using the miRDeep2 Tool

45.1 INTRODUCTION

45.2 PREPROCESSING OF READS

45.3 INPUT FORMATS OF THE DATA FILE

45.4 OUTPUT FORMATS THAT CAN BE GENERATED

45.5 PRELIMINARY FILES USED IN THE EXAMPLE

45.6 QUESTIONS

CHAPTER 46: miRNA Target Prediction

46.1 INTRODUCTION

46.2 miRNA TARGET PREDICTION BY TARGETSCAN (http://targetscan.org/)

46.3 miRNA TARGET PREDICTION BY TARGETSCAN IN HUMAN

46.4 miRNA TARGET PREDICTION BY psRNATARGET (http://plantgrn.noble.org/psRNATarget/)

46.5 miRNA TARGET PREDICTION BY miRANDA (http://www.microrna.org)

46.6 QUESTIONS

Appendix A: Usage of Internet for Bioinformatics

Appendix B: Important Web Resources for Bioinformatics Databases and Tools

INTRODUCTION

Appendix C: NCBI Database: A Brief Account

Appendix D: EMBL Databases and Tools: An Overview

INTRODUCTION

THE EMBL DATABASES

THE EMBL TOOLS

Appendix E: Basics of Molecular Phylogeny

GEOLOGICAL CLOCK

MORPHOLOGICAL PHYLOGENY TO MOLECULAR PHYLOGENY

BASIS OF MOLECULAR PHYLOGENY

MUTATION RATE

COMPONENTS OF A PHYLOGENETIC TREE

TYPES OF PHYLOGENETIC TREES

Appendix F: Evolutionary Models of Molecular Phylogeny

INTRODUCTION

Glossary

References

Webliography

Index

End User License Agreement

List of Tables

Chapter 04

TABLE 4.1 Computer short‐cuts to work on the image displayed by RasMol (http://www.openrasmol.org/doc/).

Chapter 07

TABLE 7.1 Meaning of different terminologies used in NEBCutter (Vince et al., 2003).

Chapter 09

TABLE 9.1

TABLE 9.2

Chapter 10

TABLE 10.1 Similarities and differences between NW and SW algorithms.

Chapter 12

TABLE 12.1 Overview of various types of BLAST algorithms available at the National Center for Biotechnology Information (NCBI) website, with their applications.

TABLE 12.2 Optional BLASTn parameters. Numbered arrows refer to the serial number (SN) of discussion in Table 12.3.

TABLE 12.3 Databases against which a query can be searched in BLASTn (http://www.ncbi.nlm.nih.gov/books/NBK153387/).

Chapter 13

TABLE 13.1 Algorithm parameters of BLASTp: Numbered arrows in Figure 13.2 refer to the serial number (SN) of discussion in this table.

Chapter 17

TABLE 17.1 Important parameters to be considered for designing “good” primers (http://www.premierbiosoft.com/tech_notes/PCR_Primer_Design.html).

TABLE 17.2 The acceptable values of Gibb’s free energy for various secondary structures of primer (http://ls23l.lscore.ucla.edu/Primer3/primer3web_help.htm).

Chapter 18

TABLE 18.1 Primer3 parameters, description and their optimal values/options (http://ls23l.lscore.ucla.edu/Primer3/primer3web_help.htm).

TABLE 18.2 Important parameters based on which primer is selected.

TABLE 18.3

Chapter 19

TABLE 19.1

TABLE 19.2

Chapter 20

TABLE 20.1 Optimal and permissible ranges of parameters of qPCR primers (SYBR green chemistry).

Chapter 21

TABLE 21.1

TABLE 21.2

TABLE 21.3

TABLE 21.4

TABLE 21.5

Chapter 22

TABLE 22.1

TABLE 22.2

TABLE 22.3

TABLE 22.4

TABLE 22.5

TABLE 22.6

Chapter 23

TABLE 23.1

TABLE 23.2

TABLE 23.3

TABLE 23.4

TABLE 23.5

TABLE 23.6

TABLE 23.7

TABLE 23.8

TABLE 23.9

TABLE 23.10

TABLE 23.11

TABLE 23.12

TABLE 23.13

TABLE 23.14

TABLE 23.15

TABLE 23.16

TABLE 23.17

TABLE 23.18

TABLE 23.19

TABLE 23.20

TABLE 23.21

TABLE 23.22

TABLE 23.23

TABLE 23.24

Chapter 24

TABLE 24.1

TABLE 24.2

TABLE 24.3

TABLE 24.4

TABLE 24.5

Chapter 25

TABLE 25.1

TABLE 25.2

Chapter 27

TABLE 27.1

TABLE 27.2 Comparison between the features of the trees generated from the following important phylogenetic algorithms (Desper and Gascuel, 2005).

TABLE 27.3 Pairwise distances (calculated by maximum composite likelihood model, using MEGA7) between the input sequences are shown in the lower triangular matrix.

Chapter 42

TABLE 42.1 Example and purpose of short read aligners.

TABLE 42.2 Example and purpose of long read aligners.

TABLE 42.3 Information about total reads of samples 1, 2, and 3, and values obtained by dividing the reads for each gene in each sample with the corresponding total reads.

TABLE 42.4 Calculation of RPKM by dividing the reads obtained after step 1 for each gene with gene length.

TABLE 42.5 Total reads of samples 1, 2, 3, and 4.

TABLE 42.6 Total reads per kb (RPK) of gene for sample 1, 2, and 3 (millions of reads equated to a scale of tens of reads).

TABLE 42.7 Calculation of TPM by dividing the total reads obtained in step 1 sample, with total reads per kb of gene.

TABLE 42.8 Comparison of reads of RPKM.

TABLE 42.9 Comparison of reads of TPM.
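
The normalisation steps named in the Chapter 42 table captions (reads scaled by total reads and by gene length for RPKM; reads per kilobase scaled by their sum for TPM) can be sketched as follows. This is a minimal illustration using the standard definitions of RPKM and TPM, not the book's own worked tables; the gene names, read counts and gene lengths are hypothetical.

```python
# Minimal sketch of RPKM and TPM using their standard definitions.
# All gene names, counts and lengths below are hypothetical examples.

def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    per_million = sum(counts.values()) / 1e6       # total reads in millions
    return {
        gene: counts[gene] / per_million / (lengths_bp[gene] / 1e3)
        for gene in counts
    }

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale."""
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1e3) for gene in counts}
    scale = sum(rpk.values()) / 1e6                # total RPK in millions
    return {gene: value / scale for gene, value in rpk.items()}

example_counts = {"geneA": 10, "geneB": 20, "geneC": 30}         # hypothetical
example_lengths = {"geneA": 2000, "geneB": 4000, "geneC": 1000}  # in bp

example_rpkm = rpkm(example_counts, example_lengths)
example_tpm = tpm(example_counts, example_lengths)
```

Because TPM divides by the sum of length-normalised values, the TPM values of one sample always add up to one million, which is what makes them easier to compare across samples than RPKM.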

Appendix D

TABLE 1 Features and links of various EMBL databases.

TABLE 2 Description of various EMBL tools.

List of Illustrations

Chapter 01

FIGURE 1.1 Main search window of NCBI Nucleotide page and list of hits for nucleotide sequences of taurine Drosha (gene/mRNA).

FIGURE 1.2 Click on the “Send to” button to download and save (in a text file) the first three Drosha mRNA sequences in “Summary” format.

Chapter 02

FIGURE 2.1 Homepage of ExPASy server: select the “proteomics” option from the drop‐down menu for databases, and enter your protein name along with other keywords to begin search.

FIGURE 2.2 Click on the specific entry to open it in a separate window.

FIGURE 2.3 Peptide sequence of taurine SRY in FASTA format.

Chapter 03

FIGURE 3.1 Homepage of RCSB‐PDB. Specify the name of the protein and the species in the given box, and click on the search button (denoted by the symbol of a lens).

FIGURE 3.2 Visualization of 3D peptide structure obtained following PDB search.

Chapter 04

FIGURE 4.1 Graphical user interface (GUI) of RasMol and the drop‐down menu to open, modify or alter the display of the peptide.

FIGURE 4.2 A single peptide, displayed in ‘Wireframe’, ‘Backbone’, ‘Sticks’, ‘Spacefill’, ‘Ball and Stick’, ‘Ribbons’, ‘Strands’, ‘Cartoons’ and ‘Molecular surface’ patterns.

Chapter 05

FIGURE 5.1 Homepage of the ReadSeq biosequence format conversion tool.

FIGURE 5.2 Three sequence formats – namely, FASTA, Phylip and Clustal.

Chapter 06

FIGURE 6.1 “Combine FASTA” input page to provide input data, and the corresponding output page with the result.

FIGURE 6.2 “EMBL Trans Extractor” input page, and the corresponding output page with extracted results.

FIGURE 6.3 “Filter DNA” input page, along with various options as control parameters.

FIGURE 6.4 “Range Extractor Protein” input page and the corresponding output page with extracted sequences.

FIGURE 6.5 “Reverse Complement” input page and the corresponding output pages for “Complement”, “Reverse” and “Reverse Complement”, respectively (from left to right), of the input sequences.

FIGURE 6.6 “Protein Isoelectric Point” input page and the corresponding output page with results, with respect to the parameters.

Chapter 07

FIGURE 7.1 A short nucleotide sequence (oligo) can be searched in the input sequence for determining specific RE sites present in the oligos.

FIGURE 7.2 More options enable the user to make stringent selection of RE sites.

FIGURE 7.3 Result output window of NEBCutter. Details are discussed in the text under the sub‐heading “Inferring the output”.

Chapter 08

FIGURE 8.1 Depiction of plotting the straight line based on the runs of dots obtained from matches between residues along the X‐ and Y‐axes. Insertion in any of the sequences will distort the run of the straight line.

FIGURE 8.2 Interpretation of dot plot based on the same repeat sequence (shown above) which has been placed along both axes. The four different colors (yellow, green, blue and gray) have been shown to indicate the 1st, 2nd, 3rd and 4th repeat of “TACGGCTACAGTCACG”.

Chapter 09

FIGURE 9.1 Three types of movement along the matrix in dynamic programming.

FIGURE 9.2 Increment in the respective indexes of the cells (denoting row and column numbers, respectively) of the matrix, to indicate the movement along the cells.

FIGURE 9.3 Each cell is assigned three scores obtained from three possible movements – namely, horizontal, diagonal and vertical. The arrows indicate back‐tracing based on the highest score out of the three scores.

FIGURE 9.4 Trace‐back starts from the bottom right cell towards the top left cell, according to the highest score(s) obtained in the previous step. There could be more than one path at a point (i.e., cell), if that cell has been awarded more than one highest score, due to two or three movements in the previous step.

FIGURE 9.5 Global alignment (by NWA) has yielded seven equally good (same alignment score of 4) alignments.
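
The matrix filling and trace‐back described in the Chapter 09 captions can be sketched as below. This is a minimal illustration, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) that is not taken from the book; it returns only one of the possibly several equally good alignments.

```python
# Minimal Needleman-Wunsch sketch. The scoring values are assumptions,
# and only one optimal alignment is traced back.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                 # first column: gaps only
        score[i][0] = i * gap
    for j in range(1, cols):                 # first row: gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # diagonal move
                score[i - 1][j] + gap,       # vertical move
                score[i][j - 1] + gap,       # horizontal move
            )
    # Trace back from the bottom right cell to the top left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
```

When two or three moves give the same highest score in a cell, the trace‐back here always prefers the diagonal move; following every tie instead would enumerate all equally good alignments.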

Chapter 10

FIGURE 10.1 The scores in each cell are obtained from the movements from three directions – namely, horizontal, diagonal and vertical. The arrows indicate back‐tracing based on the highest score out of the three scores.

FIGURE 10.2 Trace‐back step: starting with the highest score in the matrix, moving towards the top left and stopping at the last positive score.
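
The local trace‐back described in the Chapter 10 captions (start at the highest score in the matrix and stop when the score is no longer positive) can be sketched as below. This is a minimal illustration, assuming a scoring scheme (match +2, mismatch -1, gap -1) and example sequences that are not taken from the book.

```python
# Minimal Smith-Waterman sketch. Scoring values and sequences are assumptions.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # negative scores are reset to zero
                score[i - 1][j - 1] + sub,   # diagonal move
                score[i - 1][j] + gap,       # vertical move
                score[i][j - 1] + gap,       # horizontal move
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell while the score stays positive.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

sw_score, sw_a, sw_b = smith_waterman("TTTACGTTT", "GGACGGG")
```

The only differences from the global version are the zero floor in each cell and the start and stop points of the trace‐back, which is why the result covers just the shared "ACG" stretch rather than the full sequences.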

Chapter 11

FIGURE 11.1 The output of multiple sequence alignment using Clustal Omega is obtained in different tabs – “Alignments”, “Result Summary”, “Submission Details”. Jalview is the Java alignment viewer that displays the alignment, along with the consensus sequence.

Chapter 12

FIGURE 12.1 Main page for BLASTn search at NCBI. The sequence can be entered into the box as query sequences with either accession number or sequence in FASTA format. The gene identity number (i.e., the gi mentioned in this figure) is not currently used as sequence identifier in the NCBI nucleotide database.

FIGURE 12.2 Optional BLASTn parameters. Numbered arrows refer to the serial number of discussion in Table 12.3.

FIGURE 12.3 The result page of BLASTn contains the color key‐based alignment display, followed by a tabular description of sequence alignments and, finally, alignments of each of the sequence pairs (query vs. database sequence).

FIGURE 12.4

FIGURE 12.5

Chapter 13

FIGURE 13.1 Setting the parameters for BLASTp search at NCBI. The sequence(s) can be entered into the box as query sequence(s), with either NCBI Protein accession number or sequence(s) in FASTA format.

FIGURE 13.2 Optional BLASTp parameters. The numbered arrows refer to the serial number of discussion in Table 13.1.

FIGURE 13.3 Different sections of the result page of BLASTp. “A” indicates the putative conserved domain(s) detected by BLASTp search. Clicking on this image will open the graphical summary of the conserved domain(s) of that protein. “B” indicates the alignment and the scores in terms of color key, for each of the alignments. “C” indicates the table of alignment detail (Description, Max score, Total score, Query coverage, E‐value, Identity, and Accession). “D” shows the detail of the alignment residue‐wise.

FIGURE 13.4 Results of PSI‐BLAST. ‘E’ indicates the “Select for PSI blast” column, and “F” indicates the detailed result for each alignment.

FIGURE 13.5 Result of PHI‐BLAST. ‘G’ indicates the detailed result of each alignment. The asterisks in the second row of alignment indicate the pattern which has been given for PHI‐BLAST analysis.

FIGURE 13.6 The result page of DELTA‐BLAST. The components and parameters are similar to PSI‐BLAST.

Chapter 14

FIGURE 14.1 Homepage of BLASTx at NCBI. The sequence can be entered into the box (angled arrow) as query sequences, either with accession number(s) or as sequence(s) in FASTA format.

FIGURE 14.2 The results page of BLASTx contains a color key‐based alignment display, followed by a tabular description of sequence alignments and, finally, alignment of each of the sequence pairs (a query versus database sequence, called a subject sequence).

Chapter 15

FIGURE 15.1 Homepage for tBLASTn at NCBI. The query sequence(s) can be entered with either accession numbers or sequence(s) in FASTA format.

FIGURE 15.2 The results page of tBLASTn contains the color key‐based alignment display, followed by a tabular description of sequence alignments and, finally, alignment of each of the sequence pairs (query versus database sequences).

Chapter 16

FIGURE 16.1 Main page for tBLASTx search at NCBI. The sequence can be entered into the box as query sequences, with either accession no. or sequence, in FASTA format.

FIGURE 16.2 The result page of tBLASTx contains the color key‐based alignment display, followed by tabular description of sequence alignments and, finally, alignment of each of the sequence pairs (query versus database subject sequences).

Chapter 18

FIGURE 18.1 Setting the parameters of the Primer3 online tool for primer designing.

FIGURE 18.2 Output page of Primer3 online tool, displaying one pair of primers and their position in the input target sequence (asterisks below the bases).

Chapter 19

FIGURE 19.1 Homepage of Oligoanalyzer 3.1, indicating different parameters and functions for the output of the function “Analyze”.

FIGURE 19.2 Output of the function “Hairpin” of the Oligoanalyzer 3.1 tool, displaying the possible hairpins and the related thermodynamic values.

FIGURE 19.3 Prediction of secondary structure in the amplicon using the UNAFold tool of IDT.

FIGURE 19.4 Output result of Primer‐BLAST and selection of primers from the list displayed.

Chapter 21

FIGURE 21.1

FIGURE 21.2

FIGURE 21.3

FIGURE 21.B1

FIGURE 21.B2

FIGURE 21.4

FIGURE 21.5

Chapter 22

FIGURE 22.1

FIGURE 22.2

FIGURE 22.3

Chapter 23

FIGURE 23.1

FIGURE 23.2

FIGURE 23.3

FIGURE 23.4

FIGURE 23.5

Chapter 24

FIGURE 24.1

FIGURE 24.2

FIGURE 24.3

FIGURE 24.4

FIGURE 24.5

FIGURE 24.6

Chapter 25

FIGURE 25.1 Comparative depiction of the phylogenetic tree constructed from the same data set, using the MP method.

Chapter 26

FIGURE 26.1 Compile the unaligned, homologous molecular sequences in FASTA format in a text file.

FIGURE 26.2 Aligning the input sequences using either ClustalW or Muscle available in MEGA7 interface.

FIGURE 26.3 Exporting the alignment file and saving the alignment session for further use.

FIGURE 26.4 Selection of the best evolutionary model for further analyses.

FIGURE 26.5 Setting the parameters for phylogenetic analysis.

FIGURE 26.6 Controlling the display parameters using the menu bar parameters.

FIGURE 26.7 Controlling the tree display parameters using the left‐hand‐side buttons.

FIGURE 26.8 Insertion of figures for the external nodes (species name).

FIGURE 26.9 Saving the output phylogenetic tree as a PNG file.

FIGURE 26.10

Chapter 27

FIGURE 27.1 This dendrogram represents the evolutionary relationship among the taxa. The horizontal axis represents the evolutionary changes over time.

FIGURE 27.2 Swapping of the branches of the sub‐tree of the main tree does not change any meaning represented by the tree. The evolutionary distances between the OTUs remain unchanged.

FIGURE 27.3 Representing the same phylogenetic tree as circular, radiation, rectangular and straight orientations.

FIGURE 27.4 Converting a straight tree to a radiation tree by eliminating the depiction of divergence from common ancestor.

FIGURE 27.5 Phylogenetic trees constructed from nine sequences of 18s rRNA gene belonging to divergent species using various algorithms.

Chapter 28

FIGURE 28.1 Graphical user interface (GUI) of PSIPRED and filling the inputs in the Input tab.

FIGURE 28.2 The output tabs of PSIPRED shown in three sections.

FIGURE 28.3 GUI of the online CDM tool for prediction of protein secondary structures.

Chapter 29

FIGURE 29.1 Pairwise sequence alignment to determine the extent of sequence identity between the query and template sequences.

FIGURE 29.2 Open the page to initiate homology modeling using the SWISS‐MODEL workspace.

FIGURE 29.3 Window of SWISS‐MODEL workspace for providing the input parameters and starting homology modeling.

FIGURE 29.4 Important sections of the SWISS‐MODEL output. One can download the complete result in PDF format, or can specifically download the sections as required.

Chapter 30

FIGURE 30.1 Home window of RaptorX for job submission.

FIGURE 30.2 (A) The modeled and non‐modeled residues; (B) 3D cartoon view of selected template; (C) the target‐template alignment view.

FIGURE 30.3 3D cartoon view of tertiary structure predicted by RaptorX server.

FIGURE 30.4 3 Class SS3 and 8 Class SS8 secondary structural element contribution to the 3D structure.

FIGURE 30.5 Conformationally ordered and disordered contribution of the residues in the 2D and 3D structure (C). Contribution of each residue in solvent accessibility (D).

Chapter 31

FIGURE 31.1 Job progress box of RaptorX after job submission.

FIGURE 31.2 Results windows of RaptorX, indicating assignment of protein domain and 3D prediction results.

Chapter 32

FIGURE 32.1 Homepage of WHAT IF web interface; click on “Build/check/repair model” link (in the left‐hand pane) to initiate validation of the predicted tertiary structure of the peptide.

FIGURE 32.2 Click on the “Upload your file” button to browse the input file and then click on “Send” button to upload the file to the server.

FIGURE 32.3 Output of the WHAT IF analysis of the predicted tertiary structure of the peptide.

FIGURE 32.4 Ramachandran plot for a typical protein structure. The different regions were taken from the observed phi‐psi distribution for 121 870 residues from 463 known X‐ray protein structures.

FIGURE 32.5 The homepage of MOLProbity tool for Ramachandran plot analysis.

Chapter 33

FIGURE 33.1 (a) TRANSFAC database search; (b) FACTOR table search; (c) TRANSFAC Factor entries; (d) output of TRANSFAC Factor table.

FIGURE 33.2 Creating sequences logos using the web interface.

FIGURE 33.3 (a) Searching Transfac matrix table; (b) TRANSFAC Matrix entries; (c) output of TRANSFAC Matrix table.

FIGURE 33.4 (a) MATCH user interface; (b) results page of MATCH output; (c) a simple visual representation of locations of the found matches.

Chapter 34

FIGURE 34.1 File format of inserted nucleotide sequence in NetStart 1.0.

FIGURE 34.2 Output format for translation start predictions for a vertebrate sequence.

FIGURE 34.3 File format of inserted nucleotide sequence in TIS Miner.

FIGURE 34.4 Output format for TIS Miner.

Chapter 35

FIGURE 35.1 The identified active site or cavity within the receptor is marked as a star.

FIGURE 35.2 The “Submit Docking” tab at the top of the homepage of the SwissDock online tool takes you to this page. Upload the target and ligand files by clicking on the appropriate buttons.

FIGURE 35.3 Fitness of ligand and free energy of docked complex of the first and second binding poses, shown as “A” and “B”.

FIGURE 35.4 Fitness of ligand and free energy of docked complex of the third, fourth and fifth binding poses, shown as “C”, “D” and “E”.

Chapter 36

FIGURE 36.1 Homepage of the GeneMark online tool.

FIGURE 36.2 Specifying the parameters in GeneMark.hmm for prokaryotes for gene finding and annotation.

FIGURE 36.3 The output pages of the GeneMark online tool for prokaryotic gene prediction.

Chapter 37

FIGURE 37.1 Homepage of the online GENSCAN software.

FIGURE 37.2 Output page of the GENSCAN software.

Chapter 38

FIGURE 38.1 Amplification plot of RT‐qPCR.

FIGURE 38.2 Construction of standard curve. Construct standard curves for both target and reference genes individually, by plotting Ct values (along the Y‐axis) against the log10 (template amount or dilution) along the X‐axis.

FIGURE 38.3 SYBR Green fluorophore binds with double‐stranded DNA (PCR amplicon). The amount of DNA amplified is proportional to fluorescence intensity.

FIGURE 38.4 Absolute quantification of the transcript using the standard curve method. Using a known amount of DNA, a standard curve is made, and unknown samples are plotted on a regression line of known samples.

FIGURE 38.5 Relative quantification of RT‐qPCR transcript measurement, always expressed in terms of two samples (say, sample A in comparison to Sample B). Relative expression is measured in terms of fold change (either positive or negative fold change). Positive fold change indicates upregulation of genes in the A vs. B sample, whereas negative fold change indicates downregulation of genes in the A vs. B sample.

FIGURE 38.6 Analytical flow diagram of the use of real‐time PCR.

Chapter 39

FIGURE 39.1 Reference design (a) and loop design (b) of a two‐color microarray. Different colors (red and green here) represent microarray chips. In order to avoid dye bias, the same samples are used twice, with opposing labeling schemes, such as array 1: sample a (labeled with red dye) vs. Sample b (labeled with green dye) and array 2: sample b (labeled with red dye) vs. sample a (labeled with green dye).

FIGURE 39.2 Application of microarray for gene expression analysis. Fluorescently labeled cDNA or cRNA is hybridized with probes, and the image is scanned through a scanner. Based upon the intensity of the signal, up regulated (red dots) and down regulated (green dots) genes are detected.

FIGURE 39.3 Data transformation converts the raw signal intensity of each probe‐target hybridization into a log scale. Transformation of the data brings values in a normal distribution.

Chapter 40

FIGURE 40.1 Screenshot of Stacks software: http://creskolab.uoregon.edu/stacks/

FIGURE 40.2 Image of denovo_map.pl script of Stacks to call SNPs de novo from RADSeq data.

FIGURE 40.3 Image of ref_map.pl script of Stacks to call reference‐based SNPs from RAD‐Seq data.

FIGURE 40.4 Screenshot of GATK software website: https://www.broadinstitute.org/gatk/index.php.

FIGURE 40.5 Image of GATK command used to mine SNPs from an example dataset.

FIGURE 40.6 Result of GATK SNPs mining from an example dataset.

Chapter 41

FIGURE 41.1 MISA homepage.

FIGURE 41.2 Download misa.pl.

FIGURE 41.3 Download misa.ini.

FIGURE 41.4 The command prompt where code is written.

FIGURE 41.5 The output, as seen in testfile.misa.

FIGURE 41.6 The output, as seen in testfile.statistics.

Chapter 42

FIGURE 42.1

FIGURE 42.2

Chapter 43

FIGURE 43.1

FIGURE 43.2

FIGURE 43.3

FIGURE 43.4

FIGURE 43.5

FIGURE 43.6

FIGURE 43.7

FIGURE 43.8

FIGURE 43.9

FIGURE 43.10

FIGURE 43.11

FIGURE 43.12

FIGURE 43.13

FIGURE 43.14

Chapter 44

FIGURE 44.1 The basic command for running PRINSEQ‐lite.

FIGURE 44.2 Summary statistics after running prinseq‐lite.pl.

FIGURE 44.3 Six files generated after running prinseq‐lite.pl.

FIGURE 44.4 Workflow for identifying DEGs using Cufflinks.

FIGURE 44.5 UCSC genome browser.

FIGURE 44.6 Click on downloads, genomics data and then select “cow”.

FIGURE 44.7 Zip file and FASTA file of the cow genome.

FIGURE 44.8 Downloading the GTF file.

FIGURE 44.9 Indexing the genome using GMAP.

FIGURE 44.10 Indexing files generated after indexing.

FIGURE 44.11 Files generated after running cufflinks on control BAM file.

FIGURE 44.12 Files generated after running cufflinks on infected BAM file.

FIGURE 44.13 The assemblies.txt file.

FIGURE 44.14 gene_exp.diff file giving the fold change of the genes, along with significance.

FIGURE 44.15 Workflow for identifying DEGs using RSEM and DE packages.

FIGURE 44.16 .bash_profile with the path added.

FIGURE 44.17 Echo $PATH indicating that the path is added.

FIGURE 44.18 wget command downloading the genome from the Ensembl genome browser.

FIGURE 44.19 Folder ftp.ensembl.org created after the download.

FIGURE 44.20 The chromosome gunzip files in the folder ftp.ensembl.org.

FIGURE 44.21 Direct download from the FTP site.

FIGURE 44.22 Index files created after indexing using bowtie 2.0.

FIGURE 44.23 Six files generated after running the calculate expression command.

FIGURE 44.24 Expected counts, TPM and FPKM of each of the ensemblIDs.

FIGURE 44.25 Combining the counts of all the files and rounding them to the nearest integer.

FIGURE 44.26 Loading the EBSeq package in R.

FIGURE 44.27 Input file for EBSeq.

FIGURE 44.28 Running iterations of EBSeq.

FIGURE 44.29 Identifying DEGs in EBSeq.

FIGURE 44.30 Fold change of all the ensemblIDs.

FIGURE 44.31 Significant DE genes.

FIGURE 44.32 Loading DESeq2 package.

FIGURE 44.33 Example input data set.

FIGURE 44.34 Fold change and significance of ensemblIDs.

FIGURE 44.35 Running the DESeq2 package.

FIGURE 44.36 Fold change and significance of ensemblIDs in the file DEDESeq2.txt.

FIGURE 44.37 Significant DEGs in DEDEseq2.txt.

FIGURE 44.38 reOrdered command output and the various column IDs generated.

FIGURE 44.39 Loading the edgeR package in R.

FIGURE 44.40 Input file for edgeR.

FIGURE 44.41 Running edgeR in R

FIGURE 44.42 Fold change and significance of ensemblIDs in DEEdgeR.txt.

Chapter 45

FIGURE 45.1

FIGURE 45.2

FIGURE 45.3

FIGURE 45.4

FIGURE 45.5

FIGURE 45.6

FIGURE 45.7

Chapter 46

FIGURE 46.1 Home page of TargetScan tool.

FIGURE 46.2 Output page showing multiple transcripts in the TargetScan tool.

FIGURE 46.3 Output page of the TargetScan tool, showing conserved sites for miRNA families.

FIGURE 46.4 Detailed table of all the conserved sites.

FIGURE 46.5 Input page of the TargetScan tool.

FIGURE 46.6 Detailed information about the target gene symbol in the TargetScan tool.

FIGURE 46.7 Input page of the psRNATarget tool.

FIGURE 46.8 Input page of the psRNATarget tool for other parameters.

FIGURE 46.9 Result page of the psRNATarget tool.

FIGURE 46.10 Results file of the miRanda tool.

Appendix E

FIGURE E1 Different types of point mutations leading to codon change.

FIGURE E2 The components of a rooted phylogenetic tree.

FIGURE E3 Diagrammatic representation of monophyletic, paraphyletic and polyphyletic groups of taxa.

Appendix F

FIGURE F1 Substitution of nucleotides leading to transition and transversion.

FIGURE F2 Jukes–Cantor one‐parameter substitution model

FIGURE F3 Rates of transition and transversion are the same (α).

FIGURE F4 Amount of base‐substitution in a period of time “ζ” and equal rates of transition and transversion (α).

FIGURE F5 K80 model: amount of base‐substitution in unit time period (t = 1), assuming different rates of transition (α) and transversion (β).

FIGURE F6 F81 model: rate of base‐substitution is different for four bases: adenine (ρA), guanine (ρG), cytosine (ρC) and thymine (ρT).


Basic Applied Bioinformatics

 

Chandra Sekhar Mukhopadhyay

Ratan Kumar Choudhary

Mir Asif Iquebal

 

With contributions from

 

Ravi GVPPS Kumar, Sarika, Dinesh Kumar, Aditya Prasad Sahoo, Amit Kumar, Saurabh Jain, Surbhi Panwar, Ashwani Kumar, Harpreet Kaur Manku

 

 

 

 

 

 

 

 

This edition first published 2018
© 2018 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Chandra Sekhar Mukhopadhyay, Ratan Kumar Choudhary, and Mir Asif Iquebal to be identified as the authors of this work has been asserted in accordance with law.

Registered Office(s): John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office: 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty: In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Mukhopadhyay, Chandra Sekhar, author. | Choudhary, Ratan Kumar, author. | Iquebal, Mir Asif, author.
Title: Basic applied bioinformatics / by Chandra Sekhar Mukhopadhyay, Ratan Kumar Choudhary, Mir Asif Iquebal.
Description: 1st edition. | Hoboken, NJ : Wiley, [2017] | Includes bibliographical references and index. |
Identifiers: LCCN 2017015387 (print) | LCCN 2017019742 (ebook) | ISBN 9781119244370 (pdf) | ISBN 9781119244417 (epub) | ISBN 9781119244332 (hardback)
Subjects: LCSH: Bioinformatics–Textbooks. | BISAC: MEDICAL / Biostatistics.
Classification: LCC QH324.2 (ebook) | LCC QH324.2 .M85 2017 (print) | DDC 572.80285–dc23
LC record available at https://lccn.loc.gov/2017015387

Cover Design: Wiley
Cover Images: (top) © mashuk/Gettyimages; (middle) © alice‐photo/Shutterstock; (bottom) © kentoh/Shutterstock

Dedicated to students, researchers and professionals

Preface

Bioinformatics is a discipline that applies information science to biology and attempts to make predictions about biological functions using data from molecular sequence (nucleotide and protein) analysis. Over the years, it has evolved exponentially in the genomics era. Today it has become an indispensable component of biological science, including its application in a number of applied areas.

There has long been a need among students and researchers for a primer on the application of bioinformatics tools in various spheres of veterinary and agricultural sciences. In this era of multi‐tasking, research workers who do not have a background in computing or bioinformatics often stumble over in silico analysis of molecular data. This book provides practical know‐how for graduate students of bioinformatics, biotechnology and other streams of biological science, and also for those who need to learn bio‐computational analyses of the large volume of molecular data that is being generated in thousands of laboratories throughout the world.

The topics considered in this book are the basic ones that a student or researcher of the fields above should know. In addition, this book covers the syllabi of the graduate or undergraduate course called “Introduction to Bioinformatics” (or a similarly named course) that is offered in several universities. Some chapters, covering areas such as genome annotation in prokaryotes and eukaryotes, an overview of microarray data analysis, use of MISA for microsatellite sequence identification and SNP mining, have also been included for out‐of‐the‐box applications.

In general, the book serves as a reference for those working in biocomputational research and studies, and its contents cover wide areas of bioinformatics. Several software tools are freely available (online or offline), and students and researchers can use them for in silico analysis. However, in some instances, students become stuck while optimizing parameters for data analysis and drawing appropriate inferences. Also, they are often not familiar with the terminology. This book explains the steps for parameter optimization of the tools being used, as well as the basic terminology. The results obtained are explained, to demonstrate how inferences are drawn.

This book can also serve as a practical manual for the elucidation of critical steps, with annotation and explanation. It begins with basic aspects of bioinformatics, including frequently used terminology, concept development, handling molecular sequences, BLAST analyses, primer designing, phylogenetic tree construction, prediction of protein structures and genome annotation. In the last few chapters, some advanced topics of bioinformatics have been covered, such as analysis of transcriptome data, identification of differentially expressed genes and prediction of microRNA targets.

Each chapter demonstrates the steps with an example, which involves stepwise elucidation of the procedures and explanation of the obtained results. The practical methodology is depicted with screenshots of the software being used, along with legends to explain the screenshot view. New terminologies introduced in some chapters have been provided. Additionally, four or five questions are given at the end of each chapter, with any hints which are deemed to be required for some questions.

We believe that some unintentional mistakes may remain in this book. We sincerely request the reader to apprise the editors of typographical or other errors, if found. It is very common for the versions of molecular sequences in public repositories to be updated over time, or for one or more sequence entries to be deleted. Readers are requested to update the editors about such changes. Similarly, the uniform resource locators (URLs) of the websites containing bioinformatics tools or databases can change suddenly. We will be careful to incorporate these changes in the next edition of the book. Readers are also requested to assist us in this regard.

It is hoped that this book will be a useful primer for beginners of this fast‐expanding field.

Acknowledgments

We thank Ms. Mindy Okura‐Marszycki and Mr. Vishnu Narayanan of Wiley for their timely help and encouragement. We extend a special note of thanks to Prof. G.S. Brah, Founder Director, and Prof. Ramneek Verma, Director, of the School of Animal Biotechnology, GADVASU, Ludhiana, for providing a conducive working environment and for inspiring us to contribute to the field of bioinformatics. They evaluated some of the chapters and raised critical questions for improving the quality. The authors gratefully acknowledge Miss J.K. Dhanoa and Ms. H.K. Manku for thoroughly checking the syntax of the manuscript, and for helping in editing and framing the diagrams in the proper format. The chapters were also evaluated for lucidity and ease of understanding by the graduate students of the Iowa State University, Mrs. Supreet Kaur (MSc Biochemistry) and Shravanti Krishna (PhD Biochemistry). Sincere thanks are extended to Dr. Nikhlesh Singh (Assistant Professor, Physiology, the University of Tennessee Health Science Center (UTHSC), Memphis, USA), Dr. Monson Melissa (Postdoc Research Associate, Animal Science, Iowa State University) and Dr. Sangita Singh (Post Doc., Department of Food Science and Human Nutrition, Iowa State University) for their constructive criticisms to improve the chapters. Dr. Shivani Sood, Assistant Prof. (Biotechnology, Mukand Lal National College, Yamuna Nagar, Haryana, India), deserves special mention for critically checking the chapters and giving constructive input. All the freely available software and databases covered in this book are duly acknowledged. We are obliged to all those who have directly or indirectly contributed to writing this book.

Our sources of inspiration have been our families, colleagues and students. Nevertheless, we bow before the Almighty and Mother Nature for giving us the potential to accomplish the task.

List of Abbreviations

AFLP: amplified fragment length polymorphism
ASCII: American Standard Code for Information Interchange
BAC: bacterial artificial chromosome
BAM: binary alignment/map
BIC: Bayesian information criterion
BLAST: basic local alignment search tool
BWA: Burrows–Wheeler algorithm
BWT: Burrows–Wheeler transformation
cDNA: complementary DNA
CINEMA: color interactive editor for multiple alignments
cRNA: complementary RNA
dbGaP: database of genotypes and phenotypes
dbVar: database of variation
DDBJ: DNA data bank of Japan
DEG: differentially expressed genes
DNA: deoxyribonucleic acid
DP: dynamic programming
EMBL: European Molecular Biology Laboratory
EST: expressed sequence tag
ExPASy: expert protein analysis system
F81 model: Felsenstein (1981) model
FASTA: fast all
FDQN: fully quantified domain name
FPKM: fragment per kilobase of exon per million mappable reads
GATK: genome analysis tool kit
gi or GI: gene identity
GO: gene ontology
GOR: Garnier, Osguthorpe, and Robson
GSS: genome survey sequence
GTF: gene transfer format
GTR: generalized time‐reversible
GUI: graphical user interface
GWAS: genome‐wide association studies
HKY85 model: Hasegawa, Kishino and Yano (1985) model
IBL: internal branch length
InDels: insertion and deletions
INSDC: International Nucleotide Sequence Database Collaboration
IUPAC: International Union of Pure and Applied Chemistry
JALVIEW: Java alignment viewer
JC69 model: Jukes and Cantor (1969) model
K80 model: Kimura (1980) model
MACAW: multiple alignment construction and analysis workbench
MAFFT: multiple alignment using fast Fourier transform
ME: minimum evolution
MEGA: molecular evolution and genetic analysis
MISA: microsatellite identification tool
ML: maximum likelihood
MP: maximum parsimony
MSA: multiple sequence alignment
MSRE: methylation sensitive restriction enzymes
MUSCLE: multiple sequence comparison by log‐expectation
mYa: million years ago
NBRF: National Biomedical Research Foundation
NCBI: National Center for Biotechnology Information
NGS: next‐generation sequencing
NJ: neighbor joining
NWA: Needleman–Wunsch algorithm
ORF: open reading frame
OTU: operational taxonomic unit
PDB: protein data bank
pI/MW: isoelectric point to molecular weight ratio
PIR: protein information resource
PSD: protein sequence database
PWMs: position weight matrices
RCSB: Research Collaboratory for Structural Bioinformatics
RE: restriction enzyme
RF: reading frame
RFLP: restriction fragment length polymorphism
RPKM: read per kilobase of exon per million mappable reads
rRNA: ribosomal RNA
SAM: sequence alignment/map
SCOP: structural classification of protein
SIB: Swiss Institute of Bioinformatics
SNPs: single nucleotide polymorphisms
SPR: subtree pruning regrafting
SSR: simple sequence repeats
STS: sequence‐tagged site
SWA: Smith–Waterman algorithm
T92 model: Tamura (1992) model
TBR: tree bisection reconnection
T‐Coffee: tree‐based consistency objective function for alignment evaluation
TFBS: transcription factor binding sites
TFs: transcription factors
TIS: translation initiation sites
TN93 model: Tamura and Nei (1993) model
T‐P: transversion‐parsimony
TPA: third‐party annotation
TPM: transcripts per million
TRANSFAC: transcription regulatory factors
tRNA: transfer RNA
TTS: triplex‐forming oligonucleotide target sequences
uGDT: unnormalized global distance test
UniProt: universal protein resource
UPGMA: unweighted pair group method with arithmetic mean
URL: uniform resource locator
VCF: variant call format
VNTR: variable number tandem repeat
WGS: whole genome shotgun
wwPDB: worldwide protein data bank
YAC: yeast artificial chromosome

SECTION I Molecular Sequences and Structures

CHAPTER 1 Retrieval of Sequence(s) from the NCBI Nucleotide Database

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

1.1 INTRODUCTION

The NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/nucleotide/) is an archive of gene, transcript, and fragments of genomic DNA sequences. It combines several online public repositories, including GenBank (the genetic sequence database of NIH), RefSeq (annotated, non‐redundant reference sequence from genomic, transcript and protein), TPA (third‐party annotated data on nucleotide sequences), and PDB (protein databank: a repository of 3D structures of proteins and nucleic acids). The International Nucleotide Sequence Database Collaboration (INSDC) maintains the liaison between the three major molecular data repositories – namely, NCBI, DDBJ, and EMBL – to share the nucleotide data present in any of those databanks.

A brief description of the NCBI databases has been given in Appendix A “NCBI Database: A Brief Account” at the end of this book.

1.2 COMPONENTS OF THE NCBI NUCLEOTIDE DATABASE

GenBank: An annotated collection of all publicly available nucleotide and in silico translated protein sequences.

EST database: Maintains expressed sequence tags (ESTs) and short, single‐pass reads (the sequence‐fragments/reads obtained by loading the reaction in a lane only once and, hence, obtained after analyzing the input sequence by the sequencer only once) from mRNA (cDNA).

GSS database: A database of genome survey sequences (GSS), or short single‐pass genomic sequences (TTS, Exon Trapped, BAC/YAC, etc.)

1.3 OBJECTIVES

To search for and download nucleotide sequences from the NCBI Nucleotide database and save them as a text file (*.txt). The sequence of interest could be a complete or partial gene/mRNA/coding sequence, non‐coding RNA (rRNA, tRNA), non‐coding and repeat sequences (VNTR) in the genome, partial genomic DNA sequences, and so on.

1.4 PROCEDURE

1.4.1 Nucleotide sequence search

Open the NCBI nucleotide page:

http://www.ncbi.nlm.nih.gov/nucleotide/

Search for the target sequence by providing the name of the gene and keywords, for example the Drosha gene sequence in taurine cattle (Bos taurus) (Figure 1.1). Thus, the keywords are: “Drosha Bos taurus” (type your keywords without quotes; otherwise the quotes will instruct the search engine to find the exact phrase within the quotes, which limits your search). Then click on the “Search” button.

The nucleotide sequence(s) can also be searched by specifying the accession number(s), separated by a space (or comma). Please note that from September 2016 onwards, NCBI has phased out the sequence gi numbers. The accession numbers are unique codes assigned as an identifier to each nucleotide sequence in the database.
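The same accession-based retrieval can also be scripted. The sketch below is not part of the book's procedure: it assumes the NCBI E-utilities `efetch` endpoint and only builds the request URL, without contacting NCBI. The accession strings are placeholders, and the helper name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (assumed endpoint).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, rettype="fasta"):
    """Build an efetch URL for one or more nucleotide accession numbers."""
    # Accession numbers are joined with commas, mirroring the web search box.
    ids = ",".join(a.strip() for a in accessions if a.strip())
    if not ids:
        raise ValueError("at least one accession number is required")
    params = {"db": "nuccore", "id": ids, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


# Placeholder accession numbers (hypothetical, for illustration only).
url = efetch_url(["ACCESSION_1", "ACCESSION_2"])
print(url)
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the records as plain text, provided the placeholders are replaced with real accession numbers.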

FIGURE 1.1 Main search window of NCBI Nucleotide page and list of hits for nucleotide sequences of taurine Drosha (gene/mRNA).

1.4.2 Downloading the selected sequences

Now, for example, select the first three sequences (depending on your requirement) by checking the small checkboxes on the left‐hand side of each of the sequences.

Click on the “Send to” button, located at the top‐right side of the page (Figure 1.2). Choose the destination of the selected sequences: a *.txt file, the clipboard (for copying and pasting to a separate file), or a collection in your NCBI account (saving to a collection requires registering with NCBI to obtain an account ID and password). Select the sequence format (Summary, GenBank, FASTA, etc.), the items per page and the mode of sorting the selected sequences from the drop‐down menus before downloading to a text file.

Finally, click on the “Create File” button to download the nuccore_result.txt file (default name) (see Figure 1.2, below). Open the file to obtain the sequences in the specified format and order.
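If the FASTA format was selected, the downloaded text file can be read programmatically. The following is a minimal sketch, assuming plain FASTA text (a header line beginning with ">" followed by sequence lines). The function name `parse_fasta` and the toy records are made up for illustration and are not NCBI data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made-up headers and bases, not real NCBI records).
example = """>seq1 hypothetical record one
ATGCATGC
ATGC

>seq2 hypothetical record two
GGCCTTAA
"""
for header, seq in parse_fasta(example):
    print(header.split()[0], len(seq))
```

To apply this to a real download, the file contents could be passed in with something like `parse_fasta(open("nuccore_result.txt").read())`. Files saved in the "Summary" or "GenBank" formats have a different layout and would need a different parser.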

FIGURE 1.2 Click on the “Send to” button to download and save (in a text file) the first three Drosha mRNA sequences in “Summary” format.

1.5 SOME USEFUL NUCLEOTIDE SEQUENCE DATABASES OF NCBI

One can search other NCBI databases that archive nucleotide sequences:

species‐wise or chromosome assembly search (WGS or other assembly of chromosome or genome, like Bos_taurus_UMD_3.1.1);

clone (clones associated with genomics, cDNA and cell‐based libraries, viz. BTDAEX‐80 K11, HWYUBAC‐1‐028‐04‐H12);

dbGaP (interaction of genotype and phenotype, viz. phs000287.v4.p1 Cardiovascular Health Study (CHS) Cohort);

dbVar (large‐scale genomic variation, nsv836042, nsv836041, etc.), SNP, etc., among many other databases. The process of downloading the data as a text file is the same.

1.5.1 Modifying the search with the “Limits” option (currently not available)

Previously, the user could narrow down the search by using the parameters available after clicking on the hyperlink “Limits”. However, NCBI has since removed this option. The options that were available are listed here for reference:

Published in the last (specify the available days or mention date range).

Modified in the last (specify the available days or mention your own date range).

Search Field Tags (different fields of the GenBank flat file: Accession, Author, Bioproject, etc.).

Segmented sequences (master of set or part of set).

Source database (RefSeq or GenBank or EMBL or DDBJ or PDB).

Molecule (Genomic DNA/RNA, mRNA, rRNA or cRNA).

Gene Location (Genomic DNA/RNA or Mitochondrion, Chloroplast or “any” of the above types).

Exclude (STSs and/or working draft and/or TPA and/or patents).

1.5.2 Modifying the search with “Nucleotide Advanced Search Builder”

Click on the hyperlinked word “Advanced” just below the text box. The new page enables you to build your search settings.

Please note that the search builder enables us to specify the keywords according to their type (i.e., accession, assembly, author, journal and so on); in turn, this instructs the search engine to pinpoint the keywords from the database, depending on its feature or type.

Let us take our previous example: “Drosha Bos taurus”. In the search builder, click on the drop‐down menu (shown as “All Fields”) and select “Gene Name” and type “Drosha”. The role of “Show index list” is discussed in the next paragraph. Next, click on the drop‐down list of the second‐row field and select “Organism” and then type “Bos taurus” (without quotes). If you have more keywords, then add more rows accordingly, and select the specific field before typing the keyword(s).

The “Show index list” button will show the list of indexes from which you can specify your index. To move further along the index, you can use “Previous 200” or “Next 200” options. The “+” and “–” symbols beside each of the text boxes allow you to add (new) or delete the corresponding text boxes.
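The field-specific query assembled by the search builder can be written as a single text string, in which each keyword is followed by its field name in square brackets. The sketch below illustrates this for the “Drosha” and “Bos taurus” example. It assumes the Entrez `keyword[Field]` syntax and the E-utilities `esearch` endpoint, and only constructs the query and URL. The helper names `build_term` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service (assumed endpoint).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(fields):
    """Join (keyword, field) pairs with AND, one pair per search-builder row."""
    return " AND ".join(f"{keyword}[{field}]" for keyword, field in fields)


def esearch_url(term, retmax=20):
    """Build an esearch URL for the nucleotide (nuccore) database."""
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The two rows of the search builder from the example above.
term = build_term([("Drosha", "Gene Name"), ("Bos taurus", "Organism")])
print(term)
print(esearch_url(term))
```

Adding a further row in the builder corresponds to appending another `(keyword, field)` pair to the list. Only the AND operator is handled in this sketch.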