Python Programming for Linguistics and Digital Humanities - Martin Weisser - E-Book

Description

Learn how to use Python for linguistics and digital humanities research, perfect for students working with Python for the first time

Python programming is no longer only for computer science students; it is now an essential skill in linguistics, the digital humanities (DH), and social science programs that involve text analytics. Python Programming for Linguistics and Digital Humanities provides a comprehensive introduction to this widely used programming language, offering guidance on using Python to perform various processing and analysis techniques on text. Assuming no prior knowledge of programming, this student-friendly guide covers essential topics and concepts such as installing Python, using the command line, working with strings, writing modular code, designing a simple graphical user interface (GUI), annotating language data in XML and TEI, creating basic visualizations, and more.

This invaluable text explains the basic tools students will need to perform their own research projects and tackle various data analysis problems. Throughout the book, hands-on exercises provide students with the opportunity to apply concepts to particular questions or projects in processing textual data and solving language-related issues. Each chapter concludes with a detailed discussion of the code applied, possible alternatives, and potential pitfalls or error messages.

  • Teaches students how to use Python to tackle the types of problems they will encounter in linguistics and the digital humanities
  • Features numerous practical examples of language analysis, gradually moving from simple concepts and programs to more complex projects
  • Describes how to build a variety of data visualizations, such as frequency plots and word clouds
  • Focuses on the text processing applications of Python, including creating word and frequency lists, recognizing linguistic patterns, and processing words for morphological analysis
  • Includes access to a companion website with all Python programs produced in the chapter exercises and additional Python programming resources

Python Programming for Linguistics and Digital Humanities: Applications for Text-Focused Fields is a must-have resource for students pursuing text-based research in the humanities, the social sciences, and all subfields of linguistics, particularly computational linguistics and corpus linguistics.

Number of pages: 571

Year of publication: 2023

Table of Contents

Cover

Table of Contents

Title Page

Copyright Page

Dedication Page

List of Figures

About the Companion Website

1 Introduction

1.1 Why Program? Why Python?

1.2 Course Overview and Aims

1.3 A Brief Note on the Exercises

1.4 Conventions Used in this Book

1.5 Installing Python

1.6 Introduction to the Command Line/Console/Terminal

1.7 Editors and IDEs

1.8 Installing and Setting Up WingIDE Personal

1.9 Discussions

2 Programming Basics I

2.1 Statements, Functions, and Variables

2.2 Data Types – Overview

2.3 Simple Data Types

2.4 Operators – Overview

2.5 Creating Scripts/Programs

2.6 Commenting Your Code

2.7 Discussions

3 Programming Basics II

3.1 Compound Data Types

3.2 Lists

3.3 Simple Interaction with Programs and Users

3.4 Problem Solving and Damage Control

3.5 Control Structures

4 Intermediate String Processing

4.1 Understanding Strings

4.2 Cleaning Up Strings

4.3 Working with Sequences

4.4 More on Tuples

4.5 ‘Concatenating’ Strings More Efficiently

4.6 Formatting Output

4.7 Handling Case

4.8 Discussions

5 Working with Stored Data

5.1 Understanding and Navigating File Systems

5.2 Stored Data

5.3 Opening and Closing Files

5.4 Reading File Contents

5.5 Error Handling

5.6 Writing to Files

5.7 Working with Folders and Paths

5.8 Discussions

6 Recognising and Working with Language Patterns

6.1 The re Module

6.2 General Syntax

6.3 Understanding and Working with the Match Object

6.4 Character Classes

6.5 Quantification

6.6 Masking and Using Special Characters

6.7 Regex Error Handling

6.8 Anchors, Groups and Alternation

6.9 Constraining Results Further

6.10 Compilation Flags

6.11 Discussions

7 Developing Modular Programs

7.1 Modularity

7.2 Dictionaries

7.3 User‐defined Functions

7.4 Understanding Modules

7.5 Documenting Your Module

7.6 Installing External Modules

7.7 Classes and Objects

7.8 Testing Modules

7.9 Discussions

8 Word Lists, Frequencies and Ordering

8.1 Introduction to Word and Frequency Lists

8.2 Generating Word Lists

8.3 Sorting Basics

8.4 Generating Basic Word Frequency Lists

8.5 Lambda Functions

8.6 Discussions

9 Interacting with Data and Users Through GUIs

9.1 Graphical User Interfaces

9.2 PyQt Basics

9.3 Designing More Advanced GUIs

9.4 Discussions

10 Web Data and Annotations

10.1 Markup Languages

10.2 Brief Intro to HTML

10.3 Using the urllib.request Module

10.4 Extracting Text from Web Pages

10.5 List and Dictionary Comprehension

10.6 Brief Intro to XML

10.7 Complex Regex Replacements Using Functions

10.8 Brief Intro to the TEI Scheme

10.9 Discussions

11 Basic Visualisation

11.1 Using Matplotlib for Basic Visualisation

11.2 Creating Word Clouds

11.3 Filtering Frequency Data Through Stop‐Words

11.4 Working with Relative Frequencies

11.5 Comparing Frequency Data Visually

11.6 Discussions

12 Conclusion

Appendix – Program Code

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Most useful data types.

Table 2.2 Some useful string methods.

Table 2.3 Character positions in ASCII and Latin 1.

Table 2.4 Important functions for working with numbers.

Table 2.5 String operators.

Table 2.6 Mathematical operators.

Table 2.7 Logical operators.

Chapter 3

Table 3.1 List of compound data types.

Table 3.2 Useful list methods.

Chapter 4

Table 4.1 More string methods.

Table 4.2 Index positions for slices.

Table 4.3 Case handling methods.

Chapter 5

Table 5.1 Common error types.

Chapter 6

Table 6.1 Regex methods and functions.

Table 6.2 Methods of the re match object.

Chapter 7

Table 7.1 Useful dictionary methods.

Chapter 9

Table 9.1 Some useful widgets.

Table 9.2 PyQT layout options.

List of Illustrations

Chapter 1

Figure 1.1 Sample text analysis in the Voyant Tools.

Figure 1.2 Python installer running on Windows.

Figure 1.3 Python installer running on macOS.

Figure 1.4 Activating the command prompt via the Windows Start menu.

Figure 1.5 Finding the Path settings.

Chapter 3

Figure 3.1 The Debug Environment dialogue in the WingIDE.

Chapter 5

Figure 5.1 File hierarchy for a Windows drive.

Figure 5.2 Folder content display on Windows.

Figure 5.3 Folder content display on macOS.

Figure 5.4 Folder listings on Windows and Ubuntu Linux.

Chapter 9

Figure 9.1 A minimal GUI program.

Figure 9.2 File menu of the Widget Demo program.

Figure 9.3 The frequency list GUI.

Figure 9.4 Layout for GUI inversion.

Chapter 10

Figure 10.1 Sample HTML page.

Figure 10.2 The Downloader GUI.

Figure 10.3 Abridged sample XML document.

Figure 10.4 TEI header for the document to be produced in Exercise 63.

Figure 10.5 Beginning of the text body for the TEI version of Frankenstein....

Chapter 11

Figure 11.1 Illustration of scatter, plot, and bar methods in Matplotlib.

Figure 11.2 Absolute versus relative frequencies in comparing two novels.

Figure 11.3 Frequency comparison as stacked bar chart.

Figure 11.4 Original Pandas DataFrame created from two dictionaries.

Figure 11.5 Transposed DataFrame.

Python Programming for Linguistics and Digital Humanities

Applications for Text‐Focused Fields

Martin Weisser

Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data

Names: Weisser, Martin, author.
Title: Python programming for linguistics and digital humanities : applications for text‐focused fields / Martin Weisser.
Description: Hoboken, New Jersey : Wiley‐Blackwell, 2023. | Includes index.
Identifiers: LCCN 2023025982 (print) | LCCN 2023025983 (ebook) | ISBN 9781119907947 (paperback) | ISBN 9781119907954 (adobe pdf) | ISBN 9781119907961 (epub)
Subjects: LCSH: Python (Computer program language) | Computer programming. | Computational linguistics.
Classification: LCC QA76.73.P98 W45 2023 (print) | LCC QA76.73.P98 (ebook) | DDC 005.13/3‐‐dc23/eng/20230612
LC record available at https://lccn.loc.gov/2023025982
LC ebook record available at https://lccn.loc.gov/2023025983

Cover Design: Wiley
Cover Images: © ersin ergin/Shutterstock; © 2023 Martin Weisser; “Python” and Python logos are trademarks or registered trademarks of the Python Software Foundation and are used with permission

To Ye,
without whose constant support over the years
writing books like this would not have been possible

List of Figures

Figure 1.1 Sample text analysis in the Voyant Tools.

Figure 1.2 Python installer running on Windows.

Figure 1.3 Python installer running on macOS.

Figure 1.4 Activating the command prompt via the Windows Start menu.

Figure 1.5 Finding the Path settings.

Figure 3.1 The Debug Environment dialogue in the WingIDE.

Figure 5.1 File hierarchy for a Windows drive.

Figure 5.2 Folder content display on Windows.

Figure 5.3 Folder content display on macOS.

Figure 5.4 Folder listings on Windows and Ubuntu Linux.

Figure 9.1 A minimal GUI program.

Figure 9.2 File menu of the Widget Demo program.

Figure 9.3 The frequency list GUI.

Figure 9.4 Layout for GUI inversion.

Figure 10.1 Sample HTML page.

Figure 10.2 The Downloader GUI.

Figure 10.3 Abridged sample XML document.

Figure 10.4 TEI header for the document to be produced in Exercise 63.

Figure 10.5 Beginning of the text body for the TEI version of Frankenstein.

Figure 11.1 Illustration of scatter, plot, and bar methods in Matplotlib.

Figure 11.2 Absolute versus relative frequencies in comparing two novels.

Figure 11.3 Frequency comparison as stacked bar chart.

Figure 11.4 Original Pandas DataFrame created from two dictionaries.

Figure 11.5 Transposed DataFrame.

About the Companion Website

This book is accompanied by a companion website.

https://www.wiley.com/go/weisser/pythonprogling

This website includes:

Text

Codes

1 Introduction

This book is designed to provide you with an overview of the most important basic concepts in Python programming for Linguistics and text‐focussed Digital Humanities (henceforth DH) research. To this end, we'll look at many practical examples of language analysis, starting with very simple concepts and simplistic programs, gradually working our way towards more complex, ‘applied’, and hopefully useful projects. I'll assume no extensive prior knowledge about computers other than that you'll know how to perform basic tasks like starting the computer and running programs, as well as some slight familiarity with file management, so no in‐depth knowledge in mathematics or computer science is required. All necessary concepts will be introduced gently and step‐by‐step.

Before we go into discussing the structure and content of the book, though, it's probably advisable to spend a few minutes thinking about why, as someone presumably more interested in the Arts and Humanities than technical sciences, you should actually want to learn how to write programs in Python.

1.1 Why Program? Why Python?

Nowadays, more and more of the research we carry out in the primarily language‐ or text‐oriented disciplines involves working with electronic texts. And although many tools exist for analysing such documents, these are often limited in their functionality because they may either have been produced for very specific purposes, or designed to be as generic as possible, so that they may be applied to as great a variety of tasks as possible. In both cases, these tools will have been created bearing in mind only the functionality that their creators envisaged as being necessary, and they generally don't offer many options for customising them towards one's own needs. In addition, while the results they produce might be suitable for carrying out the kind of distant reading often propagated in DH, without any in‐depth knowledge of how these programs have arrived at the snapshots or summaries of the data they have produced – as well as which potential errors may have been introduced in the process – one is never completely in control of the underlying data and their potentially idiosyncratic characteristics. To illustrate this point, let's take a look at the analysis output of a popular DH tool, the Voyant Tools (https://voyant-tools.org), displayed in Figure 1.1.

Figure 1.1 Sample text analysis in the Voyant Tools.

The text in Figure 1.1 is part of the German Text Archive (Deutsches Textarchiv; DTA), which provides direct links to the Voyant Tools as a convenient way to visualise prominent features of a text, such as the most frequent ‘words’ and their distribution within the text. For our present purposes, it is actually irrelevant that the language is German because you don't need to be able to understand the text itself at all, but merely observe that the tool ‘believes’ that the most prominent words therein are a, b, c, x, and 1. This can be seen in the word cloud on the top left‐hand side, the summary below it, and the distributional graph on the top right‐hand side. Now, of course, most of us would not see these most frequent items as words at all, but rather as letters and a number, all of which hardly represent any information about the content of the text, which is usually what the most frequent words should do, at least to some extent, as we'll see in Chapter 8 when we learn to create our own frequency lists, and then develop them further to fit our needs in later chapters. The reason for these items occurring so frequently in the different visualisations in Figure 1.1 is that the text is actually about mathematics, and hence comprises many equations and other paradigms that contain these letters, but, as pointed out before, have relatively little meaning in and of themselves other than in these particular contexts. To be able to capture the ‘aboutness’ of the text itself in a form of distant reading, we'd need to remove these particular high‐frequency items, so that the actual content words in the text might become visible. However, the Voyant Tools simply don’t seem to allow us to, and hence appear to be – at least at first glance – designed around a rather naïve notion about what constitutes a word and how it becomes relevant in a context. 
Only if you hover over the question marks in the interface do you actually see that there are indeed options provided for setting the necessary filters. In addition, if you look at the distributional graph on the top right‐hand side, you may note that the frequencies are plotted against “Document Segments”, but we really have no indication as to what these segments may be. It rather looks like the document may simply have been split into 10 equally sized parts from which the frequencies have been extracted, but such equally sized parts don't actually constitute meaningful segments of the text in the way that chapters or sections would. Furthermore, the concordance – i.e. the display of the individual occurrences in a limited context – for the “Term” a displayed on the bottom right‐hand side is misleading. The first four lines in fact don't represent instances of the mathematical variable a, which accounts for the majority of instances of this ‘word’, but instead constitute the initial A., which appears to have been downcased automatically by the tool. Downcasing is fairly common practice in language analysis, as it makes it possible to count sentence‐initial and sentence‐internal forms together, but here it clearly produces misleading results because this particular type of abbreviation is not treated differently from other word forms.
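
The kind of filtering discussed above can be sketched in a few lines of Python. This is only a minimal illustration and not the approach developed in Chapter 8: the sample sentence, the function name frequency_list, and the decision to discard single‐character tokens and purely numeric tokens are all assumptions made for this example.

```python
import re
from collections import Counter


def frequency_list(text, min_length=2):
    """Count word forms, skipping very short tokens and numbers.

    The tokenisation (runs of word characters) and the min_length
    filter are simplifying assumptions for illustration only.
    """
    # Downcase so that sentence-initial and sentence-internal forms
    # are counted together (this also conflates initials like 'A.').
    tokens = re.findall(r"\w+", text.lower())
    # Drop single letters such as 'a' or 'x' and purely numeric tokens.
    words = [t for t in tokens if len(t) >= min_length and not t.isdigit()]
    return Counter(words)


# Hypothetical sample mixing equation-like material with running text.
sample = "If a = b and b = c then a = c. The proof of the theorem uses 1 axiom."
for word, count in frequency_list(sample).most_common(3):
    print(word, count)
```

A length‐based filter like this is deliberately crude: it would also remove genuine one‐letter words such as the English article a, which is exactly the kind of decision one needs to be in control of.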

This example will already have demonstrated to you how important it is to be in control of the data we want to analyse, and that we cannot always rely on programs – or program modules (see Section 7.4) – that others have written. Yet another reason for writing our own programs, though, is that, even if some programs might allow us to do part of the work, they may not do everything we need them to do, so that we end up working with multiple programs that could even produce different output formats that we'd then need to convert into a different, suitable, form before being able to feed data from one program into the next. Moreover, apart from being rather cumbersome and tedious, such a convoluted process may also be highly error prone.

In terms of what we might want to achieve through writing our own programs, there are a few things that you may already have observed in the above example, but in order to make such potential objectives a little clearer and expand on them, let's frame them as a series of “How can we …”‐questions:

… generate customised word frequency lists or graphs thereof to facilitate topic identification/distant reading?

… gather document/corpus statistics for syllables, words, sentences, or paragraphs, and output them in a suitable format?

… identify (proto‐)typical meanings, uses, and collocations of words?

… extract or manipulate parts of texts to create psycholinguistic experiments, or for teaching purposes?

… convert simple documents into annotated formats that allow specific types of analysis?

… create graphical user interfaces (GUIs) to edit or otherwise interact with our data?

We certainly won't be able to answer all these questions fully in this book, but we'll at least work towards developing the means of achieving partial solutions to them.

Having discussed why we should write our own programs at all, let's now think briefly about why Python may be the right choice for this task. First of all, despite the fact that Python has already been around for more than 30 years at the time of writing this book, it is a very modern programming language that implements a number of different programming paradigms – i.e. different approaches to writing programs – about which, however, we won't go into much detail here because they are beyond the scope of this book. More importantly, though, Python is relatively easy to learn, available for all common platforms, and the programs you write in it can be executed directly without prior compilation, i.e. having to create one single program from all the parts by means of another program. This makes it easier to port your programs to different operating systems and test them quickly.

In terms of the programming paradigms briefly referred to above, it is important to note that Python is object‐oriented (see Chapter 7) but can be used procedurally. In other words, although using object orientation in Python provides many important opportunities for writing efficient, robust, and reusable programs, unlike in languages like Java, it's not necessary to understand how to create an object and all the logic this entails before actually beginning to write your programs. This is another reason why the Python learning curve is less steep than that for some other popular programming languages that could otherwise be equally suitable.
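
To make the contrast between the two styles a little more concrete, here is a small sketch that solves the same toy task twice, once procedurally and once with an object. The function count_vowels and the class Word are hypothetical names invented for this illustration, and the vowel set is a simplification for English spelling only.

```python
# Procedural style: a plain function operating on a string.
def count_vowels(word):
    """Return the number of vowel letters (a, e, i, o, u) in word."""
    return sum(1 for char in word.lower() if char in "aeiou")


# Object-oriented style: the same logic wrapped in a (hypothetical) class.
class Word:
    """A minimal model of a word form."""

    def __init__(self, form):
        self.form = form

    def count_vowels(self):
        """Return the number of vowel letters in this word form."""
        return sum(1 for char in self.form.lower() if char in "aeiou")


print(count_vowels("Linguistics"))
print(Word("Linguistics").count_vowels())
```

Both calls produce the same count; the procedural version simply requires less scaffolding, which is why it is possible to start writing useful programs before learning about classes.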

Despite my initial cautionary note about using other people's modules, of course we don't always want to reinvent the wheel when it comes to particular tasks that someone else may already have solved in an appropriate way. Thus, as long as we can ensure that these modules in fact do what we expect them to do, there are many additional modules available for Python that may simplify specific problems, such as parsing out the content of web pages in order to extract only the parts we may require, etc.
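
As a rough idea of what relying on an existing module can look like, the following sketch uses the html.parser module from Python's standard library to keep only the text inside paragraph elements of a web page. The choice of this particular module, the class name TextExtractor, and the tiny sample page are assumptions for this illustration; the book's own approach to web data in Chapter 10 may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found inside <p> elements only."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # Start a new, initially empty paragraph when <p> opens.
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside paragraphs (e.g. headings) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical sample page; real pages are usually far messier.
page = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>paragraph</b>.</p><p>Second one.</p>"
    "</body></html>"
)
extractor = TextExtractor()
extractor.feed(page)
extractor.close()
print(extractor.paragraphs)
```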

Last, but not least, another important advantage of Python is that it is becoming increasingly popular with linguists and computational linguists, so that a) you may be able to find many suitable modules to simplify your tasks, and b) – more importantly – there are many opportunities to cooperate with like‐minded researchers in your programming efforts or get advice from more experienced programmers.

1.2 Course Overview and Aims

In this section, I'll first present an overview of the book. As many of you are probably less familiar with issuing commands in text form to interact with your computer's operating system, prior to delving into our actual programming efforts, I’ll first introduce the most important concepts involved in working with the computer in this way, and in installing the software required for our purposes. Following this, Chapters 2 and 3 will introduce you to programming fundamentals – statements, variables, control structures, etc. – thereby enabling us to develop strategies for solving language‐related questions computationally in their most basic form. In Chapter 2, you'll also learn some of the basics of working with strings, which represent the most useful data type for our language‐related purposes.

Chapter 4 is designed to allow you to grasp more intermediate concepts in string processing, laying the foundation for processing words and short pieces of texts to do basic morphological analysis, clean up data, break sentences into words, as well as create formatted output as the most elementary form of visualising language data. In Chapter 5, you'll then learn how to work with longer pieces of data, stored in the form of text files, for handling and saving results, including a discussion of how to handle the folder structure on your computer efficiently and in a platform‐independent manner.

The next chapter will introduce you to regular expressions, a powerful way of recognising simple to highly complex linguistic patterns, and processing them. This knowledge will enable you to perform tasks that are especially relevant to advanced language processing, and go way beyond the options provided by Python's basic string processing methods, such as searching through one or more files in order to extract and display information based on more or less complex patterns you'll learn to specify.
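
To give a first impression of what such a pattern looks like, here is a small sketch using Python's re module. The sample sentence and the pattern are assumptions made for this illustration; the pattern is deliberately naive and would over‐generate on real data (it would, for instance, also split this into thi + s).

```python
import re

text = "She walked home, talked to a friend, and then walks again."

# \b marks a word boundary, (\w+?) captures a candidate stem as
# lazily as possible, and (ed|s) captures one of two suffixes.
pattern = re.compile(r"\b(\w+?)(ed|s)\b")

for match in pattern.finditer(text):
    # group(0) is the whole match; groups 1 and 2 are stem and suffix.
    print(match.group(0), "->", match.group(1), "+", match.group(2))
```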

In Chapter 7, we'll move on from learning about the essential concepts towards applying these in developing our own applications, even if these may initially still be relatively simplistic. We'll start this part of the book by discussing the essentials of modularity and object orientation, thereby providing a foundation for writing more efficient programs and reusable components for increasingly complex and repetitive programming tasks. Here, for instance, we'll learn how to design user‐defined functions that allow us to handle simple lexica for performing (equally basic) word‐class annotation tasks, or how to set up our own object to model the behaviour of specific types of words. Chapter 8 will then turn to creating word and frequency lists, and developing an understanding of different sorting options. This will allow us to create useful objects to quantify and identify linguistic phenomena in various ways, as well as to display them in ways that are appropriate for different analysis tasks.

In Chapter 9, I'll introduce you to creating graphical user interfaces (GUIs) as a means to facilitate handling and interacting with data. While this may seem like something you don't really need for analysing language data, in my experience it is invaluable in providing yourself – as well as any potential users of your programs – with ways of interactively outputting and exploring data in forms that are often not possible on the command line, especially if you're dealing with different languages or older forms of language. By necessity, though, we'll have to restrict our endeavours here to producing relatively simple GUIs, but hopefully you'll be able to use the information provided here to develop your understanding further independently, so that you'll later be able to create more complex ones that fit your exact analysis needs or those of any projects to which you may be contributing.

In Chapter 10, we'll learn how to download and handle web data, and produce – as well as work with – annotations. As more and more data these days originates from the web, and many levels of language analysis require some form of interpretive coding, these two areas also represent very important aspects of programming for language analysis purposes. The final section also includes an introduction to the annotation scheme of the Text Encoding Initiative (TEI), a scheme commonly used for corpora and other texts in DH.

The final main chapter will introduce some basic concepts in creating visualisations, such as producing frequency plots using the matplotlib library or generating word clouds.

1.3 A Brief Note on the Exercises

Other programming books may provide you with the necessary theory, walk you through code/coding examples step by step, and then give you some more advanced exercises that essentially send you off on your own to explore things further, but then never offer any solutions. In my experience, such an approach is less effective because it runs the risk that you may simply end up doing simpler exercises mechanically, or end up learning only half of what may be relevant because the main exercises are too limited.

My approach in this book is rather different from this – perhaps more academic – because I generally start by introducing the most essential aspects of the programming constructs covered, but then ask you to apply these concepts immediately to particular questions or projects in processing textual data, introducing additional details inside the exercises as and when necessary or relevant. This way, you'll not only be forced to apply the concepts, but also to think about how this can best be achieved in solving language‐related issues. The more we progress through the book, the more complex these mini‐projects may get, and they will frequently also be designed to build upon many concepts covered in prior sections, so, in a sense, they also serve as a kind of repetition for you.

However, because the exercises may get rather complex, I will also provide detailed discussions of them at the end of each chapter. Here, I'll not only show you the code that I consider most appropriate, based on your current level of knowledge, or perhaps even some possible alternatives, but will also explain important issues pertaining to these solutions. In addition, I'll discuss potential pitfalls or any error messages you may encounter, especially in the earlier chapters. Hence, even if you may be able to complete the exercises without any help, you should probably still read through the discussion each time you've completed an exercise to learn about these additional aspects before continuing with the main text.

All programs we produce as part of these exercises are listed in complete form in the Appendix, and will also be available, along with any other resources, from the book's companion website at http://www.wiley.com/go/weisser/pythonprogling. To challenge you a little more, I'll frequently also provide suggestions as to how you can develop the programs we devise together into more advanced little projects that you can carry out on your own in order to develop your programming skills further independently.

1.4 Conventions Used in this Book

In this book, I'll use the general conventions for representing different types of information for linguistics purposes, as well as a few other ones designed to make it easier for you to distinguish between the descriptive text and the coding constructs presented. Language samples or passages used as examples are represented in italics. To distinguish between different linguistic levels of description, if necessary, I use the appropriate bracketing, e.g. curly brackets ({…}) for morphology and angle brackets (<…>) for graphemes. In Chapter 10, however, the latter will generally represent parts of the syntax of the markup languages HTML and XML.

Key terminology will be highlighted like this, so you can identify it more easily, and expressions that deviate slightly from the standard meaning will appear in scare quotes (‘…’). To facilitate distinguishing between descriptive text and programming constructs, I will use this font, with variable elements in the code, especially in syntax descriptions, being marked through italics. Syntax summaries are further distinguished by being set in a box.

To be able to make coding examples stand out even more clearly, in many cases, I'll write these on a separate line, even if they form part of a longer sentence. In such cases, I'll frequently also omit some punctuation marks, such as commas or full stops, so that these don't appear to be part of the program code.

1.5 Installing Python

Installing Python on your computer should – on the whole – not represent a big problem because installation packages for the different platforms can conveniently be downloaded from https://www.python.org, and the installation itself presents no major obstacles if you observe a few simple points, of course provided that you have administrative rights on the computer you're using. If you're using a shared computer and have no administrative rights yourself, then you'll need to consult your administrator.

Python is frequently already preinstalled on Linux and macOS, but unfortunately often only version 2, which also tends to be required by the operating system (OS), and is therefore non‐replaceable! In these cases, the solution is to carry out a parallel installation of version 3 alongside version 2 because we'll be using Python 3 for this book. If you're running such a parallel installation of Python 3, you'll also need to set up the so‐called shebang line (explained in Section 2.5) correctly, so that your OS will know which version of Python to use for running your programs. In the following sections, I'll describe the installation process for the different OSes covered here one at a time. At the time of writing, Python 3.11, which is supposedly much faster than previous versions, had already become available. However, not all of the Python modules used in later parts of the book were available for this version, so I'd currently still recommend installing Python 3.9 at most, which I've tested with all modules.

1.5.1 Installing on Windows

From Python 3.8 onwards, Python will be set up in your user directory by default, e.g. ‘C:\Users\username\AppData\Local\Programs\Python\Pythonnumber’, where username is your own username and number the version number without the dot, i.e. 38 for version 3.8, and 39 for 3.9. As an administrator on your computer, you can also switch this to installing it for all users by checking the customisation option shown towards the bottom of Figure 1.2, in which case it would normally be installed into the folder ‘C:\Program Files\Python39’.

Figure 1.2 Python installer running on Windows.
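The naming pattern for the default per‐user folder described above can be spelled out in a few lines of Python. The following is only a minimal sketch: the function name default_user_install_dir and the username martin are made up for illustration, and it is of course the installer itself that decides where Python actually ends up.

```python
from pathlib import PureWindowsPath


def default_user_install_dir(username, major, minor):
    """Build the default per-user folder described above (hypothetical helper)."""
    # The folder name is 'Python' plus the version number without the dot.
    folder_name = f"Python{major}{minor}"
    return (PureWindowsPath("C:/Users") / username / "AppData" / "Local"
            / "Programs" / "Python" / folder_name)


# Prints C:\Users\martin\AppData\Local\Programs\Python\Python39
print(default_user_install_dir("martin", 3, 9))
```

PureWindowsPath is used here so that the example produces Windows‐style backslashes regardless of the OS it is run on.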

You should also ensure that the box for “Add Python X to PATH” – where X stands for the version number – is checked in order for Windows to be able to find the Python interpreter, the program that converts your Python instructions into executable code, and allow you to launch Python programs by double‐clicking from Windows Explorer.

1.5.2 Installing on the Mac

On the Mac, you just need to follow the basic instructions shown in Figure 1.3.

Figure 1.3 Python installer running on macOS.

When installing Python on macOS, there is no issue associating your files with an interpreter or setting the path because macOS and Linux handle the execution of programs differently from Windows, through the shebang line, through which you tell the OS which interpreter to use.

1.5.3 Installing on Linux

To install Python 3 on a Linux system, you should use whatever package manager is appropriate. However, as Linux distributions differ rather strongly from one another, I cannot describe the installation process in any detail here. As on the Mac, Linux uses the shebang line to ensure that the right version of Python will execute your programs later.
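As a preview, a program file that starts with a shebang line might look like the following minimal sketch. The form #!/usr/bin/env python3 is one commonly used variant and is an assumption made here; Section 2.5 explains the shebang line properly. The function name uses_python3 is likewise only illustrative.

```python
#!/usr/bin/env python3
# The line above is the shebang line: on macOS and Linux it tells the OS
# which interpreter to use. Python itself treats it as an ordinary comment.
import sys


def uses_python3():
    """Report whether the interpreter running this file is Python 3."""
    return sys.version_info.major == 3


print("Running under Python 3:", uses_python3())
```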

Exercise 1 Installing Python

Go to https://www.python.org.

Find the most up‐to‐date Python 3 version for your OS. Note: If you're using Linux, you won't find an installer on this website, but you should use your package manager for locating one instead.

Download and install Python, making sure that you select the option for adding Python to your path if you're on Windows, and installing certificates on the Mac after completion!

Now that you should have a copy of Python 3 on your computer, we can verify that the installation process ran correctly, and then start investigating how to use it. Before we can do so, though, we first need to explore how it is possible to issue the right commands to your computer in the form of text you type in, which may well be something that you're still unused to.

1.6 Introduction to the Command Line/Console/Terminal

Most computer users these days are probably more familiar with interacting with their OSes through windows‐based systems, i.e. so‐called Graphical User Interfaces (GUIs – /guːiz/ – for short). However, before such GUIs became prevalent in computing, it was customary to interact with the OS by typing in commands at what is referred to as the command line on Windows, as the console or terminal on Linux, and Terminal on Mac. For the sake of simplicity, from now on, I'll refer to this as the command line.

The command line allows users to input text‐based commands via the command prompt, which is generally signalled via a flashing cursor, and will initially be your only way of running Python or any simpler programs written in Python that don't have a GUI themselves. We'll later learn another, slightly more comfortable, way of starting and testing your programs through WingIDE Personal, the program that I'm recommending you use for writing your Python code. In addition, working with the command line will allow us to learn about some important concepts related to handling files and folders on your computer, which will form an important part of your programming once you start working with stored data from Chapter 5 onwards.

In order to issue commands to all three OSes, you type their name, plus any potential arguments, i.e. other required information such as filenames, etc., and then press Enter to trigger the command. In the next two sections, I'll describe how to access the command line, first for Windows, then for Mac and Linux.
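How a typed command breaks down into a program name and its arguments can be illustrated with Python's standard shlex module. This is only a rough sketch: real command lines are interpreted by the OS, whose rules differ somewhat between Windows and Mac/Linux, and the function name split_command as well as the filenames my_program.py and sample.txt are invented for the example.

```python
import shlex


def split_command(command_line):
    """Separate a typed command into the program name and its arguments."""
    parts = shlex.split(command_line)
    return parts[0], parts[1:]


name, arguments = split_command("python my_program.py sample.txt")
print("name:", name)
print("arguments:", arguments)
```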

1.6.1 Activating the Command Line on Windows

To activate the command line on Windows, there are multiple options. Perhaps the simplest one for most users initially is to press the Windows key or click on the Start button, type cmd next to the magnifying glass symbol, and click on Command prompt in the dialogue box shown in Figure 1.4.

Figure 1.4 Activating the command prompt via the Windows Start menu.

Depending on how many programs or files Windows finds that start with the letter c, this option may already be presented to you even if you only press c or cm. As you can see, there are multiple actions available for the command prompt on the right‐hand side of the start menu, other than just clicking to open it. To simplify opening the command prompt, you could for instance pin the icon to your taskbar if it isn't already too crowded, and then have it available with one single mouse click. Another, more important, option you may need later if you've installed Python as an administrator, is that you can also run the command prompt in that capacity, which will then allow you to install additional packages for all users.

Another quick way to access the command prompt is to press the Windows key + r, then type cmd in the ‘Run’ dialogue that will open, and press the Enter key (↲) or click on ‘OK’. If you hold down the ‘Shift’ key (⇧) and the ‘Ctrl’ key while pressing ‘Enter’, you can also open the command prompt as an administrator.

If you're already looking at a folder that contains your programs in Windows Explorer, it's even more convenient to type cmd in the Explorer address bar and press the Enter key. This will open up a command prompt directly at the folder location, so you won't actually need to navigate there once the command line has been opened, which we'll nevertheless soon practise.

1.6.2 Activating the Command Line on the Mac or Linux

To activate the command line on the Mac or Linux, you need to start Launchpad (Mac) or click the start button (Linux), search for Terminal (or a similar name), then execute Terminal. On the Mac, you can also add the Terminal to the dock for convenience. This is usually also the case for Linux panels, but may depend on your exact Linux version.

Exercise 2 Verifying That Python Is Installed

Open a command line for your OS.

Type in python -V, and press Enter. You should then see the version number of your Python installation reported at the prompt.

Now that you know how to issue commands on the command line, let's take a brief look at which types of programs you can use to write your Python programs in.

1.7 Editors and IDEs

For most writing tasks on the computer, we tend to use dedicated word processors like Microsoft Word® or LibreOffice Writer that enable us to apply appealing layouts and formatting to whatever we write. These programs, however, generally store the texts produced in them in such a way that they can more or less only be opened and edited further by whatever program was used to create them, apart from containing many formatting instructions required to generate the display or print them. For writing program code that needs to be readable by the Python interpreter, and ideally editable by different programs available on the different OSes we may want to use, word processors are therefore not useful. The kind of text we need, which contains no formatting or fancy layout apart from perhaps line breaks or indentations, is called plain text, and the programs we can use to edit them are called (plain text) editors. Examples of these would be Windows Notepad®, or TextEdit on macOS. Some of these editors even offer special support for programming languages, such as syntax highlighting for different programming or markup languages (see Section 10.1), but this support, if it exists at all, still tends to be fairly limited.

Better suited to programming tasks are so‐called Integrated Development Environments (IDEs, for short). These offer additional programming support, such as finding errors in code (debugging), advanced syntax highlighting and indentation, syntax completion, etc., sometimes for multiple programming and/or markup languages.

1.8 Installing and Setting Up WingIDE Personal

One such IDE that is optimised for Python and markup languages is the WingIDE. It exists in different versions, and is available for Windows, macOS, and Linux. The version that is of interest to us here is the Personal edition, as it's freeware, just like Python, so that you won't have to invest anything other than your time into learning how to program in Python, but can still enjoy a number of features that will greatly simplify your programming tasks. Of course, there are also other freely available IDEs, or you may already have a preferred IDE, so using WingIDE is only a recommendation I'd like to make having evaluated a number of other IDEs. In case you (or your instructor) decide to use a different IDE, you can of course skip the remainder of this chapter, and move straight on to the next one. Before you do so, though, you should at least check to see if the output encoding for your chosen IDE has been set to UTF‐8, which is usually done in the IDE's program preferences. Read on just a little to find out why this may be sensible to do.

The Personal edition of the WingIDE can be downloaded from https://wingware.com/downloads/wing-personal. The exact installation routine depends on your OS, but is generally quite straightforward, so we won't discuss it here, instead carrying it out as part of our next exercise. However, there may be a few settings to modify after the installation. The most important of these is that, at least prior to version 8, the default encoding (see Section 2.3.1) in which files are saved is automatically set to the local encoding on your computer, which is not an optimal choice. Hence, you'll minimally want to change this, setting it to UTF‐8 via ‘Edit → Preferences → Files → Default Encoding’ in order to be able to use non‐English characters properly in your code as well. It's also possible to customise many of the display options, such as setting a larger font, changing the editor background to make it easier on your eyes, displaying line numbers, or even setting a different display language, but these are largely questions of preference, so, again, we won't discuss them here.
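Why the encoding matters can be seen in a minimal sketch that stores and re‐reads a word containing non‐English characters. The word Grüße and the choice of cp1252 as a stand‐in for a ‘local’ encoding are assumptions made purely for illustration; the local encoding on your own computer may well be a different one.

```python
text = "Grüße"

# Encoding turns the characters into the bytes that are stored in a file.
utf8_bytes = text.encode("utf-8")

# Reading the bytes back with the same encoding restores the original text.
restored = utf8_bytes.decode("utf-8")

# Reading them with a different encoding produces garbled characters.
garbled = utf8_bytes.decode("cp1252")

print(len(text), "characters stored as", len(utf8_bytes), "bytes in UTF-8")
print("Same encoding restores the text:", restored == text)
print("Different encoding restores the text:", garbled == text)
```

The five characters occupy seven bytes in UTF‐8 because ü and ß each need two bytes, which is exactly where a mismatched encoding goes wrong when reading the file back.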

Exercise 3 Installing and Getting to Know WingIDE Personal

Download the version of WingIDE Personal appropriate for your OS and install it.

Change the settings for the encoding if necessary.

Familiarise yourself a little with the features and functions of the IDE by looking through the menus, etc.

As we've now covered all the preliminaries, in the next chapter, we can finally begin to learn about some of the essential concepts required to allow you to begin programming.

1.9 Discussions

Discussion 1 Installing Python

Provided that you've followed my instructions carefully, there are only a few things that could have gone wrong during the installation, unless of course you're not authorised to install any software on your computer at all, in which case you'll need to ask your administrator to set up Python for you. At the time of completing this book, the most recent Python version was 3.11, but you may need to install a version lower than 3.9 if you should still be running Windows 7, which I wouldn't recommend, anyway.

On Windows, should you have forgotten to tick the box to add Python to your path, it will get a little complicated to ensure that Windows finds the Python interpreter and that your programs will also run if you double‐click on them in Explorer. To do so, you need to go into the Windows settings, most easily accomplished by pressing the Windows key + i, typing path into the search box (see Figure 1.5), and selecting either ‘Edit the environment variables for your account’ (for non‐administrators) or ‘Edit the system environment variables’ (for administrators). As a non‐administrator, you can then select the Path option in the box at the top, click on ‘Edit…’, and add the path to your Python installation to the end of the path. As an administrator, there's one intermediate dialogue, where you need to click on ‘Environment Variables…’ first, and then follow the same steps as just described above.

Figure 1.5 Finding the Path settings.
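Once Python itself runs, the path can also be inspected from within Python, which may help to confirm that the change has taken effect. The following is a minimal sketch under that assumption; split_path is a hypothetical helper name, and the program name python passed to shutil.which may need to be python3 on Mac or Linux.

```python
import os
import shutil


def split_path(path_value, separator=os.pathsep):
    """Split a PATH-style string into its folders, dropping empty entries."""
    return [entry for entry in path_value.split(separator) if entry]


# List every folder the OS searches when you type a program name.
for folder in split_path(os.environ.get("PATH", "")):
    print(folder)

# shutil.which() reports where a program would be found, or None if it isn't.
print("python found at:", shutil.which("python"))
```

If the last line reports None, the folder containing the Python interpreter is not part of the path that the current command line is using.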

Installing the certificates on the Mac, if you've omitted that step, may unfortunately prove necessary in order to be able to download files from the internet as part of the exercises in Chapter 10, so if you should have forgotten to do so, please install them asap.

Discussion 2 Verifying That Python Is Installed

Provided that your installation was successful, using the command python -V should output the Python version number, e.g. Python 3.9.7 on my computer, which indicates that I'm running version 3, with minor version 9, sub‐version 7.
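The same three parts of the version number are also available from inside Python through sys.version_info. The following is a minimal sketch, with describe_version being a hypothetical helper name chosen for this illustration.

```python
import sys


def describe_version(version=sys.version_info):
    """Spell out the three parts of a Python version number."""
    major, minor, micro = version[:3]
    return f"version {major}, minor version {minor}, sub-version {micro}"


# Describes whichever interpreter is running this code.
print(describe_version())

# Prints: version 3, minor version 9, sub-version 7
print(describe_version((3, 9, 7)))
```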

Should you inadvertently have typed the wrong program name, perhaps pythion, you'll get an error message from the OS, indicating that the program name is not recognised. If you type a different capital letter after python, Python will display some usage information, but in case you've typed a small v or forgotten the argument completely, you'll end up with a different prompt that starts with >>>. This means that you've started the interactive Python interpreter, called the Python Shell, where you can actually already type in the Python commands we'll learn about later, and test different Python constructs. To close this interpreter and return to the OS prompt, simply type in exit() and press Enter.

Discussion 3 Installing and Getting to Know WingIDE Personal

Downloading, installing, and editing the encoding settings, if required, should be relatively straightforward, provided you download the right installer, follow all the instructions, and of course, have the necessary permissions to install programs on your computer.

When you open WingIDE Personal to explore, you should see that there are multiple sections or panels that offer different types of functionality over and above simply being able to create and edit program code. Initially, the most important part of the interface will be the editor window itself on the top left‐hand side below the menu and toolbars, although we'll later also make use of other components of the IDE window. This window can actually be split, so that you can view multiple files side‐by‐side in order to compare them or copy and paste from one to the other, or also view different sections of a longer program.

On the right‐hand side, spanning all the way from top to bottom of the program window, you'll see a window with a few ‘utility’ tabs that allow you

to manage a project (‘Project’) – something we won't discuss here;

get help on specific Python programming constructs (‘Source Assistant’);

explore or jump to different sub‐parts of your program (‘Source Browser’);

and potentially manage indentation issues (‘Indentation’).

The ‘Indentation’ tab, however, is something you'll probably only need to use if you work with code that may have been created in other editors or by other people.

The bottom left‐hand side is split into two panes, each containing multiple tabs, with the left‐hand pane containing tabs for searching (and replacing, if activated) in the currently active file or a number of files at the same time, as well as the ‘Stack Data’ tab, which is used for advanced debugging purposes we won't discuss in this book. In the right‐hand pane, the two tabs we'll discuss and use later on are ‘Debug I/O’ and ‘Python Shell’, whereas we won't cover the other two again.

The menu bar at the very top, as well as the toolbar below it, contain a number of familiar entries or buttons that essentially exist in most GUI programs, but also a few items that you'll probably still be unfamiliar with, and which relate to various aspects of handling the programming code. If you haven't done so already, I'd suggest that you at the very least try to read through the menus to see which entries you understand and may be useful to you in handling code, and also possibly which keyboard shortcuts you may want to use to increase your efficiency. Of course, there'll be quite a few things that won't make sense to you yet, but you can always try to understand them later, once you've made some progress in your programming career.