Learn how to use Python for linguistics and digital humanities research, perfect for students working with Python for the first time
Python programming is no longer only for computer science students; it is now an essential skill in linguistics, the digital humanities (DH), and social science programs that involve text analytics. Python Programming for Linguistics and Digital Humanities provides a comprehensive introduction to this widely used programming language, offering guidance on using Python to perform various processing and analysis techniques on text. Assuming no prior knowledge of programming, this student-friendly guide covers essential topics and concepts such as installing Python, using the command line, working with strings, writing modular code, designing a simple graphical user interface (GUI), annotating language data in XML and TEI, creating basic visualizations, and more.
This invaluable text explains the basic tools students will need to perform their own research projects and tackle various data analysis problems. Throughout the book, hands-on exercises provide students with the opportunity to apply concepts to particular questions or projects in processing textual data and solving language-related issues. Each chapter concludes with a detailed discussion of the code applied, possible alternatives, and potential pitfalls or error messages.
Python Programming for Linguistics and Digital Humanities: Applications for Text-Focused Fields is a must-have resource for students pursuing text-based research in the humanities, the social sciences, and all subfields of linguistics, particularly computational linguistics and corpus linguistics.
Number of pages: 571
Year of publication: 2023
Cover
Table of Contents
Title Page
Copyright Page
Dedication Page
List of Figures
About the Companion Website
1 Introduction
1.1 Why Program? Why Python?
1.2 Course Overview and Aims
1.3 A Brief Note on the Exercises
1.4 Conventions Used in this Book
1.5 Installing Python
1.6 Introduction to the Command Line/Console/Terminal
1.7 Editors and IDEs
1.8 Installing and Setting Up WingIDE Personal
1.9 Discussions
2 Programming Basics I
2.1 Statements, Functions, and Variables
2.2 Data Types – Overview
2.3 Simple Data Types
2.4 Operators – Overview
2.5 Creating Scripts/Programs
2.6 Commenting Your Code
2.7 Discussions
3 Programming Basics II
3.1 Compound Data Types
3.2 Lists
3.3 Simple Interaction with Programs and Users
3.4 Problem Solving and Damage Control
3.5 Control Structures
4 Intermediate String Processing
4.1 Understanding Strings
4.2 Cleaning Up Strings
4.3 Working with Sequences
4.4 More on Tuples
4.5 ‘Concatenating’ Strings More Efficiently
4.6 Formatting Output
4.7 Handling Case
4.8 Discussions
5 Working with Stored Data
5.1 Understanding and Navigating File Systems
5.2 Stored Data
5.3 Opening and Closing Files
5.4 Reading File Contents
5.5 Error Handling
5.6 Writing to Files
5.7 Working with Folders and Paths
5.8 Discussions
6 Recognising and Working with Language Patterns
6.1 The re Module
6.2 General Syntax
6.3 Understanding and Working with the Match Object
6.4 Character Classes
6.5 Quantification
6.6 Masking and Using Special Characters
6.7 Regex Error Handling
6.8 Anchors, Groups and Alternation
6.9 Constraining Results Further
6.10 Compilation Flags
6.11 Discussions
7 Developing Modular Programs
7.1 Modularity
7.2 Dictionaries
7.3 User‐defined Functions
7.4 Understanding Modules
7.5 Documenting Your Module
7.6 Installing External Modules
7.7 Classes and Objects
7.8 Testing Modules
7.9 Discussions
8 Word Lists, Frequencies and Ordering
8.1 Introduction to Word and Frequency Lists
8.2 Generating Word Lists
8.3 Sorting Basics
8.4 Generating Basic Word Frequency Lists
8.5 Lambda Functions
8.6 Discussions
9 Interacting with Data and Users Through GUIs
9.1 Graphical User Interfaces
9.2 PyQt Basics
9.3 Designing More Advanced GUIs
9.4 Discussions
10 Web Data and Annotations
10.1 Markup Languages
10.2 Brief Intro to HTML
10.3 Using the urllib.request Module
10.4 Extracting Text from Web Pages
10.5 List and Dictionary Comprehension
10.6 Brief Intro to XML
10.7 Complex Regex Replacements Using Functions
10.8 Brief Intro to the TEI Scheme
10.9 Discussions
11 Basic Visualisation
11.1 Using Matplotlib for Basic Visualisation
11.2 Creating Word Clouds
11.3 Filtering Frequency Data Through Stop‐Words
11.4 Working with Relative Frequencies
11.5 Comparing Frequency Data Visually
11.6 Discussions
12 Conclusion
Appendix – Program Code
Index
End User License Agreement
Chapter 2
Table 2.1 Most useful data types.
Table 2.2 Some useful string methods.
Table 2.3 Character positions in ASCII and Latin 1.
Table 2.4 Important functions for working with numbers.
Table 2.5 String operators.
Table 2.6 Mathematical operators.
Table 2.7 Logical operators.
Chapter 3
Table 3.1 List of compound data types.
Table 3.2 Useful list methods.
Chapter 4
Table 4.1 More string methods.
Table 4.2 Index positions for slices.
Table 4.3 Case handling methods.
Chapter 5
Table 5.1 Common error types.
Chapter 6
Table 6.1 Regex methods and functions.
Table 6.2 Methods of the re match object.
Chapter 7
Table 7.1 Useful dictionary methods.
Chapter 9
Table 9.1 Some useful widgets.
Table 9.2 PyQT layout options.
Chapter 1
Figure 1.1 Sample text analysis in the Voyant Tools.
Figure 1.2 Python installer running on Windows.
Figure 1.3 Python installer running on macOS.
Figure 1.4 Activating the command prompt via the Windows Start menu.
Figure 1.5 Finding the Path settings.
Chapter 3
Figure 3.1 The Debug Environment dialogue in the WingIDE.
Chapter 5
Figure 5.1 File hierarchy for a Windows drive.
Figure 5.2 Folder content display on Windows.
Figure 5.3 Folder content display on macOS.
Figure 5.4 Folder listings on Windows and Ubuntu Linux.
Chapter 9
Figure 9.1 A minimal GUI program.
Figure 9.2 File menu of the Widget Demo program.
Figure 9.3 The frequency list GUI.
Figure 9.4 Layout for GUI inversion.
Chapter 10
Figure 10.1 Sample HTML page.
Figure 10.2 The Downloader GUI.
Figure 10.3 Abridged sample XML document.
Figure 10.4 TEI header for the document to be produced in Exercise 63.
Figure 10.5 Beginning of the text body for the TEI version of Frankenstein.
Chapter 11
Figure 11.1 Illustration of scatter, plot, and bar methods in Matplotlib.
Figure 11.2 Absolute versus relative frequencies in comparing two novels.
Figure 11.3 Frequency comparison as stacked bar chart.
Figure 11.4 Original Pandas DataFrame created from two dictionaries.
Figure 11.5 Transposed DataFrame.
Martin Weisser
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data
Names: Weisser, Martin, author.
Title: Python programming for linguistics and digital humanities : applications for text‐focused fields / Martin Weisser.
Description: Hoboken, New Jersey : Wiley‐Blackwell, 2023. | Includes index.
Identifiers: LCCN 2023025982 (print) | LCCN 2023025983 (ebook) | ISBN 9781119907947 (paperback) | ISBN 9781119907954 (adobe pdf) | ISBN 9781119907961 (epub)
Subjects: LCSH: Python (Computer program language) | Computer programming. | Computational linguistics.
Classification: LCC QA76.73.P98 W45 2023 (print) | LCC QA76.73.P98 (ebook) | DDC 005.13/3‐‐dc23/eng/20230612
LC record available at https://lccn.loc.gov/2023025982
LC ebook record available at https://lccn.loc.gov/2023025983
Cover Design: Wiley
Cover Images: © ersin ergin/Shutterstock; © 2023 Martin Weisser; “Python” and Python logos are trademarks or registered trademarks of the Python Software Foundation and are used with permission
To Ye, without whose constant support over the years writing books like this would not have been possible
This book is accompanied by a companion website.
https://www.wiley.com/go/weisser/pythonprogling
This website includes:
Text
Codes
1 Introduction
This book is designed to provide you with an overview of the most important basic concepts in Python programming for Linguistics and text‐focussed Digital Humanities (henceforth DH) research. To this end, we'll look at many practical examples of language analysis, starting with very simple concepts and simplistic programs, gradually working our way towards more complex, ‘applied’, and hopefully useful projects. I'll assume no extensive prior knowledge about computers other than that you'll know how to perform basic tasks like starting the computer and running programs, as well as some slight familiarity with file management, so no in‐depth knowledge in mathematics or computer science is required. All necessary concepts will be introduced gently and step‐by‐step.
Before we go into discussing the structure and content of the book, though, it's probably advisable to spend a few minutes thinking about why, as someone presumably more interested in the Arts and Humanities than technical sciences, you should actually want to learn how to write programs in Python.
1.1 Why Program? Why Python?
Nowadays, more and more of the research we carry out in the primarily language‐ or text‐oriented disciplines involves working with electronic texts. And although many tools exist for analysing such documents, these are often limited in their functionality because they may either have been produced for very specific purposes, or designed to be as generic as possible, so that they may be applied to as great a variety of tasks as possible. In both cases, these tools will have been created bearing in mind only the functionality that their creators envisaged as being necessary, and they generally don't offer many options for customising them towards one's own needs. In addition, while the results they produce might be suitable for carrying out the kind of distant reading often propagated in DH, without any in‐depth knowledge of how these programs have arrived at the snapshots or summaries of the data they have produced – as well as which potential errors may have been introduced in the process – one is never completely in control of the underlying data and their potentially idiosyncratic characteristics. To illustrate this point, let's take a look at the analysis output of a popular DH tool, the Voyant Tools (https://voyant-tools.org), displayed in Figure 1.1.
Figure 1.1 Sample text analysis in the Voyant Tools.
The text in Figure 1.1 is part of the German Text Archive (Deutsches Textarchiv; DTA), which provides direct links to the Voyant Tools as a convenient way to visualise prominent features of a text, such as the most frequent ‘words’ and their distribution within the text. For our present purposes, it is actually irrelevant that the language is German because you don't need to be able to understand the text itself at all, but merely observe that the tool ‘believes’ that the most prominent words therein are a, b, c, x, and 1. This can be seen in the word cloud on the top left‐hand side, the summary below it, and the distributional graph on the top right‐hand side. Now, of course, most of us would not see these most frequent items as words at all, but rather as letters and a number, none of which conveys much information about the content of the text. Yet conveying such information is usually what the most frequent words should do, at least to some extent, as we'll see in Chapter 8 when we learn to create our own frequency lists, and then develop them further to fit our needs in later chapters. The reason for these items occurring so frequently in the different visualisations in Figure 1.1 is that the text is actually about mathematics, and hence comprises many equations and other paradigms that contain these letters, which, as pointed out before, have relatively little meaning in and of themselves outside these particular contexts. To be able to capture the ‘aboutness’ of the text itself in a form of distant reading, we'd need to remove these particular high‐frequency items, so that the actual content words in the text might become visible. However, at first glance the Voyant Tools don't seem to allow us to do so, and hence appear to be designed around a rather naïve notion of what constitutes a word and how it becomes relevant in a context.
Only if you hover over the question marks in the interface do you actually see that there are indeed options provided for setting the necessary filters. In addition, if you look at the distributional graph on the top right‐hand side, you may note that the frequencies are plotted against “Document Segments”, but we really have no indication as to what these segments may be. It rather looks like the document may simply have been split into 10 equally sized parts from which the frequencies have been extracted, but such equally sized parts don't actually constitute meaningful segments of the text, as chapters or sections would. Furthermore, the concordance – i.e. the display of the individual occurrences in a limited context – for the “Term” a displayed on the bottom right‐hand side is misleading. The first four lines in fact don't represent instances of the mathematical variable a that accounts for the majority of instances of this ‘word’, but instead constitute the initial A., which appears to have been downcased automatically by the tool. Downcasing is fairly common practice in language analysis because it allows sentence‐initial and sentence‐internal forms to be counted together, but here it clearly produces misleading results because this particular type of abbreviation is not treated differently from other word forms.
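To make the filtering problem a little more concrete, here is a minimal sketch of a frequency list that leaves out single letters and digits. The function name `frequency_list`, its parameters, and the sample sentence are invented for illustration, and the regular expression embodies only one deliberately simple notion of a 'word'; frequency lists proper are developed in Chapter 8.

```python
import re
from collections import Counter

def frequency_list(text, min_length=2, lowercase=True):
    """Count word forms, skipping tokens shorter than min_length."""
    if lowercase:
        # Downcasing merges sentence-initial and sentence-internal forms,
        # but also turns the initial 'A.' into the variable 'a'.
        text = text.lower()
    # Simplistic tokenisation: runs of letters only, so digits are ignored.
    tokens = re.findall(r"[^\W\d_]+", text)
    return Counter(token for token in tokens if len(token) >= min_length)

# Invented sample mixing ordinary words, variables, and an initial.
sample = "Let a = b + c. A. Smith adds x to a, so that a + x = 1."
freqs = frequency_list(sample)
print(freqs.most_common(3))

# Without the length filter, the variable 'a' and the initial 'A.' are conflated.
print(frequency_list(sample, min_length=1)["a"])
```

Setting `lowercase=False` keeps the initial A apart from the variable a, which is exactly the kind of decision a ready-made tool tends to take for us.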
This example will already have demonstrated to you how important it is to be in control of the data we want to analyse, and that we cannot always rely on programs – or program modules (see Section 7.4) – that others have written. Yet another reason for writing our own programs, though, is that, even if some programs might allow us to do part of the work, they may not do everything we need them to do, so that we end up working with multiple programs that could even produce different output formats that we'd then need to convert into a different, suitable, form before being able to feed data from one program into the next. Moreover, apart from being rather cumbersome and tedious, such a convoluted process may also be highly error prone.
In terms of what we might want to achieve through writing our own programs, there are a few things that you may already have observed in the above example, but in order to make such potential objectives a little clearer and expand on them, let's frame them as a series of “How can we …”‐questions:
… generate customised word frequency lists or graphs thereof to facilitate topic identification/distant reading?
… gather document/corpus statistics for syllables, words, sentences, or paragraphs, and output them in a suitable format?
… identify (proto‐)typical meanings, uses, and collocations of words?
… extract or manipulate parts of texts to create psycholinguistic experiments, or for teaching purposes?
… convert simple documents into annotated formats that allow specific types of analysis?
… create graphical user interfaces (GUIs) to edit or otherwise interact with our data?
We certainly won't be able to answer all these questions fully in this book, but we'll at least work towards developing a means of achieving partial solutions to them.
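As a first taste of the second question in the list above, the following sketch gathers a few rough document statistics. The function name `text_statistics`, the splitting rules, and the sample text are assumptions made for illustration; real sentence and word segmentation is considerably harder than these patterns suggest.

```python
import re

def text_statistics(text):
    """Return rough counts of paragraphs, sentences, and words."""
    # Paragraphs: blocks separated by at least one empty line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: split after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: word characters, optionally joined by apostrophes or hyphens.
    words = re.findall(r"\w+(?:['’-]\w+)*", text)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }

sample = "It was a dark night. Nobody spoke!\n\nWas anyone there? We'll never know."
stats = text_statistics(sample)
for unit, count in stats.items():
    print(f"{unit}\t{count}")
```

Abbreviations such as the initial A. discussed earlier would be miscounted as sentence ends by this pattern, which again shows why it pays to know how such numbers are produced.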
Having discussed why we should write our own programs at all, let's now think briefly about why Python may be the right choice for this task. First of all, despite the fact that Python has already been around for more than 30 years at the time of writing this book, it is a very modern programming language that implements a number of different programming paradigms – i.e. different approaches to writing programs – about which, however, we won't go into much detail here because they are beyond the scope of this book. More importantly, though, Python is relatively easy to learn, available for all common platforms, and the programs you write in it can be executed directly without prior compilation, i.e. having to create one single program from all the parts by means of another program. This makes it easier to port your programs to different operating systems and test them quickly.
In terms of the programming paradigms briefly referred to above, it is important to note that Python is object‐oriented (see Chapter 7) but can be used procedurally. In other words, although using object orientation in Python provides many important opportunities for writing efficient, robust, and reusable programs, unlike in languages like Java, it's not necessary to understand how to create an object and all the logic this entails before actually beginning to write your programs. This is another reason why the Python learning curve is less steep than that for some other popular programming languages that could otherwise be equally suitable.
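The difference between the two styles can be sketched as follows. Both versions count vowels in a word; the names `count_vowels` and `Word` are hypothetical and chosen only for this illustration, and classes and objects are introduced properly in Chapter 7.

```python
# Procedural style: a plain function that operates on a string.
def count_vowels(word):
    return sum(1 for char in word.lower() if char in "aeiou")

# Object-oriented style: the same logic bundled with the data it belongs to.
class Word:
    def __init__(self, form):
        self.form = form

    def count_vowels(self):
        return sum(1 for char in self.form.lower() if char in "aeiou")

print(count_vowels("Linguistics"))
print(Word("Linguistics").count_vowels())
```

The procedural version needs nothing beyond a function definition, which is why it is possible to start writing useful programs long before creating your own classes.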
Despite my initial cautionary note about using other people's modules, of course we don't always want to reinvent the wheel when it comes to particular tasks that someone else may already have solved in an appropriate way. Thus, as long as we can ensure that these modules in fact do what we expect them to do, there are many additional modules available for Python that may simplify specific problems, such as parsing out the content of web pages in order to extract only the parts we may require, etc.
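As an illustration of relying on an existing module, the sketch below uses `html.parser` from Python's standard library to pull out only the paragraph text from a small, invented web page. The class name `TextExtractor` and the sample HTML are assumptions for this example, and this is not necessarily the approach taken in Chapter 10.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found inside <p> elements only."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside paragraphs (e.g. headings, menus) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data

page = ("<html><body><h1>Menu</h1><p>First <b>paragraph</b>.</p>"
        "<p>Second one.</p></body></html>")
extractor = TextExtractor()
extractor.feed(page)
extractor.close()
print(extractor.paragraphs)
```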
Last, but not least, another important advantage of Python is that it is becoming increasingly popular with linguists and computational linguists, so that you may a) be able to find many suitable modules to simplify your tasks, and b) – more importantly – have many opportunities to cooperate with like‐minded researchers in your programming efforts or get advice from more experienced programmers.
1.2 Course Overview and Aims
In this section, I'll first present an overview of the book. As many of you are probably less familiar with issuing commands in text form to interact with your computer's operating system, prior to delving into our actual programming efforts, I'll first introduce the most important concepts involved in working with the computer in this way, and installing the software required for our purposes. Following this, Chapters 2 and 3 will introduce you to programming fundamentals – statements, variables, control structures, etc. – thereby enabling us to develop strategies for solving language‐related questions computationally in their most basic form. In Chapter 2, you'll also learn some of the basics of working with strings, which represent the most useful data type for our language‐related purposes.
Chapter 4 is designed to allow you to grasp more intermediate concepts in string processing, laying the foundation for processing words and short pieces of texts to do basic morphological analysis, clean up data, break sentences into words, as well as create formatted output as the most elementary form of visualising language data. In Chapter 5, you'll then learn how to work with longer pieces of data, stored in the form of text files, for handling and saving results, including a discussion of how to handle the folder structure on your computer efficiently and in a platform‐independent manner.
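To give an idea of what platform‐independent handling of folders can look like, here is a small sketch based on the standard `pathlib` module. The folder and file names are invented, and a temporary directory is used so that nothing is left behind on your computer; Chapter 5 may well use different means to the same end.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # The / operator joins path parts with the separator of the current OS.
    corpus_dir = Path(folder) / "corpus"
    corpus_dir.mkdir()

    # Save a (made-up) result line and read it back.
    result_file = corpus_dir / "results.txt"
    result_file.write_text("the\t42\n", encoding="utf-8")
    content = result_file.read_text(encoding="utf-8")

    # List all text files in the folder, regardless of platform.
    names = sorted(path.name for path in corpus_dir.glob("*.txt"))

print(names)
print(content)
```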
The next chapter will introduce you to regular expressions, a powerful way of recognising simple to highly complex linguistic patterns, and processing them. This knowledge will enable you to perform tasks that are especially relevant to advanced language processing, and go way beyond the options provided by Python's basic string processing methods, such as searching through one or more files in order to extract and display information based on more or less complex patterns you'll learn to specify.
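A very small example may indicate what such a pattern looks like. The sketch below uses the `re` module to find word forms ending in ‐ing or ‐ed; the sample sentence is invented, and the pattern is deliberately naive.

```python
import re

text = "She was walking home and talked about the king's ring."

# \b marks a word boundary; (?:ing|ed) matches either ending.
pattern = re.compile(r"\b\w+(?:ing|ed)\b")
matches = pattern.findall(text)
print(matches)
```

Apart from the two verb forms, the pattern also picks up king and ring, which shows that a purely formal pattern overgenerates and usually needs to be constrained further.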
In Chapter 7, we'll move on from learning about the essential concepts towards applying these in developing our own applications, even if these may initially still be relatively simplistic. We'll start this part of the book by discussing the essentials of modularity and object orientation, thereby providing a foundation for writing more efficient programs and reusable components for increasingly complex and repetitive programming tasks. Here, for instance, we'll learn how to design user‐defined functions that allow us to handle simple lexica for performing (equally basic) word‐class annotation tasks, or how to set up our own object to model the behaviour of specific types of words. Chapter 8 will then turn to creating word and frequency lists, and developing an understanding of different sorting options. This will allow us to create useful objects to quantify and identify linguistic phenomena in various ways, as well as to display them in ways that are appropriate for different analysis tasks.
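The idea of a user‐defined function that consults a simple lexicon can be sketched like this. The lexicon entries, the tag labels, and the function name `tag_words` are all hypothetical and far simpler than anything needed for real word‐class annotation.

```python
# Toy lexicon mapping word forms to word-class tags (labels are assumptions).
LEXICON = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}

def tag_words(sentence, lexicon=LEXICON, unknown="UNK"):
    """Pair each whitespace-separated word with its tag from the lexicon."""
    return [(word, lexicon.get(word.lower(), unknown))
            for word in sentence.split()]

tagged = tag_words("The cat sleeps soundly")
for word, tag in tagged:
    print(f"{word}_{tag}")
```

Words missing from the lexicon, such as soundly here, receive the placeholder tag, so gaps in the lexicon remain visible in the output.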
In Chapter 9, I'll introduce you to creating graphical user interfaces (GUIs) as a means to facilitate handling and interacting with data. While this may seem like something you don't really need for analysing language data, in my experience it is invaluable in providing yourself – as well as any potential users of your programs – with ways of interactively outputting and exploring data in forms that are often not possible on the command line, especially if you're dealing with different languages or older forms of language. By necessity, though, we'll have to restrict our endeavours here to producing relatively simple GUIs, but hopefully you'll be able to use the information provided here to develop your understanding further independently, so that you'll later be able to create more complex ones that fit your exact analysis needs or those of any projects to which you may be contributing.
In Chapter 10, we'll learn how to download and handle web data, and produce – as well as work with – annotations. As more and more data these days originates from the web, and many levels of language analysis require some form of interpretive coding, these two areas also represent very important aspects of programming for language analysis purposes. The final section also includes an introduction to the annotation scheme of the Text Encoding Initiative (TEI), a scheme commonly used for corpora and other texts in DH.
The final main chapter will introduce some basic concepts in creating visualisations, such as producing frequency plots using the matplotlib library or generating word clouds.
1.3 A Brief Note on the Exercises
Other programming books may provide you with the necessary theory, walk you through code/coding examples step by step, and then give you some more advanced exercises that essentially send you off on your own to explore things further, but then never offer any solutions. In my experience, such an approach is less effective because it runs the risk that you may simply end up doing simpler exercises mechanically, or end up learning only half of what may be relevant because the main exercises are too limited.
My approach in this book is rather different from this – perhaps more academic – because I generally start by introducing the most essential aspects of the programming constructs covered, but then ask you to apply these concepts immediately to particular questions or projects in processing textual data, introducing additional details inside the exercises as and when necessary or relevant. This way, you'll not only be forced to apply the concepts, but also to think about how this can best be achieved in solving language‐related issues. The more we progress through the book, the more complex these mini‐projects may get, and they will frequently also be designed to build upon many concepts covered in prior sections, so, in a sense, they also serve as a kind of repetition for you.
However, because the exercises may get rather complex, I will also provide detailed discussions of them at the end of each chapter. Here I not only show you the code that I consider most appropriate, based on your current level of knowledge, or perhaps even some possible alternatives, but will also explain important issues pertaining to these solutions. In addition, I'll discuss potential pitfalls or any error messages you may encounter, especially in the earlier chapters. Hence, even if you may be able to complete the exercises without any help, you should probably still read through the discussion each time you've completed an exercise to learn about these additional aspects before continuing with the main text.
All programs we produce as part of these exercises are listed in complete form in the Appendix, and will also be available, along with any other resources, from the book's companion website at http://www.wiley.com/go/weisser/pythonprogling. To challenge you a little more, I'll frequently also provide suggestions as to how you can develop the programs we devise together into more advanced little projects that you can carry out on your own in order to develop your programming skills further independently.
1.4 Conventions Used in this Book
In this book, I'll use the general conventions for representing different types of information for linguistics purposes, as well as a few other ones designed to make it easier for you to distinguish between the descriptive text and the coding constructs presented. Language samples or passages used as examples are represented in italics. To distinguish between different linguistic levels of description, if necessary, I use the appropriate bracketing, e.g. curly brackets ({…}) for morphology and angle brackets (<…>) for graphemes. In Chapter 10, however, the latter will generally represent parts of the syntax of the markup languages HTML and XML.
Key terminology will be highlighted like this, so you can identify it more easily, and expressions that deviate slightly from the standard meaning will appear in scare quotes (‘…’). To facilitate distinguishing between descriptive text and programming constructs, I will use this font, with variable elements in the code, especially in syntax descriptions, being marked through italics. Syntax summaries are further distinguished via a box.
To be able to make coding examples stand out even more clearly, in many cases, I'll write these on a separate line, even if they form part of a longer sentence. In such cases, I'll frequently also omit some punctuation marks, such as commas or full stops, so that these don't appear to be part of the program code.
1.5 Installing Python
Installing Python on your computer should – on the whole – not represent a big problem because installation packages for the different platforms can conveniently be downloaded from https://www.python.org, and the installation itself presents no major obstacles if you observe a few simple points, of course provided that you have administrative rights on the computer you're using. If you're using a shared computer and have no administrative rights yourself, then you'll need to consult your administrator.
Python is frequently already preinstalled on Linux and macOS, but unfortunately often only version 2, which also tends to be required by the operating system (OS), and is therefore non‐replaceable! In these cases, the solution is to carry out a parallel installation of version 3 alongside version 2 because we'll be using Python 3 for this book. If you're running such a parallel installation of Python 3, you'll also need to set up the so‐called shebang line (explained in Section 2.5) correctly, so that your OS will know which version of Python to use for running your programs. In the following sections, I'll describe the installation process for the different OSes covered here one at a time. At the time of writing, Python 3.11, which is supposedly much faster than previous versions, had already become available. However, not all of the Python modules used in later parts of the book were available for this version, so I'd currently still recommend installing Python 3.9 at most, which I've tested with all modules.
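To illustrate the point about parallel installations, here is a minimal sketch of a script that starts with a shebang line and checks which version of Python is actually running it. The form `#!/usr/bin/env python3` is a widely used convention and is assumed here for illustration only; the details are explained in Section 2.5. The helper `is_supported` is a hypothetical name.

```python
#!/usr/bin/env python3
# The first line above asks macOS or Linux to run this file with Python 3;
# Windows ignores it and relies on the file association instead.
import sys

def is_supported(version=sys.version_info, minimum=(3, 0)):
    """Return True if the given version is at least the minimum version."""
    return tuple(version[:2]) >= minimum

print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
if not is_supported():
    print("This script requires Python 3.")
```

If the script reports a version beginning with 2, the OS has picked up the preinstalled interpreter rather than the parallel Python 3 installation.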
On Windows, from Python 3.8 onwards, Python will be set up in your user directory by default, e.g. ‘C:\Users\username\AppData\Local\Programs\Python\Pythonnumber’, where username is your own username and number the version number without the dot, i.e. 38 for version 3.8, and 39 for 3.9. As an administrator on your computer, you can also switch this to installing it for all users by checking the customisation option shown towards the bottom of Figure 1.2, in which case it would normally be installed into the folder ‘C:\Program Files\Python39’.
Figure 1.2 Python installer running on Windows.
You should also ensure that the box for “Add Python X to PATH” – where X stands for the version number – is checked in order for Windows to be able to find the Python interpreter, the program that converts your Python instructions into executable code, and allow you to launch Python programs by double‐clicking from Windows Explorer.
On the Mac, you just need to follow the basic instructions shown in Figure 1.3.
Figure 1.3 Python installer running on macOS.
When installing Python on macOS, there's no need to associate your files with an interpreter or set the path because macOS and Linux handle the execution of programs differently from Windows: the shebang line tells the OS which interpreter to use.
To install Python 3 on a Linux system, you should use whatever package manager is appropriate. However, as Linux distributions differ rather strongly from one another, I cannot describe the installation process in any detail here. As on the Mac, Linux uses the shebang line to ensure that the right version of Python will execute your programs later.
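As a rough illustration of what a shebang line looks like, here is a minimal sketch of a script that starts with one. The form #!/usr/bin/env python3 is a common convention that is assumed here, not necessarily the one introduced in Section 2.5, and the function name uses_python3 is purely hypothetical.

```python
#!/usr/bin/env python3
# The line above is the shebang line: on macOS and Linux it asks the OS
# to run this file with whichever 'python3' is found first on the path.
import sys


def uses_python3():
    """Report whether the interpreter running this script is version 3."""
    return sys.version_info.major == 3


# Show which interpreter was actually chosen to run the script.
print(sys.executable)
print(uses_python3())
```

Python itself treats the first line as an ordinary comment, so the same script also runs unchanged on Windows.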
Go to https://www.python.org.
Find the most up‐to‐date Python 3 version for your OS. Note: If you're using Linux, you won't find an installer on this website, but you should use your package manager for locating one instead.
Download and install Python, making sure that you select the option for adding Python to your path if you're on Windows, and installing certificates on the Mac after completion!
Now that you should have a copy of Python 3 on your computer, we can verify that the installation process ran correctly, and then start investigating how to use it. Before we can do so, though, we first need to explore how it is possible to issue the right commands to your computer in the form of text you type in, which may well be something that you're still unused to.
Most computer users these days are probably more familiar with interacting with their OSes through windows‐based systems, i.e. so‐called Graphical User Interfaces (GUIs – /guːiz/ – for short). However, before such GUIs became prevalent in computing, it was customary to interact with the OS by typing in commands at what is referred to as the command line on Windows, as the console or terminal on Linux, and Terminal on Mac. For the sake of simplicity, from now on, I'll refer to this as the command line.
The command line allows users to input text‐based commands via the command prompt, which is generally signalled via a flashing cursor, and will initially be your only way of running Python or any simpler programs written in Python that don't have a GUI themselves. We'll later learn another, slightly more comfortable, way of starting and testing your programs through WingIDE Personal, the program that I'm recommending you use for writing your Python code. In addition, working with the command line will allow us to learn about some important concepts related to handling files and folders on your computer, which will form an important part of your programming once you start working with stored data from Chapter 5 onwards.
In order to issue commands to all three OSes, you type their name, plus any potential arguments, i.e. other required information such as filenames, etc., and then press Enter to trigger the command. In the next two sections, I'll describe how to access the command line, first for Windows, then for Mac and Linux.
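To make the distinction between a command name and its arguments a little more concrete, the following minimal sketch splits a typed command into these two parts. It relies on Python's standard shlex module, which follows the rules of Unix-style shells, so it only approximates how the Windows command prompt treats quotation marks. The function name split_command is made up for this illustration.

```python
import shlex


def split_command(command_line):
    """Split a typed command into the program name and its arguments."""
    parts = shlex.split(command_line)
    if not parts:
        # Nothing was typed, so there is no name and no arguments.
        return "", []
    return parts[0], parts[1:]


name, arguments = split_command("python -V")
print(name)       # python
print(arguments)  # ['-V']
```

Quotation marks keep an argument that contains spaces, such as a filename, together as one single item.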
To activate the command line on Windows, there are multiple options. Perhaps the simplest one for most users initially is to press the Windows key or click on the Start button, type cmd next to the magnifying glass symbol, and click on Command prompt in the dialogue box shown in Figure 1.4.
Figure 1.4 Activating the command prompt via the Windows Start menu.
Depending on how many programs or files Windows finds that start with the letter c, this option may already be presented to you even if you only press c or cm. As you can see, there are multiple actions available for the command prompt on the right‐hand side of the start menu, other than just clicking to open it. To simplify opening the command prompt, you could for instance pin the icon to your taskbar if it isn't already too crowded, and then have it available with one single mouse click. Another, more important, option you may need later if you've installed Python as an administrator, is that you can also run the command prompt in that capacity, which will then allow you to install additional packages for all users.
Another quick way to access the command prompt is to press the Windows key + r, then type cmd in the ‘Run’ dialogue that will open, and press the Enter key (↲) or click on ‘OK’. If you hold down the ‘Shift’ key (⇧) and the ‘Ctrl’ key while pressing ‘Enter’, you can also open the command prompt as an administrator.
If you're already looking at a folder that contains your programs in Windows Explorer, it's even more convenient to type cmd in the Explorer address bar and press the Enter key. This will open up a command prompt directly at the folder location, so you won't actually need to navigate there once the command line has been opened, which we'll nevertheless soon practise.
To activate the command line on the Mac or Linux, you need to start Launchpad (Mac) or click the start button (Linux), search for Terminal (or a similar name), then execute Terminal. On the Mac, you can also add the Terminal to the dock for convenience. This is usually also the case for Linux panels, but may depend on your exact Linux version.
Open a command line for your OS.
Type in python -V, and press Enter. You should then see the version number of your Python installation reported at the prompt.
Now that you know how to issue commands on the command line, let's take a brief look at which types of programs you can use to write your Python programs in.
For most writing tasks on the computer, we tend to use dedicated word processors like Microsoft Word® or LibreOffice Writer that enable us to apply appealing layouts and formatting to whatever we write. These programs, however, generally store the texts produced in them in such a way that they can more or less only be opened and edited further by whatever program was used to create them, apart from containing many formatting instructions required to generate the display or print them. For writing program code that needs to be readable by the Python interpreter, and ideally editable by different programs available on the different OSes we may want to use, word processors are therefore not useful. The kind of text we need, which contains no formatting or fancy layout apart from perhaps line breaks or indentations, is called plain text, and the programs we can use to edit them are called (plain text) editors. Examples of these would be Windows Notepad®, or TextEdit on macOS. Some of these editors even offer special support for programming languages, such as syntax highlighting for different programming or markup languages (see Section 10.1), but this support, if it exists at all, still tends to be fairly limited.
Better suited to programming tasks are so‐called Integrated Development Environments (IDEs, for short). These offer additional programming support, such as finding errors in code (debugging), advanced syntax highlighting and indentation, syntax completion, etc., sometimes for multiple programming and/or markup languages.
One such IDE that is optimised for Python and markup languages is the WingIDE. It exists in different versions, and is available for Windows, macOS, and Linux. The version that is of interest to us here is the Personal edition, as it's freeware, just like Python, so that you won't have to invest anything other than your time into learning how to program in Python, but can still enjoy a number of features that will greatly simplify your programming tasks. Of course, there are also other freely available IDEs, or you may already have a preferred IDE, so using WingIDE is only a recommendation I'd like to make having evaluated a number of other IDEs. In case you (or your instructor) decide to use a different IDE, you can of course skip the remainder of this chapter, and move straight on to the next one. Before you do so, though, you should at least check to see if the output encoding for your chosen IDE has been set to UTF‐8, which is usually done in the IDE's program preferences. Read on just a little to find out why this may be sensible to do.
The Personal edition of the WingIDE can be downloaded from https://wingware.com/downloads/wing-personal. The exact installation routine depends on your OS, but is generally quite straightforward, so we won't discuss it here, instead carrying it out as part of our next exercise. However, there may be a few settings to modify after the installation. The most important of these is that, at least prior to version 8, the default encoding (see Section 2.3.1) in which files are saved is automatically set to the local encoding on your computer, which is not an optimal choice. Hence, you'll minimally want to change this, setting it to UTF‐8 via ‘Edit → Preferences → Files → Default Encoding’ in order to be able to use non‐English characters properly as well in your code. It's also possible to customise many of the display options, such as setting a larger font, changing the editor background to make it easier on your eyes, displaying line numbers, or even setting a different display language, but these are largely questions of preference, so, again, we won't discuss them here.
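To see why the choice of encoding matters, consider the following minimal sketch, which stores the same word under two different encodings. The encoding cp1252 is only an assumption standing in for a typical local encoding on Western European Windows systems, and the function name encoded_bytes is hypothetical; the local encoding on your own computer may well be a different one.

```python
import locale


def encoded_bytes(text, encoding):
    """Return the bytes that would be stored for text in the given encoding."""
    return text.encode(encoding)


word = "café"
# The accented letter takes two bytes in UTF-8 ...
print(encoded_bytes(word, "utf-8"))   # b'caf\xc3\xa9'
# ... but only one, different, byte in the assumed local encoding.
print(encoded_bytes(word, "cp1252"))  # b'caf\xe9'

# The local default on this computer; the result varies between systems.
print(locale.getpreferredencoding())
```

Because the stored bytes differ, a file saved in one encoding and opened in another may display accented or other non‐English characters incorrectly, which is the problem that a consistent UTF‐8 setting is meant to avoid.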
Download the version of WingIDE Personal appropriate for your OS and install it.
Change the settings for the encoding if necessary.
Familiarise yourself a little with the features and functions of the IDE by looking through the menus, etc.
As we've now covered all the preliminaries, in the next chapter, we can finally begin to learn about some of the essential concepts required to allow you to begin programming.
Provided that you've followed my instructions carefully, there are only a few things that could have gone wrong during the installation, unless of course you're not authorised to install any software on your computer at all, in which case you'll need to ask your administrator to set up Python for you. At the time of completing this book, the most recent Python version was 3.11, but you may need to install a version lower than 3.9 if you should still be running Windows 7, which I wouldn't recommend, anyway.
On Windows, should you have forgotten to tick the box to add Python to your path, things get a little more complicated, because you'll need to add it yourself so that Windows finds the Python interpreter and your programs also run if you double‐click on them in Explorer. To do so, go into the Windows settings, most easily accomplished by pressing the Windows key + i, typing path into the search box (see Figure 1.5), and selecting either ‘Edit the environment variables for your account’ (for non‐administrators) or ‘Edit the system environment variables’ (for administrators). As a non‐administrator, you can then select the Path option in the box at the top, click on ‘Edit…’, and add the path to your Python installation to the end of the path. As an administrator, there's one intermediate dialogue, where you need to click on ‘Environment Variables…’ first, and then follow the same steps as just described.
Figure 1.5 Finding the Path settings.
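If you're unsure whether the path has been set correctly, a short sketch like the following can list the folders on your path and report where, if anywhere, a given program is found. The function names are hypothetical, and the program name python is an assumption, as on some systems the interpreter is called python3 instead.

```python
import os
import shutil


def path_entries(path_value=None):
    """Return the folders listed in a PATH-style string."""
    if path_value is None:
        # Fall back to the path of the current session.
        path_value = os.environ.get("PATH", "")
    # Folders are separated by ';' on Windows and ':' on macOS and Linux.
    return [entry for entry in path_value.split(os.pathsep) if entry]


def find_program(name):
    """Return the full location of a program on the path, or None."""
    return shutil.which(name)


for folder in path_entries():
    print(folder)

# Prints the location of the interpreter, or None if it isn't on the path.
print(find_program("python"))
```

Programs generally read the path when they start, so after changing the settings you'll probably need to open a new command prompt before the change becomes visible.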
Installing the certificates on the Mac, if you've omitted that step, may unfortunately prove necessary in order to be able to download files from the internet as part of the exercises in Chapter 10, so if you should have forgotten to do so, please install them as soon as possible.
Provided that your installation was successful, using the command python -V should output the Python version number, e.g. Python 3.9.7 on my computer, which indicates that I'm running version 3, with minor version 9, sub‐version 7.
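The three parts of the version number can also be extracted in Python itself. The following is only a minimal sketch, which assumes that the report has the form shown above, i.e. the word Python followed by three numbers separated by dots; the function name parse_version is made up for this illustration, and pre‐release versions with additional letters in the number wouldn't be handled.

```python
import sys


def parse_version(report):
    """Turn a report such as 'Python 3.9.7' into three whole numbers."""
    number = report.split()[1]            # the part after 'Python'
    major, minor, sub = number.split(".")
    return int(major), int(minor), int(sub)


print(parse_version("Python 3.9.7"))  # (3, 9, 7)

# The same information for the interpreter that is running this code.
print(sys.version_info[:3])
```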
Should you inadvertently have typed the wrong program name, perhaps pythion, you'll get an error message from the OS, indicating that the program name is not recognised. If you type a different capital letter after python, Python will display some usage information, but in case you've typed a small v or forgotten the argument completely, you'll end up with a different prompt that starts with >>>. This means that you've started the interactive Python interpreter, called the Python Shell, where you can actually already type in the Python commands we'll learn about later, and test different Python constructs. To close this interpreter and return to the OS prompt, simply type in exit() and press Enter.
Downloading, installing, and editing the encoding settings, if required, should be relatively straightforward, provided you download the right installer, follow all the instructions, and of course, have the necessary permissions to install programs on your computer.
When you open WingIDE Personal to explore, you should see that there are multiple sections or panels that offer different types of functionality over and above simply being able to create and edit program code. Initially, the most important part of the interface will be the editor window itself on the top left‐hand side below the menu and toolbars, although we'll later also make use of other components of the IDE window. This window can actually be split, so that you can view multiple files side‐by‐side in order to compare them or copy and paste from one to the other, or also view different sections of a longer program.
On the right‐hand side, spanning all the way from top to bottom of the program window, you'll see a window with a few ‘utility’ tabs that allow you
to manage a project (‘Project’) – something we won't discuss here;
get help on specific Python programming constructs (‘Source Assistant’);
explore or jump to different sub‐parts of your program (‘Source Browser’);
and potentially manage indentation issues (‘Indentation’).
The ‘Indentation’ tab, however, is something you'll probably only need to use if you work with code that may have been created in other editors or by other people.
The bottom left‐hand side is split into two panes, each containing multiple tabs, with the left‐hand pane containing tabs for searching (and replacing, if activated) in the currently active file or a number of files at the same time, as well as the ‘Stack Data’ tab, which is used for advanced debugging purposes we won't discuss in this book. In the right‐hand pane, the two tabs we'll discuss and use later on are ‘Debug I/O’ and ‘Python Shell’, whereas we won't cover the other two again.
The menu bar at the very top, as well as the toolbar below it, contain a number of familiar entries or buttons that essentially exist in most GUI programs, but also a few items that you'll probably still be unfamiliar with, and which relate to various aspects of handling the programming code. If you haven't done so already, I'd suggest that you at the very least try to read through the menus to see which entries you understand and may be useful to you in handling code, and also possibly which keyboard shortcuts you may want to use to increase your efficiency. Of course, there'll be quite a few things that won't make sense to you yet, but you can always try to understand them later, once you've made some progress in your programming career.
