Humans have used technology to expand our limited vision for millennia, from the invention of the stone mirror 8,000 years ago to the latest developments in facial recognition and augmented reality. We imagine that technologies will allow us to see more, to see differently and even to see everything. But each of these new ways of seeing carries its own blind spots. In this illuminating book, Jill Walker Rettberg examines the long history of machine vision. Providing an overview of the historical and contemporary uses of machine vision, she unpacks how technologies such as smart surveillance cameras and TikTok filters are changing the way we see the world and one another. By analysing fictional and real-world examples, including art, video games and science fiction, the book shows how machine vision can have very different cultural impacts, fostering both sympathy and community as well as anxiety and fear. Combining ethnographic and critical media studies approaches alongside personal reflections, Machine Vision is an engaging and eye-opening read. It is suitable for students and scholars of digital media studies, science and technology studies, visual studies, digital art and science fiction, as well as for general readers interested in the impact of new technologies on society.
Cover
Title Page
Copyright
Acknowledgements
Introduction
Seeing more, seeing differently, seeing everything
How vision is situated
Situations and stories as analytical tools
Representational and operational images
Structure of the book
Notes
1 Seeing More: Histories of Augmenting Human Vision
The relationship between humans and technology
Using glass and crystal lenses to see more clearly
The camera obscura and recording images
Linear perspective and operative images
Seeing ourselves through photography
Capturing speed: Muybridge’s horse in motion
Seeing in the dark
Notes
2 Seeing Differently: Exploring Non-human Vision
Technical cognition
The kino-eye: ‘I, a machine, show you the world as only I can see it’
Flusser’s technical images
Biosemiotics and cybersemiotics
Training datasets and learning to see
Cyborg vision or seeing as an assemblage
Notes
3 Seeing Everything: Surveillance and the Desire for Objectivity and Security
Fantasies of omnivoyance in mythologies and religions
Invisible watchers and the modern panopticon
Neighbourhood cameras
Flock cameras as domesticated dragnet surveillance
Oak Park: how local history played into the assemblage
Ring doorbell videos and communal fear
Surveillance as a promise of safety
Machine vision situations as affective assemblages
Does machine vision reduce crime?
Fear and distrust feed the surveillance industry
Notes
4 Being Seen: The Algorithmic Gaze
Normalising faces
The assemblage of an unstaffed grocery store
Controlling access
Watched by benevolent AI
The assemblage of being seen
Notes
5 Seeing Less: The Blind Spots of Machine Vision
Breaking the oppressors’ tools
Hiding from facial recognition
Broken machine vision
Making bodies machine-readable
Notes
Conclusion: Hope
Notes
References
Index
End User License Agreement
Chapter 1
Figure 1
Eadweard Muybridge’s photograph of ‘The Horse in Motion’
Chapter 2
Figure 2
In this shot from Vertov’s Man with a Movie Camera, an image of Elizaveta Svilov…
Figure 3
A still from Man with a Movie Camera (1929) showing a double exposure of a typist’s …
JILL WALKER RETTBERG
polity
Copyright © Jill Walker Rettberg 2023

The right of Jill Walker Rettberg to be identified as Author of this Work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.
First published in 2023 by Polity Press
This work is also available in an Open Access edition, which is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. https://creativecommons.org/licenses/by-nc-nd/4.0/.
Polity Press
65 Bridge Street
Cambridge CB2 1UR, UK
Polity Press
111 River Street
Hoboken, NJ 07030, USA
ISBN-13: 978-1-5095-4524-7
A catalogue record for this book is available from the British Library.
Library of Congress Control Number: 2022948740
The publisher has used its best endeavours to ensure that the URLs for external websites referred to in this book are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.
Every effort has been made to trace all copyright holders, but if any have been overlooked the publisher will be pleased to include any necessary credits in any subsequent reprint or edition.
For further information on Polity, visit our website: politybooks.com
The ideas in this book have grown and developed over many years, and I am indebted to many friends and colleagues for generous discussions and debates. Both the Digital Culture and the Electronic Literature research group at the University of Bergen have been invaluable spaces for sharing and generating ideas. I am grateful for feedback on drafts from my colleagues Gabriele de Seta, Marianne Gunderson, Ragnhild Solberg, Linda Kronman, Joseph Tabbi, Scott Rettberg and Tuva Mossin. Ingunn Lunde generously answered my questions about the meaning of the original Russian title of Man with a Movie Camera and helped me catch a couple of embarrassing misspellings too. Annette Markham gave me wonderfully inspiring feedback on an early stage of the draft at our writing retreat just before the pandemic broke out. My editor at Polity, Mary Savigar, gave very useful feedback on drafts of the manuscript, and the peer reviewers’ feedback was also very helpful. I also want to thank Stephanie Homer at Polity and Caroline Richmond for copy-editing. My developmental editor, Margaret Puskar-Pasewicz, gave me great feedback especially in the early phases of the project, and for the final spurt I’ve leaned on my writing group: Laura Saetveit Miles, Mathilde Sørensen and Sari Pietkäinen. Our writing coach K. Anne Amienne has followed this project from the beginning, when I first realised I could spend project funding on a coach to make sure I do the writing I really want to do.
In Oak Park several people generously shared their thoughts with me and gave me feedback on my analysis of the Flock Safety camera debates. I would particularly like to thank Scott Sakiyama, Kathleen Finn Bell and Alicia Chastain for generously reading and commenting on a draft of the chapter about Oak Park. Thank you also to Emily Bembeneck and Sendhil Mullainathan at the Center for Applied AI at the University of Chicago for inviting me to be a visiting scholar at the center and for their support while I was there.
This book is an outcome of research funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771800).
Most of all, thank you to Scott and to Aurora Jonathan, Jesse and Benji. Thank you for letting me see the world with you.
Argus was a giant with a hundred eyes, older than the ancient Greek gods but a servant to them. He could see in all directions at once and he never stopped watching. Even when he slept, only some of his eyes were closed.
Human vision is far more limited. We have just two eyes in the front of our heads and can’t see what is behind us at all. We see what is straight ahead of us clearly, but our peripheral vision is poor. Many other animals have eyes on the sides or even on the backs of their heads. Some species can see infrared or ultraviolet light. Humans cannot. We can see only in the rather limited way that our eyes, our brains and our bodies enable. And yet, for sighted humans, vision is the primary way we make sense of the world around us.
Humans have used technology to expand our limited vision for millennia. We have imagined mythical creatures such as Argus and created stories about future technologies such as optical implants or holographic phones. The dream and the promise of machine vision is that it will enhance our limited human vision. We imagine that technologies will allow us to see more, to see differently and even to see everything. But each of these new ways of seeing carries with it its own blind spots. The blind spots and distortions of machine vision may be different from the blind spots and distortions of unaided human vision, but machine vision technologies are limited by their own material constraints. Machine vision changes what humans can see. From telescopes and cinematic cameras to facial recognition, smart surveillance and emotion recognition: how will these new extensions of human vision change our perception of the world? What will we not see when seeing with machine vision?
This book analyses the relationships between humans and machine vision technologies by exploring the historical development of technologies that have helped us to see, ranging from the first mirror, which was carved from black obsidian 8,000 years ago, through the telescopes that ignited the scientific revolution, to contemporary networks of surveillance cameras that send automated alerts about suspicious activities to their owners or to law enforcement. Machine vision can create great beauty and make wonderful things possible. New visual technologies enable scientific advances and help to cure diseases. Artists and filmmakers use animation, virtual reality and images generated by deep learning to create breath-taking imagined spaces and images.
Machine vision also comes with many problems and limitations. Algorithmic bias affects machine vision as it does other technologies using machine learning and big datasets. Facial recognition systems are often intrinsically biased; they are better at identifying white men than black women. A neural network trained on internet images with English captions will re-create the bias in the training data, generating images and propagating a version of a world where humans are almost always white, nurses are women, doctors are men and terrorists look Arabic.1
Some visual technologies, such as microscopes or high-speed cameras, allow us to see objects that are too small, too distant, or too fast for the human eye to detect. Others allow us to see wavelengths beyond visible light, such as night vision goggles using infrared to perceive warm bodies in the dark. We use radar and ultrasound and LIDAR to send out signals that bounce off objects, and that allows us to generate three-dimensional models of objects we cannot otherwise see: approaching aeroplanes in the dark or an unborn child sucking its thumb in its mother’s womb. Satellites, drones and networks of surveillance cameras create vast datasets of images that can be processed by computers to find and identify individuals or track changes in ways that were never before possible. Cameras keep watch for us, fastened to doorbells, street signs and buildings. These cameras are automated by artificial intelligence models that recognise faces or car licence plates, sending alerts to their owners or the police when they identify something as suspicious.
I define machine vision as the registration, analysis and representation of visual information by machines and algorithms. Machine vision technologies register visual information and store it as data that can be processed computationally. My definition is intentionally broad to allow us to analyse the larger-scale shifts that are currently taking place in visual representation. I chose the term ‘machine vision’ instead of ‘computer vision’ because I want to include the history of seeing with technologies. New visual technologies were agents of cultural change long before computers. As I’ll discuss in chapter 1, the fifteenth-century invention of linear perspective, coupled with the glass lenses needed for telescopes, was central to the scientific revolution and the modern age. Photography, cinema and other nineteenth- and twentieth-century imaging technologies likewise brought societal change and scientific advances. Machine vision has advanced exponentially in the last decade due to AI: there has been rapid progress in machine learning models trained on massive datasets. Will these new technologies lead to new paradigms, as happened with telescopes in the Renaissance and photography in the nineteenth century? This book aims to contribute to our understanding of what we are becoming.
The idea of artificial intelligence (AI) is at least as old as the ancient Greeks, as Adrienne Mayor demonstrates in her book Gods and Robots, but it was in the 1950s that advances in computer science made actual thinking machines begin to seem feasible. Let me try to explain how the AI used to generate images or recognise faces works. Two main strands of AI have been developed since the 1950s.2 The first, symbolic AI, is based on the idea that a form of common sense could be explicitly coded as a set of rules or algorithms that would allow a computer to think rationally. The second, subsymbolic AI, is based on machine learning from data. With machine learning, a computer program is written to analyse a dataset and infer its own rules from patterns it finds in the data. Until the 1990s, symbolic AI seemed the most likely to succeed. However, with the extreme expansion of available training data due to internet content, along with increased processing power, subsymbolic AI or machine learning took off. This led to radical improvements first in machine vision and soon after in large language models that can generate news stories, summarise texts or act as a very convincing conversation partner.3 Current AI is impressive, but it can still only do specific tasks, such as classifying images, playing a game of chess or generating text that looks similar to something a human might write. Some people think that in time this will lead to artificial general intelligence (AGI) – that is, a computational system that, like humans, can do many different tasks and that might even be sentient. I love reading science fiction about sentient AI, but I think this is still firmly fiction. In chapter 2 I’ll discuss how AI can be said to cognise rather than think in a self-reflective way as humans do. The way AI cognises is quite different from human cognition, and that means that AI-driven machine vision is quite different from human vision.
Machine learning was first used in image recognition. In 1957, Frank Rosenblatt proposed ‘the perceptron’, a single-layer neural network that could read handwritten numbers.4 In this kind of machine learning, individual units (‘neurons’) are trained on a set of images labelled by humans. For instance, an image of a cat is labelled ‘cat’ and an image of a dog is labelled ‘dog’. The units are given random numeric values to start with and the program adjusts their values based on the input. The input from all values is then combined and checked against the label. Imagine that the correct value for ‘cat’ is 1 and ‘not cat’ is 0, and the first round of training produces the score 0.6. The model is now given the information that the image is a cat and the value should be 1. Then it goes through the data again, changing its processes more or less at random. If the score after the second round is closer to 1, the model learns that, whatever its new strategies were, they were better. It tries again, becoming less random with each round as it learns which strategies are successful and which are not. After many such rounds, the model will be trained and able to identify the image of a cat that it was trained on. But it may not be able to identify a photo of a new cat.
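To make this training loop concrete, here is a minimal sketch in Python. It uses the classic perceptron update rule rather than the purely random adjustments described above, and the ‘images’ are invented three-pixel rows, so treat it as an illustration of the idea rather than a reconstruction of Rosenblatt’s system.

import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    # Nudge the weights toward the correct label whenever a prediction is wrong.
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])    # random starting values, as in the text
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):   # target is 1 ('cat') or 0 ('not cat')
            pred = 1 if xi @ w + b > 0 else 0
            error = target - pred      # 0 when correct, +/-1 when wrong
            w += lr * error * xi       # adjust values based on the input
            b += lr * error
    return w, b

# Toy data: each row is a flattened 'image' of three pixel intensities.
X = np.array([[0.9, 0.1, 0.8], [0.2, 0.9, 0.1], [0.8, 0.2, 0.9], [0.1, 0.8, 0.2]])
y = np.array([1, 0, 1, 0])
w, b = train_perceptron(X, y)
print(1 if X[0] @ w + b > 0 else 0)    # classifies the first training image as 'cat'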
In the 1970s, deep learning was proposed, where there are several layers of ‘neurons’, each layer feeding its results to the next. Deep learning produced better results than Rosenblatt’s single-layer neural network but was not as successful as symbolic AI and was not developed much further until the 1990s.5 A major shift occurred in 2010, when researchers gained access to big data generated on the internet and to far higher computing power. Deep learning (also often called ‘neural networks’) made rapid advances, driven first by image recognition trained on ImageNet, a database of images scraped from the internet that was semantically organised using WordNet. Kate Crawford and Trevor Paglen’s artwork ImageNet Roulette and their accompanying essay explain how this works and demonstrate how problematic the results can be. WordNet includes categories that cannot be unambiguously expressed in images (such as ‘sex worker’) as well as slurs and other problematic terms, so when the categories are used to classify images, and especially images of people, problems arise.6 Rapid advances were also occurring in large language models (LLMs), which are trained on vast amounts of writing from the web and from books. By 2017, both text generation and image recognition gave impressive results.7 Self-supervised learning also came to the fore, meaning that datasets no longer have to be annotated by humans before being used as training data for a machine learning model.
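The layered structure described above is easy to show in code. This sketch (with invented layer sizes and random, untrained weights) passes a flattened input through three layers of ‘neurons’, each layer feeding its results to the next, before squashing the final output into a score between 0 and 1.

import numpy as np

def relu(x):
    return np.maximum(0, x)           # a common 'activation' between layers

rng = np.random.default_rng(0)
x = rng.random(64)                    # a tiny flattened input 'image'

# Three layers; in a trained network these weights would be learned, not random.
w1 = rng.normal(size=(32, 64))
w2 = rng.normal(size=(16, 32))
w3 = rng.normal(size=(1, 16))

h1 = relu(w1 @ x)                     # first layer feeds the second
h2 = relu(w2 @ h1)                    # second layer feeds the third
score = 1 / (1 + np.exp(-(w3 @ h2)))  # squash to a 0-1 'cat' score
print(float(score[0]))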
In 2021, a group of Stanford researchers coined the term ‘foundation model’ to describe models that use deep learning at such scale that they gain new capabilities, in particular homogenisation and emergence.8 They have a homogenising effect because one model is used for many tasks, which can give more stability but also means any defect or bias will be inherited by all downstream applications. Emergence is another key feature: these models have unanticipated effects. For instance, the developers did not expect that large language models would be able to generate text. Foundation models are so expensive to train that, as of 2022, only big tech companies can afford to train them, but they are then fine-tuned and put to many other downstream uses.
As I finish writing this book, image generation models such as DALL-E, Midjourney and Stable Diffusion are capable of generating photorealistic images from written prompts, and large language models such as GPT-4 can have convincing conversations and answer general knowledge questions, though still with some factual errors. These models depend upon the deep-learning structure I described above, but they are trained on even more data and with even more parameters. In my simple example above, where ‘cat’ is 1 and ‘not cat’ is 0, there is just one parameter – cat or not cat. Current models can be trained on more than a billion parameters. To an AI model, that cat is understood as a vector – that is, a list of numeric values, one for each parameter. Perhaps the vector for cat is [0.642, 0.231, 0.932, …], and so on. Once trained, the model no longer has access to the original photos. Instead it operates with what is called a vector space or semantic space, or sometimes just space, where all the vectors are organised in a multidimensional grid. Remember those coordinate grids you drew in seventh grade, where you plotted a point on an x–y grid? To find the point [1,4] you draw a line from 1 on the x-axis and 4 on the y-axis and see where the lines meet. The vector space or semantic space of a machine learning model is like that, but each parameter is an axis. There isn’t just an x-axis and a y-axis, but a z-axis and a billion more dimensions. I doubt you can imagine that visually, but powerful computers can compute it.9 Latent space is another term that is used in machine learning research: this is a lower-dimension version of the vector space that can be sufficient to generate new data that is similar to the training data. The important thing to remember is that a trained deep-learning model does not directly access the training data; it accesses only this multidimensional set of vectors describing different features of the dataset, such as words or concepts or characteristics of images.
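As a hedged sketch of what ‘nearby in vector space’ means, the snippet below reuses the invented cat vector from the paragraph above and adds two more made-up vectors. Real models use vastly more dimensions, but the arithmetic is the same: cosine similarity measures how closely two vectors point in the same direction.

import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical three-parameter vectors; the values are invented for illustration.
cat = np.array([0.642, 0.231, 0.932])
kitten = np.array([0.630, 0.250, 0.910])
car = np.array([0.120, 0.880, 0.050])

print(cosine_similarity(cat, kitten))  # close to 1: neighbours in the space
print(cosine_similarity(cat, car))     # much lower: far apart in the space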
Image generation models such as DALL-E are trained on images with captions from the web.10 Users can write a prompt describing an image, and the model will generate images based on the concepts it has learned from the dataset. These concepts can be surprisingly complex. For example, OpenAI’s CLIP model has a specific neuron (or unit) that has learned to respond to the concept ‘spider’ and can use it to group drawings of spiders, the written word ‘spider’ and pictures of Spiderman. Reading the paper announcing these ‘multimodal neurons’, you can sense the wonder of the researchers, who describe such a model almost as though it is a child: ‘Some neurons seem like topics out of a kindergarten curriculum: weather, seasons, letters, counting, or primary colors. All of these features, even the trivial-seeming ones, have rich multimodality, such as a yellow neuron firing for images of the words “yellow”, “banana” and “lemon”, in addition to the color.’11 The paper, which is rich with interactive visualisations, goes on to show how emotions such as ‘happy’ or ‘sleepy’ can be identified across facial expressions or body language, and how concepts can also connect to their opposites.
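A small version of this grouping can be reproduced with OpenAI’s open-sourced CLIP package, which embeds images and captions into the same vector space and scores how well they match. The sketch below assumes a local file named spider_drawing.png; the candidate captions are arbitrary.

import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("spider_drawing.png")).unsqueeze(0)
texts = clip.tokenize(["a drawing of a spider", "a photo of Spiderman", "a banana"])

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)
print(probs)                    # the highest score should land on the spider caption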
If you would like a deeper understanding of the technical aspects of AI that contemporary machine vision builds upon, I recommend Melanie Mitchell’s book Artificial Intelligence and Kate Crawford’s Atlas of AI.12 Both of these books give solid but accessible explanations for a general audience. The first few pages of the Stanford report on foundation models also provide a brief but relatively accessible technical explanation. Mark Andrejevic and Neil Selwyn’s book Facial Recognition details the historical development of facial recognition in particular and explains how this specific technology works in more detail than I can here. OpenAI, Meta and Google also provide accessible explanations to many of their models on their websites. These often include interactive visuals, as well as links to the research papers describing each model.
The focus of this book is how different kinds of machine vision allow humans to see in new ways. Without technology, human vision is situated in two eyes and a brain that processes their visual input. With access to home surveillance cameras and DALL-E and satellite images of my neighbourhood, I can see a lot more than just what is straight in front of me.
The chapters of this book will explore ways in which machine vision expands or escapes the situatedness of human vision: by seeing more, by seeing differently, by seeing everything, by being seen and, finally, by exploring what machine vision does not see.
Vision is always situated. I use ‘situated’ in a sense established by Donna Haraway in her influential article ‘Situated knowledges’, which was published in 1988. Haraway argues that the closest we can get to objective knowledge is to acknowledge that we always have only a partial perspective. Visual technologies, from satellite surveillance to medical imaging, seem to promise the impossible: ‘the god trick of seeing everything from nowhere’, as Haraway writes.13 In contrast to this ‘god trick’, Haraway argues that knowledge is embodied. Therefore ‘objectivity turns out to be about particular and specific embodiment.’14 When I write that vision is situated, I mean that we always see from our own situation in the world, from a particular standpoint and within the limitations of the physical constraints of our bodies. When I look out of my window, I see a view of my neighbourhood that is slightly different from what a neighbour would see from their window, and quite different from what a satellite image of the neighbourhood would capture. What I see is also situated by how sharp my vision is, by my personal experiences (do I know who lives in each building or what it means that my neighbour hasn’t put the trash out as they usually do), by the time and season (is it dark or light), and many other things.
Machine vision technologies often present dazzling overviews that appear to escape this situatedness: satellite images showing the globe in amazing detail, images of distant galaxies or of the microscopic worlds inside the cells of our bodies. These kinds of image appear to be able to show the world as though we are outside of it. They appear to be objective and to show the world as it really is. Haraway argues, and I agree with her, that this objective outside view is impossible.
Saying that vision is situated also means that seeing is embodied. What we can see is shaped by the constraints of our bodies or by the constraints of the technologies we see with. We see with two eyes, not a hundred, and, unlike many species, we have poor peripheral vision. I’ll return to how different species and different technologies see differently in chapter 2.
When we use machine vision we are no longer entirely bound to our human point of view or to the limitations and affordances of our eyes and our brains. We can see the Earth from outer space, or the blood vessels inside our bodies; we can see the heat of bodies 30 kilometres away15 or capture the motion of a galloping horse in a high-speed photograph, where we would otherwise see nothing but a blur. Machine vision can make distant events feel very close, as when we see live videos of war atrocities, a carjacking captured by a neighbour’s doorbell camera, or TikTok videos recorded in a teenager’s bedroom. New visual technologies such as searchable satellite images and electron microscopes and VR glasses are all situated and thus limited ways of seeing, but we easily forget this. It is easy to be swept away by the promotional material and the gorgeous visuals. Perhaps we are also a little seduced by Haraway’s ‘god trick’, or what José van Dijck calls dataism: ‘the ideology of dataism shows characteristics of a widespread belief in the objective quantification and potential tracking of all kinds of human behavior and sociality through online media technologies.’16 This trust in technology as an almost divine power will be a recurring theme in this book.
Machine vision is non-human in that it allows us humans to see things that would otherwise be invisible to us. At the same time machine vision is completely human: humans imagine it, humans design it and humans use it. Machines do not see without us, or, perhaps more precisely, they would not see without us. Machines depend on humans as much as humans depend on machines. Machine vision doesn’t ‘see’ alone. Rather, its sensory apparatus – its hardware and the algorithms it uses to process data – is always part of an assemblage that humans also participate in.
I understand machine vision technologies not as technological monoliths that inevitably determine human behaviour but as participants in assemblages where humans, technologies and cultural contexts act together. By focusing on the assemblage more than on the technology itself, I build upon posthumanist and feminist theories that emphasise relationships between humans and non-human agents such as technologies, institutions and our natural environment. The prefix post in posthumanism indicates that it comes after the humanism that began in the Enlightenment era, when the human was seen as the centre of the universe, the subject who could rule and control all other creatures and entities. For this master human subject, technology, the environment and even other groups of humans were seen primarily as objects or tools. Posthumanism emphasises relationships and mutual interconnection instead of the binary opposition between an active subject and a passive object. The concept of the assemblage helps us see how different agents come together in different constellations in different contexts.
We don’t fully control the technologies we use, and the technologies don’t fully control us. By being aware of the assemblages we choose to enter into (or that are thrust upon us) we can start to untangle how technologies work in specific contexts. Then we can try to design assemblages that help build the kinds of communities and societies we want to live in. To understand technology, then, we also need to understand the assemblages it participates in. I’ll go into more detail in chapter 2 about what it means to use the concept of assemblages to think about technology.
The assemblages don’t consist only of humans and machines; cultural and regulatory contexts are also important. This book was written partly in Norway, my usual home, and partly in the USA, my temporary home for the first half of 2022. The contrasts between the two countries seemed stronger this time than on my previous visits, with anxiety ratcheted sky high in the USA due to the pandemic, to rising crime rates and to political tensions. The more I learned about how technologies are discussed and used in the Chicago suburb where I was living, the more I realised how differently these technologies were being adopted and understood there compared to my own home environment in Norway.
Technology does not have the same effects in all contexts. The mere existence of surveillance technologies such as automated licence plate readers or facial recognition does not necessarily mean that everyone will use them or that they will be used in the same way in every context. Even within one country different technologies can be regulated or viewed very differently. In the USA, it is far easier to install facial recognition cameras in a school than to ban guns. Local US police departments can combine data from licence plate readers with hundreds of other public and private data sources with little regulation, but there are no central gun registries. That information is protected by strong political lobbies.17 This means that it is far easier to implement smart surveillance systems across the USA than it would be to change gun control laws. In Norway, the private smart surveillance systems that have spread across the USA are for the most part illegal because of strong privacy legislation. These political and institutional structures are also important participants in the assemblages machine vision enters into.
One method I use to analyse the relationship between humans and machine vision technologies is exploring specific examples of situations where humans and technologies act together. Some of these machine vision situations are fictional or imagined and some are real.
The term ‘machine vision situation’ comes from my work with a stellar group of researchers on a digital humanities project to create a database documenting how machine vision technologies are represented in digital art, video games and narratives such as movies and novels.18 We wanted to explore how humans and machine vision technologies interact in assemblages where agency is distributed rather than framing the human as using technology as a tool. Working as a team, Ragnhild Solberg, Marianne Gunderson, Linda Kronman and I developed a model for analysing situations in the artworks, games and narratives that involved machine vision technologies. We identified agents in each situation and described actions they took in a structured way, so we could use data analysis and data visualisations to see overall patterns across the 500 novels, movies, video games and artworks we analysed. We discussed and wrote about our interpretations of how machine vision was used and represented in individual works, too, and discussed real-world examples with input from our collaborator Gabriele de Seta.19
Spending so much time reading, playing, watching and analysing art, games and narratives about machine vision gave us a very broad overview of how machine vision technologies are portrayed in fiction and art. In this book I draw upon many examples from these works, especially from science fiction literature and film. Throughout you will find short readings of artworks, movies, games and novels where machine vision technologies are central. I interlace the more theoretical discussions with these analyses of fiction because fiction allows for another mode of understanding new technology, one that enables a more emotional and often more visceral, embodied kind of insight. You have probably noticed the surge in the popularity of science fiction in recent years. The most popular science fiction today deals with the near future. Series such as Black Mirror exaggerate contemporary issues just a little bit to make the ethical dilemmas even more acute: what happens when everyone has an implant that records everything they see or hear, as in ‘The entire history of you’, or when a mother implants her child with the Arkangel system, allowing the mother both to see everything the child sees and to alter the child’s sight so that ‘inappropriate content’, such as blood, is filtered out and not seen by the child?20
Artists are also exploring machine vision, both as spectacle and in more critical ways. Refik Anadol’s gorgeous, crowd-pleasing Machine Hallucination installations use neural networks trained on thousands of images of cities to generate videos showing new, dream-like skyscrapers rising and falling, like the cities we know but strange. Other artists use machine vision technologies for critique and exploration of new situations that may become common. For instance, Lauren McCarthy and Kyle McDonald’s artwork US+ is a plug-in to be used during video chats that analyses users’ facial expressions and gives live advice about how to improve their interpersonal relationship. Video games are another popular medium where explorations of machine vision are common, whether as a playful aspect of the interface, as in the augmented reality of Pokémon GO, or as a substantial element in the story. The Watch Dogs games let players view and control the game world through surveillance systems,21 while an indie game such as Samantha Gorman’s Tendar lets players adopt a virtual guppy that must be fed with emotions that it harvests from the player’s smile using emotion recognition algorithms.22
Watching movies, playing games, reading novels and experiencing artworks are important ways in which people think through possible situations that may occur with new technologies such as machine vision. The imaginary worlds of stories, games and art allow us to explore an emotional engagement with new technologies and the possible societal and ethical changes that may come with them. This emotional engagement tends to be lacking from computer science textbooks or patents for new smart home technologies. Through empathy with characters in fictional situations, we imagine how we ourselves would react and what choices we would make. By interacting with games and digital artworks, we can make choices without the consequences of real life. The affective relationship we have with art, stories and games lets us explore a sensory knowledge and develop our sense of what technologies might lead to and what technologies would be good for us – or not so good.
To understand how machine vision is affecting the way we humans see and relate to the world around us, we need to understand the relationships between humans and technologies. A few years ago, I proposed situated data analysis as a method for understanding how data is used and presented on various platforms. Situated data analysis explores how the same data is framed – or situated – in different ways for different audiences and purposes.23 It is about following the data, and machine vision converts the visual to data. A situated data analysis could be a useful method for examining how data from automated licence plate readers, for instance, is presented to police and processed in different situations, ranging from alerts received by officers, to dashboards the police department can use to analyse traffic flow, to the predictive policing algorithms that the data can feed into. In this book, however, I am interested less in the data itself and more in how we humans are affected by machines and in the technologies that sense and process the data. Focusing on stories, situations, assemblages and emotions allows me to bring that affect and those relationships into my analysis.
Human sight is the ability to perceive and interpret electromagnetic radiation, or light, in the visible spectrum. Our eyes and brains sense and process the light in our surroundings to create an image of the world that we use to orient ourselves. Machine vision technologies can also sense light, but they do not need to convert it into an image. They process light as data. Humans interpret different wavelengths of light as having different colours. Having input from two separate eyes, our brains interpret our stereoscopic vision as information about depth and distance. A self-driving car senses a lot of the same data about the environment as we do, in addition to other data such as GPS locations from satellites and data from the car and its engine. But there is no need for the computer to convert the data it gathers into a visual image, a two-dimensional representation of visual data. Instead, it processes the zeros and ones of its machine-readable data to calculate how it should respond to its surroundings. If we can even call this an image, it is a very different kind of image to the ones we are used to seeing in art museums, on the front of magazines or in YouTube videos and Instagram feeds. The car may well represent the data in visual form on a screen for the driver or passengers to see, but this representation is not necessary for the car to function.
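A toy illustration of this point: the controller sketched below acts directly on raw range readings and never assembles a two-dimensional picture. All the numbers and the safety margin are invented for the example.

# Hypothetical range data: metres to the nearest obstacle, one value per sensor beam.
readings = [42.0, 18.5, 3.2, 25.0]

def decide(readings, safety_margin=5.0):
    # The decision is computed straight from the data; no image is ever rendered.
    return "brake" if min(readings) < safety_margin else "continue"

print(decide(readings))  # -> "brake", because one beam reports 3.2 metres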
A useful distinction can be made here between representational images, where the main point of the image is to show something, and operational images, where the main point is to do something. A snapshot from a family holiday or a painting on a gallery wall is a representation, whereas the images captured by the camera of a self-driving car are operational.
The term operative image was coined by the filmmaker Harun Farocki in 2001 in connection with his artwork Eye/Machine. In 2004 he defined the term more explicitly: operative images ‘are images that do not represent an object, but rather are part of an operation.’ In 2014, the artist Trevor Paglen developed the idea further:
[T]he machines were starting to see for themselves. Harun Farocki was one of the first to notice that image-making machines and algorithms were poised to inaugurate a new visual regime. Instead of simply representing things in the world, the machines and their images were starting to ‘do’ things in the world.24
In practice, many images are both representational and operational. For instance, passport photographs have been used for more than a century as a means of verifying the bearer’s identity and are representations of the bearer’s face. With electronic processing of passports, the photos are also stored in databases where they can be processed and used for automatic identity verification. There is still a photograph representing your face in your passport, but more important is the digitally stored information about your face that is processed by a computer and compared to the data captured by the camera as you stand waiting for the gate to open. This digitally processed photograph is operational.
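In code, this operational use of a face looks roughly like the sketch below, written with the open-source face_recognition library: the stored passport photo and the fresh gate capture are both reduced to embedding vectors, and what is compared is the distance between them. The filenames are assumptions for illustration; 0.6 is the library’s conventional matching threshold.

import face_recognition  # pip install face_recognition

passport = face_recognition.load_image_file("passport_photo.jpg")
gate = face_recognition.load_image_file("gate_capture.jpg")

# Each face becomes a 128-number vector; the 'image' that matters here is numeric.
passport_enc = face_recognition.face_encodings(passport)[0]
gate_enc = face_recognition.face_encodings(gate)[0]

distance = face_recognition.face_distance([passport_enc], gate_enc)[0]
print("gate opens" if distance < 0.6 else "manual check")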
The ‘operational images’ that are generated and processed by the autonomous car or the passport gates at the airport, or by any number of other machines, are clearly not representational in the sense that the Mona Lisa or a movie are representations. But they are still constructed. The very act of deciding which data to collect shapes that data. The original ‘Blue Marble’ image, the photograph of the earth as seen from space, first released by NASA in 1972, was a snapshot captured on an analogue camera by an astronaut. But, as Laura Kurgan discussed in her book Close Up at a Distance, newer ‘photographs’ of the Earth as seen from space are the product of data processing rather than the capture of light that we know from analogue or optical photography. In these photorealistic images of an Earth with no cloud cover and perfect lighting, there is no direct relationship between what we see in the images produced by machine vision and the real world. It’s a ‘god trick’, as Haraway would say. Truth in such images is no longer a question of ‘seeing is believing’. Instead, as Laura Kurgan wrote, truth ‘is intimately related to resolution, to measurability, to the construction of a reliable algorithm for translating between representation and reality.’25
Once we realise that images aren’t just representational, we can begin to think more about what else images can do. If we understand ‘operative images’ as images that contain data and instructions for using that data, maybe we could say that all images are operative: they encode visual information in a way that can be processed by our eyes and brains and interpreted as a representation of something actual or imagined. Abstract art and architecture can cause us to feel in certain ways. We can also think of diagrams, maps and visualisations as operative images.
Carolyn L. Kane sees the decline of representational images as such a fundamental aspect of today’s society that she calls our time post-optical, arguing that we no longer use sight and visual elements as ends in themselves but as means to another end.26 Kane is particularly interested in colour, and she gives the example of chromakey video, where producers use a blue or green background – not because it will look good in the final image but so that the colour will ‘negate itself’, as Kane writes: the blue or green pixels will be replaced by another background image. In brain imaging, synthetic fluorescent proteins are inserted so that the final image can display the colourful flows to map brain function. Colour used to give us information to help us interpret our surroundings, but its function has changed: ‘Color is not exclusively about vision’, as Kane writes. ‘Rather, it is a system of control used to manage and discipline perception and thus reality.’
Kane’s term ‘post-optical’ is a nod to Friedrich Kittler’s monumental book Optical Media, a book composed of lectures he gave in 1999 on the material and technological development of media. Optical media, in Kittler’s framework, are media that can be seen and interpreted by the human eye at any point. Kittler never uses the term ‘post-optical’, but he describes the concept in his discussion of electronic media such as television: ‘In contrast to film, television was already no longer optics. It is possible to hold a film reel up to the sun and see what every frame shows. It is possible to intercept television signals, but not to look at them, because they only exist as electronic signals.’27
Contemporary machine vision is certainly post-optical in this sense. The computationally processed sensor data that allows a self-driving car to navigate is not something we can look at or perceive in any straightforward manner. When an AI model trained on hundreds of thousands of images classifies new images, it calculates statistical probabilities that an image represents a specific object. It doesn’t explain why.
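The sketch below, using a stock torchvision classifier on an assumed local photograph, shows what this looks like in practice: the output is a ranked list of probabilities and nothing resembling a reason.

import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT     # a standard pretrained classifier
model = models.resnet50(weights=weights)
model.eval()
preprocess = weights.transforms()

img = preprocess(Image.open("street_scene.jpg")).unsqueeze(0)  # hypothetical file
with torch.no_grad():
    probs = model(img).softmax(dim=-1)[0]

top = probs.topk(3)
for p, idx in zip(top.values, top.indices):
    # Statistical confidence per category, with no account of why.
    print(f"{weights.meta['categories'][idx]}: {p:.2%}")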
