The FiESTA Data Model

Peter Menke




Abstract

This thesis presents a data model for the representation of experimental data collections from the field of research on multimodal communication. We collected evidence for the existence of problems and shortcomings in the work with several annotation tools by means of (a) an analysis of existing multimodal corpora and (b) a survey among researchers working with multimodal data. On this basis we argue that, despite the fact that there are numerous data models and formalisms for the representation of classic text-based corpora, these are not suited for multimodal data collections. As a consequence, we developed a data model that takes into account the properties of such multimodal corpora. In particular, it supports temporal and spatial references, flexible data types, and controlled transformations between the file formats of several annotation tools.

Zusammenfassung

Diese Dissertation stellt ein Datenmodell zur Repräsentation experimentbasierter Datensätze aus dem Forschungsgebiet der multimodalen Kommunikation vor. Es werden Belege für die Existenz verschiedener Probleme und Unzulänglichkeiten in der Arbeit mit multimodalen Datensammlungen aufgezeigt. Diese resultieren aus (a) einer Analyse bestehender multimodaler Korpora und (b) einer Umfrage, an der Wissenschaftler_innen teilgenommen haben, die zu konkreten Problemen in der Arbeit mit ihren multimodalen Datensammlungen befragt wurden. Auf dieser Grundlage wird herausgearbeitet, dass trotz der Existenz einer Vielzahl von Datenmodellen und Formalismen zur Darstellung klassischer Textkorpora sich diese nicht eignen, um die den multimodalen Korpora eigenen Besonderheiten abbilden zu können. Aus diesem Grund wird ein Datenmodell entwickelt, das all jene spezifischen Eigenschaften multimodaler Korpora zu berücksichtigen sucht. Dieses Datenmodell bietet Lösungen speziell für die Arbeit mit einer oder mehreren Zeitachsen und Raumkoordinaten, für die Darstellung komplexer Annotationswerte, und für die Transformation zwischen verschiedenen (bisher inkompatiblen) Dateiformaten verbreiteter Annotationswerkzeuge.

Acknowledgments

There are many people to whom I am grateful with respect to this thesis, and I cannot list them all. Thus, I hope not to disappoint anybody by providing the following explicit enumeration (and thereby giving the false impression of a closed set). I am grateful to

ALEXANDER MEHLER, kickstarter of this endeavour.

DAVID SCHLANGEN, supervisor and provider of several helpful hints and comments.

PETRA WAGNER, for being willing to be the second reviewer of this thesis.

HANS-JÜRGEN EIKMEYER, for highly helpful insights and suggestions.

BARBARA JOB, for the repeated misappropriation of the “Forschungsseminar Sprache und Kommunikation / Linguistik romanischer Sprachen” in order to present snapshots of this thesis project (I hope the audience pardons me the mathematical shock treatment).

PHILIPP CIMIANO, one of the principal investigators in Project X1, who procured room for my work on this thesis whenever it was necessary.

KIRSTEN BERGMANN, ANOUSCHKA FOLTZ, FARINA FREIGANG, PETRA JAECKS, ALEXANDER NEUMANN, CHRISTIAN SCHNIER, AND PETRA WEISS, the people (in alphabetical order) who provided me with insights into the data structures of the corpora I used as my example collection. Many thanks to those who helped to unveil the mysteries of many of the yet-to-be-published data collections in the CRC.

FARINA FREIGANG, FLORIAN HAHN, AND SARA MARIA HELLMANN, beta testers of the pilot version of the multimodal corpus survey. I am grateful for several helpful comments that led to substantial improvements of the questionnaire.

MY FAMILY, who had to cut back, especially in the last months.

SEBASTIAN MENKE, for endless patience, support and supply — be it with (i. a.) encouragement, coffee, mathematical emergencies, or proofreading. Thank you.

Peter Menke

Bielefeld, October 2015

Contents

I Introduction

1 Introduction
1.1 Overview
1.2 Motivation
1.3 Relations to other works or publications
1.4 Conventions and declarations
1.4.1 Mathematical conventions
1.4.2 Bibliographic conventions

II Background: What are multimodal data?

2 Data
2.1 Introduction
2.2 An examination object
2.3 Data
2.3.1 Data are produced by mappings
2.3.2 Communicative events are transient
2.3.3 Complex mappings
2.3.4 Provenance of mappings
2.3.5 Conclusion
2.4 Primary and secondary data
2.4.1 Lehmann’s semiotic approach
2.4.2 Primary and secondary data in Gesprächsanalyse
2.4.3 Approaches from text and corpus linguistics
2.4.4 An approach from the area of speech databases
2.4.5 An improved definition
2.5 Conclusion

3 Modalities and multimodality
3.1 Introduction
3.2 Modalities in social semiotics
3.3 A linguistic and semiotic approach to modalities
3.3.1 Perception
3.3.2 Production
3.3.3 Coding systems
3.4 Conclusion

III State of the art: multimodal corpora and exchange formats for multimodal data

4 Example Data Collection
4.1 Overview
4.2 The Jigsaw Map Game (JMG) Corpus
4.2.1 Scientific background
4.2.2 Experimental setup
4.2.3 Data
4.3 The Bielefeld Speech and Gesture Alignment (SaGA) Corpus
4.3.1 Scientific background
4.3.2 Experimental setup
4.3.3 Data
4.4 The B6 Chat Game (CG) Corpus
4.4.1 Scientific background
4.4.2 Experimental setup
4.4.3 Data
4.5 Corpora dealing with emotions (EMO)
4.5.1 Scientific background
4.5.2 Experimental setup
4.5.3 Data
4.6 The Obersee (OBS) Corpus
4.6.1 Scientific background
4.6.2 Experimental setup
4.6.3 Data
4.7 The Tagesgespräch (TAG) Corpus
4.7.1 Scientific background
4.7.2 Provenance
4.7.3 Data sets

5 File formats and usability of annotation tools
5.1 Praat
5.1.1 The tool
5.1.2 The file formats
5.1.3 Evaluation
5.2 ELAN
5.2.1 The tool
5.2.2 File formats
5.2.3 Evaluation
5.3 Summary

6 An analysis of the example data collection
6.1 Issues identified in the example data sets
6.1.1 Hierarchies and associations between annotations
6.1.2 Descriptive annotation values
6.1.3 Complex annotation values
6.1.4 Vocabularies emerge during annotation
6.1.5 Values and boundaries of annotations change
6.1.6 Consistency
6.1.7 Data semantics
6.1.8 Documentation
6.2 General observations
6.3 Discussion

7 A survey about multimodal data and annotation tools
7.1 Introduction
7.2 Setup and participants
7.2.1 Setup
7.2.2 Dissemination
7.2.3 Participants
7.3 Results
7.3.1 Personal background
7.3.2 Selected data sets
7.3.3 Tool prominence and usage
7.3.4 Activities
7.3.5 Inside-outside ratio
7.3.6 Correlations between inside-outside ratio and other aspects
7.3.7 Searching
7.3.8 Visualisation
7.3.9 Expressing relations between annotations
7.3.10 File exchange
7.3.11 Working with Controlled Vocabularies (ELAN only)
7.3.12 Helpfulness
7.3.13 Additional comments
7.4 Discussion and conclusion

8 Generic exchange formats: motivation, advantages and criteria
8.1 Motivation
8.2 A catalogue of criteria
8.2.1 General considerations
8.2.2 Conversion, transformation, and merging
8.2.3 Referencing primary data
8.2.4 Structure
8.2.5 Annotation values
8.2.6 Meta-information
8.3 Conclusion

9 Existing candidates for an exchange format for multimodal data sets
9.1 Introduction
9.2 Generic theories and formalisms
9.2.1 The Annotation Graph paradigm
9.2.2 The NITE Object Model
9.3 Solutions associated with particular tools
9.3.1 Praat
9.3.2 ELAN
9.3.3 Anvil
9.3.4 EXMARaLDA and Folker
9.3.5 Interim findings
9.4 Text-focused solutions
9.4.1 The Text Encoding Initiative and the TEI guidelines
9.4.2 CES and XCES
9.4.3 LAF and GrAF
9.4.4 PAULA
9.4.5 SALT (and Pepper)
9.4.6 Interim findings
9.5 Solutions for multimodal corpora
9.5.1 ATLAS
9.5.2 “Exchange Format for Multimodal Annotations”
9.6 Conclusion

IV The generic exchange format and framework FiESTA

10 The FiESTA data model
10.1 Introduction
10.1.1 Goal of this part
10.1.2 Concepts and conventions
10.2 The basic structure of a FiESTA document
10.2.1 Header
10.2.2 Scales
10.2.3 Relating and aligning scales
10.2.4 Items
10.2.5 Layers and layer connectors
10.3 Auxiliary constructions and functions
10.3.1 Interval Calculations
10.3.2 Search and retrieval of items
10.3.3 Structures and partitions induced by layers
10.4 The type and constraint system
10.4.1 Constraints
10.4.2 Constraint Relations
10.4.3 Constraint set, domains, and maximality
10.4.4 Types as sets of constraints
10.4.5 Supertypes and subtypes
10.4.6 Using and exploiting the type system
10.5 A type-constraint system for multimodal annotation data
10.5.1 Scales
10.5.2 The layer set and its graph complexity
10.5.3 Intra-Layer Structure
10.5.4 Inter-Layer links between events
10.5.5 Values
10.5.6 Instantiations for relevant annotation file formats
10.6 Conclusion

11 The FiESTA file format and reference implementation
11.1 The FiESTA file format
11.1.1 A FiESTA document
11.1.2 The head
11.1.3 The scale set
11.1.4 The layer set
11.1.5 The item set
11.2 The FiESTA reference implementation
11.2.1 Accessors
11.2.2 Interfaces
11.2.3 Constraints and types
11.2.4 Extended use of FiESTA in the Phoibos corpus manager

12 Evaluation and conclusion
12.1 General considerations
12.2 Conversion, transformation, and merging
12.3 Referencing primary data
12.4 Structure
12.5 Annotation values
12.6 Meta-information
12.7 Conclusion

V Conclusion

13 Conclusion
13.1 Results
13.2 Critical reflection
13.3 Perspectives

Appendices

A The survey on multimodal corpora and annotation tools

Part I

Introduction

1

Introduction

Every step is a first step if it’s a step in the right direction.

TERRY PRATCHETT: I Shall Wear Midnight

1.1 Overview

THIS THESIS INTRODUCES and describes FiESTA, a new data model and library that assists researchers in creating, managing, and analyzing multimodal data collections. In this introductory chapter, we clarify the motivation for this project and, in parallel, give a commented overview of how each chapter contributes to the big picture. The visual roadmap in Figure 1 on the following page accompanies and illustrates this outline.

SECTION 1.2 describes our motivation and contains pointers to the respective chapters of the thesis.

SECTION 1.3 connects this thesis to other publications and projects from the wider context of multimodal corpora and data sets.

Figure 1: Visual roadmap for this thesis. Large, dashed boxes indicate parts, nested solid boxes stand for chapters. The narrative flow is shown as arrow connections between the chapters. Italic texts next to chapters outline the goals or accomplishments of their respective chapter(s). “GEF” (due to the restricted space in the diagram) stands for “generic exchange format”.

SECTION 1.4 introduces some conventions used in this thesis, along with some remarks about mathematical notations.

1.2 Motivation

Multi-modal records allow us not only to approach old research problems in new ways, but also open up entirely new avenues of research.

Wittenburg, Levinson, et al., 2002 : 176

THIS STATEMENT DESCRIBES a central development in linguistics and its neighbouring disciplines in the last decades: The focus of research is no longer on the purely linguistic component of communicative interaction only. Instead, interaction is understood as a complex interplay between linguistic events (typically, spoken utterances) and events in other modalities, such as gesture, gaze, or facial expressions (cf. Kress and Leeuwen, 2001; Knapp, Hall, and Horgan, 2013).

A couple of decades ago, technology could only provide limited support to this branch of research. Microlevel video analysis, for instance, originated in the last century: Back then, researchers used purpose-built film projectors that could play film reels “at a variety of speeds, from very slow to extremely fast, effectively achieving slow motion vision and sound” (Condon and Ogston, 1967 : 227). This served as the basis for detailed, yet hand-written, analyses of interaction on the level of single video frames.

Since then, researchers have benefited from various developments and technological shifts, such as easily available computing facilities and the digitization of video and audio recordings: The fact that media recordings can be digitized means that there is no loss of quality in copies anymore. This is an improvement compared to situations where copies of analog media often were expensive while at the same time being lossy, thus also limiting the number of generations of copies that could be produced (cf. Draxler, 2010 : 11 f.). In addition, year by year, computational power, disk space and storage devices (such as working memory or hard disks) become more affordable (Gray and Graefe, 1997).

In addition, the advent of high-level programming languages (in general, and especially in the scientific context) and the ever-growing supply of modular, reusable programming libraries containing solutions to many problems enabled the community to create annotation tools. These are special pieces of software suited to the needs of researchers in the field of multimodal interaction, such as the EUDICO Linguistic Annotator ELAN (Wittenburg, Brugman, et al., 2006), or Anvil (Kipp, 2001). Both tools support the playback and navigation of video and audio recordings, as a basis for the creation and temporal localisation of additional data. Similarly, for detailed phonological and phonetic analyses of sound files, the tool Praat (Boersma and Weenink, 2013, 2001) was developed.

With these tools and their wide range of possible operations, scientists work on a diverse range of research questions, investigating phenomena such as

– the synchronicity and cross-modal construction of meaning in speech and gesture signals (Lücking et al., 2010; Bergmann and Kopp, 2012; Bergmann, 2012; Lücking et al., 2013),

– the use of speech-accompanying behaviour signalling emotion, and its possible differences in patients and healthy subjects (Jaecks et al., 2012),

– the interaction of speech and actions in object arrangement games, with a focus on the positioning of critical objects in a two-dimensional target space (Schaffranietz et al., 2007; Lücking, Abramov, et al., 2011; Menke, McCrae, and Cimiano, 2013),

– or the multimodal behaviour in negotiation processes concerning object configurations in miniature models of buildings or landscapes (Pitsch et al., 2010; Dierker, Pitsch, and Hermann, 2011; Schnier et al., 2011).

IN ALL DIALOGICAL1 situations investigated in these experiments, interlocutors produced several series of interaction signals over time – such as speech, gestures, facial expressions or manipulations of objects located in the shared space between interlocutors. These streams of interactive patterns are sometimes independent of each other. Often, however, multiple streams are coupled in a single interlocutor (e. g., in speech-accompanying gestures), and in other cases, the streams of different interlocutors are (at least locally) coupled (e. g., in co-constructions of speech, where a fragmentary segment of a linguistic construction is continued or completed by another interlocutor).

Figure 2: Schema of the data generation workflow in the research of multimodal interaction. Left: The different levels of data, and information about how subsequent layers are generated out of prior ones. Right: An example of primary and secondary annotations based on the segment of a recording (containing a speech transcription, an annotation of gesture, a syntactic analysis of the speech, and a secondary annotation expressing a hypothesis about how items on both the speech and gesture layers form a joint semantic unit).

A detailed and thorough analysis of such dialogues typically pursues the following course (cf. Figure 2):

1. First, video and audio recordings of the interaction are created.

2. To simplify work, further references to these recordings (and, indirectly, to the events of the original situation) refer to an abstraction in the shape of a so-called timeline. Points and intervals on this timeline are the only link to the underlying media files, since (under the assumption that all media have been synchronised) every segment in the media files can unambiguously be referenced with such time stamp information.

3. Then, researchers create primary annotations on the basis of these media recordings. This is done by identifying points or intervals on the timeline and associating them with a coded representation (typically, by using text) of the observed phenomenon.

4. In addition, it can be necessary to generate annotations on an additional level, so-called secondary annotations. These do not refer to temporal information directly. Instead, they point to one or multiple annotations (cf. Brugman and Russel, 2004 : 2065). They typically assign a property or category to an annotation, or model a certain kind of relation between two or more annotations.
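The following minimal sketch (our own illustration, not the FiESTA model itself; all class and field names are assumptions chosen for this example) shows how time-anchored primary annotations and reference-based secondary annotations could be represented in code:

from dataclasses import dataclass
from typing import List

@dataclass
class Interval:
    """A segment on the shared timeline, in seconds relative to the synchronised media."""
    start: float
    end: float

@dataclass
class PrimaryAnnotation:
    """A coded (typically textual) value anchored to a point or interval on the timeline."""
    interval: Interval
    value: str    # e.g. a transcribed word or a gesture category
    layer: str    # e.g. "speech" or "gesture"

@dataclass
class SecondaryAnnotation:
    """An annotation that points to other annotations instead of the timeline."""
    targets: List[PrimaryAnnotation]
    value: str    # e.g. a relation label or a category

# Usage: a word and an overlapping gesture stroke, linked by a secondary annotation.
word = PrimaryAnnotation(Interval(12.34, 12.71), "Mexico", layer="speech")
stroke = PrimaryAnnotation(Interval(12.30, 12.80), "iconic", layer="gesture")
ensemble = SecondaryAnnotation([word, stroke], "joint semantic unit")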

“DATA” AND “MODALITY” are two terms of which researchers have an intuitive understanding, but which are often deficiently defined. Therefore, we prepend two chapters to this thesis that attempt to clarify the exact definitions of terms from the two fields of data (Chapter 2) and of modalities (Chapter 3).

While most of the investigations concerning multimodal interaction follow the basic schema described above, its concrete realisations can diverge substantially from project to project. This is mostly due to the fact that different research questions often require idiosyncratic data structures and different descriptive categories (as, for instance, for the description of non-linguistic behaviour).

IN ORDER TO give a more detailed overview of how these data structures can be designed, and how they diverge against the background of varying research questions, descriptions of a sample of multimodal data collections, along with the underlying research questions, are presented in Chapter 4. This is accompanied by an introduction to the graphical user interfaces and the file formats of two annotation tools that were repeatedly used for creating the example data collection: Praat and ELAN (Chapter 5).

FIRST AND FOREMOST, the annotation tools mentioned in the previous section provide a solid basis for the research of multimodal interaction. And yet, as will be shown in the following chapters, there are still areas and specific tasks where these general-purpose tools fail, and where creative, but ad-hoc solutions are implemented. Examples of such problematic tasks are

– the creation of a certain connection inside the annotation structure for which the developers of the tool did not provide a solution,

– the automated creation and seamless integration of an additional layer containing part-of-speech tags (a task that can almost effortlessly be performed when working with text-based corpora),

– a calculation of quantitative relations of interesting patterns in multiple layers,

– or a customizable visualisation of such patterns.

Thus, as a starting point for our investigation we take the following claim that summarises these (and other) issues:

CLAIM 1

Investigators of multimodal interaction need better support in various areas for the collection, analysis, visualisation, exchange, storage, and machine comprehensibility of their corpora.

An analysis of the example data collection reveals first bits of evidence for this claim and tries to complement them with observations and results from the literature. This analysis is given in Chapter 6.

HOWEVER, INFORMATION CONCERNING issues in data generation and analysis is sparse in scientific publications, and our analysis of the example data can only produce hypotheses about potential problems and issues. Therefore, a survey among creators and producers of multimodal data collections was conducted in addition. In particular, this survey examined what kinds of problems researchers faced and which of them impeded their work most, what tasks they needed to perform in order to answer specific research questions, and what features and solutions could assist and support them in doing so. This survey, its design considerations, its realisation, and its evaluation are presented in Chapter 7.

IN SEVERAL OTHER areas with concurring data formats, a solution to such a set of problems was to develop a common exchange file format – one central format that can model the data structures and represent the information contained in all other formats. The advantage of such a central format is that, once data conversion routines between the common format and third-party formats are established, any subsequent task only needs to be implemented once – for the common format. In the past, such exchange (or pivotal) formats have successfully been created in different areas. Also, several exchange formats have been developed and (sometimes) standardised in the field of linguistics, such as the modular schemas for representing different sorts of texts by the Text Encoding Initiative (TEI; TEI Consortium, 2008), the Linguistic Annotation Framework (LAF; Ide, Romary, and Clergerie, 2003; Ide and Romary, 2006), the PAULA format (Potsdamer Austauschformat für linguistische Annotationen; Zeldes, Zipser, and Neumann, 2013), and many more.
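As a rough, purely illustrative sketch (the classes and converter functions below are our own placeholders, not APIs of any of the tools or formats mentioned), the economics of a central exchange format can be stated as follows: every tool format needs only one importer and one exporter for the common format, instead of pairwise converters between all formats.

class Common:
    """Placeholder for the central exchange representation."""
    def __init__(self, layers):
        self.layers = layers

def import_praat(textgrid_text: str) -> Common:
    """Hypothetical importer: tool format -> common format."""
    raise NotImplementedError

def export_elan(document: Common) -> str:
    """Hypothetical exporter: common format -> tool format."""
    raise NotImplementedError

# With n tool formats, direct pairwise conversion needs on the order of
# n * (n - 1) converters; with a central format, 2 * n importers/exporters
# suffice, and analysis, search, or visualisation code is written once,
# against the common format only.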

An obvious choice would be to identify and use an exchange format that is suited for representing complex multimodal data sets. In order to evaluate candidates for such a purpose, we analysed the collected evidence (both from the literature review and the survey), and transformed it into a catalogue of criteria for exchange formats. This catalogue will then be used to evaluate exchange format candidates. The advantages of generic exchange formats and the resulting catalogue of criteria are described in detail in Chapter 8.

AT FIRST GLANCE, many file formats that have successfully been used to represent textual data seem to be promising candidates also for the representation of multimodal interaction data. However, an evaluation of these formats revealed that multimodal corpora have conceptual and structural differences from text-based corpora that make it difficult to apply such a formalism.

One of these problems is that many of the common formats presuppose the existence of one single, flat stream of primary data – typically, a text or a non-overlapping sequence of transcribed utterances. In these approaches, often either tokens are marked in the primary text which can then be referred to from annotations, or locations in the text are described using numeric character offsets. However, this approach cannot express multimodal interaction data in an adequate way. While there are numerous reasons, we present three of them that we consider especially important:

In classical corpus linguistics, the primary data is already present in the shape of the finalised (thus, immutable) text to be annotated. In contrast, in multimodal interaction studies such a text does not exist a priori, but it has to be produced in the form of a transcription. Such a transcription must itself refer to a kind of axis more suited to the situation, that is, a timeline, and optionally, also to spatial coordinates. Approaches that use only character sequences as their primary axis therefore have no adequate way of representing these temporal and spatial coordinates.

A textual primary axis, be it segmented on the word or character level, has a much lower resolution than a timeline. In addition, the axis distorts temporal relations and makes the comparison of durations and distances impossible, because character-based lengths and distances cannot be compared to temporal or spatial ones.

In multimodal interaction, multiple streams of events often co-occur which cannot be flattened into a single sequence without discarding large amounts of important information about their ordering (the sketch below illustrates this contrast).
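To make the contrast concrete, the following small sketch (our own illustration; the data and names are invented and not part of any of the formats discussed) anchors two overlapping event streams on a shared timeline, something a single character-offset axis over a transcript cannot express:

# Two event streams from one dialogue, anchored on a shared timeline (in seconds).
speech = [("right", 1.20, 1.45), ("there", 1.45, 1.90)]
gesture = [("pointing", 1.10, 2.05)]  # a single gesture overlapping both words

def overlaps(a, b):
    """True if the intervals of two (label, start, end) events overlap."""
    return a[1] < b[2] and b[1] < a[2]

print([w[0] for w in speech if overlaps(w, gesture[0])])  # ['right', 'there']

# A character-offset axis over the transcript "right there" can only locate tokens:
tokens = {"right": (0, 5), "there": (6, 11)}
# Offsets say nothing about when events happened, how long they lasted,
# or that the gesture temporally overlapped both words.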

These and other problems have become evident in the literature review as well as the survey. In order to have a reliable assessment of the degree to which existing solutions meet the requirements in the field of multimodal interaction studies, we evaluated known data models, libraries, and file formats that have been proposed or used for the handling of linguistic (and, possibly, multimodal) corpora. The result of the evaluation shows that none of these data models meets a sufficient number of criteria from the catalogue collected earlier. Chapter 9 contains this detailed evaluation of existing solutions.

THE RESULTS FROM this evaluation underpin the second central claim of this work:

CLAIM 2

There is (to our best knowledge) no known solution (in the shape of a theoretical or implemented data model) that meets the important requirements that researchers have when investigating multimodal interaction.

Since the evaluation did not reveal any of the examined solutions as a suitable candidate, the final part of this thesis describes the design and development of FiESTA, a novel data model intended to solve as many of the aforementioned issues as possible. This data model, since it has been designed as a specific match to the criteria catalogue, is expected to provide better solutions to the problems in multimodality research. We exemplify the usefulness of FiESTA by describing implemented and potential improvements of the workflow of scientists, along with an evaluation of the formalism against the criteria catalogue.

The formal specification and documentation of the so-called Format for extensive spatio-temporal annotations (FiESTA) can be found in Chapter 10. Summaries of the XML-based file format and the pilot implementation are given in Chapter 11, and a conclusive evaluation in Chapter 12.

SINCE A SINGLE thesis does not provide enough space for a thorough description and documentation of such an ambitious project, the conclusion evaluates on a meta-level what has been achieved in the thesis, and what further developments and improvements appear promising. This conclusion and the outlook are presented in Chapter 13. In addition, Appendix A presents the questions of the survey in detail.

1.3 Relations to other works or publications

This thesis project is closely related to and embedded within the endeavours of Project X1 “Multimodal Alignment Corpora” within the Collaborative Research Centre (CRC)2 673 “Alignment in Communication”3. X1 provided solutions for both low-level storage and sharing and high-level administration and analysis of a variety of data collections dealing with multimodal dialogues. The models and products of this thesis provided the theoretical and structural basis for several of these solutions. In some cases, early versions of models and implementations have already been used and integrated into services and applications (one of them being the Phoibos corpus manager, which is mentioned in Chapter 12).

A DRAFT OF THE MEXICO MODEL (which is a sister project of FiESTA that aims at representing whole multimodal corpora; MExiCo stands for “Multimodal Experiment Corpora”) for managing multimodal corpora has been summarized and published in Menke and Cimiano (2013).

THE GENERAL IDEA OF A CORPUS MANAGEMENT APPLICATION (and also of the underlying functionality that eventually resulted in the FiESTA and MExiCo libraries) has been outlined in Menke and Mehler (2010) and Menke and Mehler (2011), and plans were to integrate the X1 solutions with the eHumanities Desktop System (Gleim, Waltinger, Mehler, et al., 2009). However, due to personnel reorganisation issues within Project X1, this agenda was abandoned in favor of the development of the current approach.

THE NOTION OF A GENERIC SCALE-BASED APPROACH to modelling multimodal annotations (as described in this thesis) is based on earlier work: on Menke (2007), where a general scale concept for an improved modelling of overlapping discourse segments, based on Stevens’ levels of measurement (Stevens, 1946), is described, and on Menke (2009), where metrics for the calculation of the synchronicity of interval-based annotations, especially for multimodal ensembles4 consisting of speech and gesture parts, are introduced.

FIESTA, while being a central subject of this thesis, is present as a draft in various earlier publications (not necessarily bearing this name, but possibly under its previous working title “ToE”, short for “time-oriented events”), among them Menke and Mehler (2010), Menke and Mehler (2011), and Menke and Cimiano (2012).

PRELIMINARY INVESTIGATIONS TOWARD A MACHINE-READABLE, RDF-BASED ONTOLOGY OF MULTIMODAL DATA UNITS AND PHENOMENA based on the works of Chapter 3 have been discussed at a workshop on multimodality in the context of the LREC 2012 conference, and have been published afterwards in Menke and Cimiano (2012).

THE CHAT GAME CORPUS developed in Project B6 of the CRC 673 has been summarized and described in Menke, McCrae, and Cimiano (2013).

THE ATTEMPT OF DEVELOPING A PROTOTYPE FOR A MULTIMODAL TOOL CHAIN using FiESTA as the central exchange file format is described in Menke, Freigang, et al. (2014).

1.4 Conventions and declarations

1.4.1 Mathematical conventions

MEDIAN AND QUARTILES. We denote the median of a distribution as μ, and the interval between the lower and upper quartile of a distribution as Q1,3.

SIGNIFICANCE. The results of a statistical significance test are called highly significant and marked with ** if p ≤ 0.01, and they are called significant and marked with * if 0.01 < p ≤ 0.05.

BOOLEAN VALUES. 𝔹 is the set of the Boolean values true T and false F.

ACCESS TO SEQUENCE ELEMENTS. The nth element of a sequence s is denoted by s[n].

DOT NOTATION. For structures that have subordinate components, a dot notation inspired by the member access syntax of several programming languages is used. For instance, if b is a book, then b.TITLE retrieves its title, and b.CHAPTERS returns an enumeration of its chapters.

We consider this notation more readable than multiple nested predicate expressions.
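As a purely illustrative example (the book structure below is invented, not part of the formalism), the dot notation corresponds directly to attribute access in common programming languages:

from dataclasses import dataclass
from typing import List

@dataclass
class Book:
    title: str
    chapters: List[str]

b = Book("The FiESTA Data Model", ["Introduction", "Data", "Modalities"])
print(b.title)     # corresponds to b.TITLE in the dot notation
print(b.chapters)  # corresponds to b.CHAPTERS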

1.4.2 Bibliographic conventions

Since this thesis originates from a research project at a German university, we assume that there will not be any need for translations of German quotations. Translations from languages other than English or German were created by us, if not explicitly stated otherwise by the addition of a source of the translation. Highlighting and structure in quoted passages originate from the original authors, if not explicitly stated otherwise.

Pages in the World Wide Web are used in two different functions in this document: as a means for the mere localisation of a resource, and as evidence for an argument.

If a page serves as the entry page or frontdoor page of a product, object, or other resource that the text refers to, then the link to the page is given in a footnote.

If a page contains information that serves as evidence for arguments in the text, then it is inserted as an ordinary citation. Title information is taken from the TITLE element in the HTML header. Author information is obtained from the text body or from the HTML header. If no author could be detected, the citation displays a custom shorthand in the text and as the label in the bibliography, preceded with the ° character, such as °SyncWriter1.

All web resources have been checked for functionality on 26 May 2015, unless we provide a different date in the reference.

1 Throughout this thesis, “dialogue” and “dialogical” explicitly include communicative situations with more than two participants (for which sometimes the term “multilogue” is used).

2 This official English translation of the German term “Sonderforschungsbereich” (SFB) is imprecise; a better version would be “specialised research department”. However, due to the official status of the first term, we will use it (or its abbreviation).

3 http://www.sfb673.org

4 Multimodal ensembles were introduced in Mehler and Lücking (2009) and Lücking (2013), see also below.

Part II

Background: What are multimodal data?

2

Data

MERE ACCUMULATION OF OBSERVATIONAL EVIDENCE IS NOT PROOF.

TERRY PRATCHETT: Hogfather

2.1 Introduction

WHEN TALKING ABOUT solutions for the management and analysis of multimodal data5, it is advisable to have at least a basic agreement on the terms used in this phrase. However, although most researchers from fields where recorded dialogues and communicative situations are analysed have an intuitive agreement upon their terms (especially primary data and secondary data), it is not trivial to find definitions for them that really match their usage.

As a consequence, this chapter clarifies what is to be understood under the superordinate term of data, and then summarises the most prominent readings and definitions of primary and secondary data. It will become apparent that some of them are quite closely related, while others use the same terms although they are conceptually unrelated to the rest. These different levels of relation will be analysed in the conclusion of this chapter.

Figure 3: A fictitious experiment, visualised as a graph, where nodes represent different representations of data, and edges express the relations between them (an edge from one node to another indicates that the creation of the data set at the target was influenced by or based on the data set at the source).

2.2 An examination object

FOR THE REST of this chapter we will refer to the example configuration of a multimodal experiment depicted in Figure 3. This figure contains a schematic overview of the data sets resulting from a typical (yet fictitious) experiment. The original situation and the representations derived from it are given numbers, while the data mapping operations between them are given letters for easier reference. This setup may seem artificial, yet it contains several relations and data types that will help in understanding the differences between the variations in data nomenclature. The representations are:

 THE ORIGINAL SITUATION. This is the sequence of communicative events in reality (which, as will later be shown, is volatile, and has to be recorded and documented).

 DIRECT VIDEO AND AUDIO RECORDINGS. These are conventional, immediate recordings of the visual impression and the sound that occurred in the original situation. They are stored using discrete frames or samples, which, when played, create an impression of moving pictures and continuous sound.

 A LIVE TRANSCRIPT. This is a log created by an observer who was present in the real communicative situation and wrote down immediately what was said in the dialogue, and by whom.

 A CONVERTED VIDEO FILE. This is a video file that has been automatically converted from . This could be a video file that is reduced in file size, image resolution, or that has been compressed using a video codec.

 GESTURE ANNOTATION. Based on audio and video recordings, specially trained annotators create time-based markings and classifications of the gestures produced by interlocutors in the original situation.

SPEECH TRANSCRIPT. Based mainly on the audio recordings, transcriptions of the speech (to be more exact, of utterances and words) of the interlocutors in the original situation are created. These transcriptions are created with special software that allows for the marking of temporal position and duration of the utterances and words.

 PART-OF-SPEECH ANNOTATION, AUTOMATED. A piece of software takes speech transcripts as input and assigns part-of-speech information to the units on the word level, according to the algorithm implemented, and the underlying data sets (these can be lexica, dictionaries, corpora, etc.).

PART-OF-SPEECH ANNOTATION, MANUAL. An annotator investigates the units on the word level found in the speech transcripts, and assigns part-of-speech information to them, based on his grammatical competence, and material he has at his disposal (again, lexica, dictionaries, etc.).

aspect            | Kertész & Rákosi (2012)              | Lehmann (2005)
representation    | statement                            | representation
validity          | with a positive plausibility value    | which is taken for granted
based on original | originating from some direct source   | of the epistemic object of some empirical research

Table 1: Comparison of elements in definitions of the concept “data”.

 UNIFIED SPEECH TRANSCRIPT. The data contained in the live transcript and the recording-based speech transcript are analysed and combined into a single resource, in order to reduce errors and disambiguate missing or disputable portions.

2.3 Data

THERE ARE SEVERAL definitions and interpretations of data. Although they are all related and refer to similar things, they still differ slightly, depending on disciplines involved and purposes pursued.

According to Kertész and Rákosi (2012 : 169), “[a] datum is a statement with a positive plausibility value originating from some direct source”. Christian Lehmann defines the term in a similar way. For him, “[a] datum is a representation_i of an aspect of the epistemic object of some empirical research which_i is taken for granted” (Lehmann, 2005 : 182). In these two definitions, three relevant components can be identified (cf. Table 1):

Data are representations of the objects of study. This also includes that a datum is tangible, that is, it is materially manifested and can be accessed and used as the basis for an analysis.

The objects of study (entities, their properties and relations interesting to researchers) serve as a basis for the creation of data.

People usually agree that these representations have a certain quality that makes them usable as a basis for a scientific argumentation, proofs, and similar things (often this quality is called validity, cf. Menke, 2012 : 288 f.).

There are, however, some minor issues with both definitions. First, we consider “statement” too narrow a concept to model several of the types of data multimodal research deals with, especially raw signal data in audio or video recordings. They are not statements, because they do not have any propositional content, at least not without an additional level of interpretation. Also, for an object to serve as a valid datum, it needs more than just any positive plausibility value. Plausibility should be high enough that one can rely on it in order to draw valid conclusions. There is, however, no absolute threshold; it depends on the situation. At least, the plausibility value should be significantly higher than a chance value or baseline (cf. Menke, 2012 : 304 f.).

A working definition based on the two definitions cited above could be:

DEFINITION 1

A scientific datum is a valid and processable representation of an object of study, or of one or more of its aspects or properties.

In this definition, “valid and processable” mean the following: The validity value must be high enough to minimize errors and doubts, and data must be in a state such that the chosen measurements and operations can be applied. This presupposes that data is tangible and durable. As a consequence, actions and events, being immaterial (thus, non-tangible) entities, are not considered data according to this definition.

On this basis, some additional aspects and properties of data will be summarized and discussed in the following subsections.

2.3.1 Data are produced by mappings

According to Stachowiak (1965, 1989), data sets are the result of a modelling process – that is, representations of an original which come into existence by a mapping operation. In a good model, relations and properties of the original are reflected in corresponding relations and properties of the representation. If this holds for all relations and properties under consideration, we are dealing with isomorphic mappings.

Stachowiak’s model concept is rather general; it covers scientific modelling processes as well as examples from completely different fields, e. g., photographs of real situations (Stachowiak, 1965 : 439), or globes and spheres as models of planets (Stachowiak, 1965 : 444). Scientific models are one special case of models for which he postulates some additional, special properties (Stachowiak, 1989 : 219). Here, objects of study (which are interesting to a certain scientific field) are mapped to representations that make it possible to apply research methods. Lehmann (2005) describes how such a configuration looks in the area of dendrochronology:

The data on which dendrochronology builds its theories are series of numbers, each of which represents the width of an annual ring of some tree and is associated with one in a series of years. The cross-sections of the tree may be stored somewhere for measurements, because they constitute the ultimate basis of reference for certain relevant observations. The data, however, are those series of numbers insofar as they represent facts about these objects.

Lehmann, 2005 : 179

In this example, the cross-sections of trees are the objects of study. They serve as originals for the mapping operation that assigns numbers to them which express distances and lengths. These series of numbers are the data that are used in analyses and evaluations.

2.3.2 Communicative events are transient

Disciplines such as dendrochronology or archaeology regularly deal with physical, material objects of study (tree sections, shards, bones) which exist or existed in the real world, and, thus, are rather easily accessible. Lyons (1977 : 442) calls these first-order entities. However, other disciplines do not have this opportunity. Historians and linguists both are interested in events (Lyons, 1977 : 443: second-order entities) rather than objects (cf. Lehmann, 2005 : 180). In particular, for the area of empirical linguistics, a typical object of study is the set of events that occurs in a dialogue or another communicative situation. Such events cannot serve as direct material because they do not exist in a physical way – they occur rather than exist, and they are volatile, meaning that they get lost immediately after they occur (cf. Lehmann, 2005; Allwood, 2008; Menke and Mehler, 2010). However, their physical manifestation can be preserved (for instance, in video and audio recordings, or in live transcripts, as in the example configuration). As a consequence, this means that material in these areas does not consist of the objects of study themselves, but of recordings of them (cf. Lehmann, 2005 : 179 f.). These recordings then form the most important resource conserving the events from reality. Figure 4 shows an adapted version of Figure 3, where the original situations are not included anymore. The remaining subset displays those entities that are tangible and, therefore, count among actual data sets. From now on, we will follow Figure 4 and cease to categorise the original situation as a data set.

Figure 4: The fictitious experimental setup from Figure 3, at a time point after completion of the dialogue situation.

2.3.3 Complex mappings

The creation of data from an object of study can consist of a single, simple operation, such as a direct measurement of distances in the cross-section of a tree. Regularly, however, more than one operation needs to be applied for the creation of a representation that can be used for obtaining a final result. In this case, we deal with a more complex configuration of mappings. This could be a sequence of mappings, where the result of a mapping step serves as an original in the following step (cf. Stachowiak, 1965 :438). For instance, researchers could decide to first take a picture of the cross-section to preserve spatial relations (after all, the cross-section could break, dry or wither over time and therefore the proportions of its annual rings could be altered), and then perform the distance measurements on the photograph rather than the original, resulting in the sequence

cross-section ↦ photograph ↦ measured distances

In our example configuration, we can find different sequences manifested as paths in the graph. For instance, the original situation is recorded as an audio signal, which then serves as an original for the creation of a derived representation in the shape of a speech transcript.

In other cases, however, more complex, graph-like structures of representations can be the result. The example configuration contains several examples of graph structures more complex than linear sequences. The original dialogue situation serves as an original for three modelling processes: the creation of both video and audio recordings, and the creation of the live transcript. On the other hand, the unified transcript is based on two originals: the live transcript and the recording-based transcript.

Authors agree that there is no such thing as the one, single, correct model of an original, and this holds also for the special cases of scientific models. Stachowiak claims that models “usually represent their originals only for specific subjects [. . . ] (the users of the model), during specific time intervals [. . . ] and with restrictions to specific (mental or physical) operations”6. Similarly, according to Lehmann, “[n]othing is, in and of itself, a datum; instead, it is a datum for somebody (or for a scientific community) in some perspective” (Lehmann, 2005 : 181).

In other words, there is no single, correct way of creating a scientific data representation for an original of a certain type. In a scientific context, the identity of the creator and the underlying theories, assumptions, and interpretations that guide the design of the operations used for creating a representation are especially important. For an example, see Figure 5. It contains a series of representations of an utterance, consisting of the single word “Mexico”. Each of these representations is useful in a different context, and to different groups of researchers investigating different questions: While phoneticians can pose and investigate research questions solely by looking at and evaluating, e. g., pitch patterns or formant progression, conversation analysts might be interested in word-based speech transcripts only.

Figure 5: Different representations of an original, consisting of the utterance of the word “Mexico”: (a) an oscillogram; (b) a sonagram; visualisations of (c) pitch and (d) formants; time-aligned textual transcriptions of (e) phonemes, (f) syllables, and (g) words.

2.3.4 Provenance of mappings

When we look at the two part-of-speech annotations it becomes apparent that both represent the same aspect of the linguistic events, namely, part-of-speech information of the words uttered during the dialogue. The example does not state whether these two representations are equal, similar, or whether they diverge. They could be identical, but there is another important aspect to them that is not related to some quality inherent to the representation itself: It is the genesis and provenance of the two representations – it is important who created the data, what techniques, algorithms, and methods he or she used, and on which theories, assumptions, and prerequisites he or she based the process.

While the automated and the manual annotation share their structure and appearance, the former was created by the implementation of an algorithm that may or may not have based its work on actual grammar information, while the latter is the product of a human annotator who selected values according to his or her interpretation of the material against the background of his or her grammatical competence. This difference can cause large, systematic differences in the data. Thus, in order to fully interpret a data set, often information about its provenance is required.
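The following minimal sketch (our own illustration; class names, fields, and the provenance strings are assumptions, not part of the formalism discussed later in this thesis) shows how representations and the mappings between them could be recorded as a small provenance graph, so that the genesis of a data set remains reconstructable:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Representation:
    """A node in the data-generation graph: one data set."""
    name: str
    symbolic: bool                                   # raw (False) vs symbolic (True) coding
    based_on: List["Representation"] = field(default_factory=list)
    provenance: str = ""                             # who or what produced it, and how

audio = Representation("audio recording", symbolic=False,
                       provenance="recorded live during the experiment")
transcript = Representation("speech transcript", symbolic=True, based_on=[audio],
                            provenance="manual transcription by a trained annotator")
pos_auto = Representation("POS annotation (automatic)", symbolic=True, based_on=[transcript],
                          provenance="tagging software")
pos_manual = Representation("POS annotation (manual)", symbolic=True, based_on=[transcript],
                            provenance="human annotator")

def originals(r: Representation) -> List[Representation]:
    """Walk back to the representations that are not based on any other representation."""
    if not r.based_on:
        return [r]
    found = []
    for parent in r.based_on:
        found.extend(originals(parent))
    return found

# Both POS annotations look alike, but their provenance fields differ,
# and both trace back to the same original recording:
print([o.name for o in originals(pos_auto)])    # ['audio recording']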

2.3.5 Conclusion

Up to now, the following properties of data sets have been assembled:

Data sets are processable representations of the thing to be analysed.

Data can be the result of successive and sometimes very complex processes involving several originals.

A data set often is not analysed, interpreted, and stored in isolation, but rather with the context of its creation in mind. In other words, the related provenance information often should be documented as well.

On this basis we will now try to distinguish different kinds of data sets. An audio recording, for instance, is different from a speech transcript in many ways. Many people would call the audio recording a primary datum, and the transcript a secondary datum. However, the criteria for this distinction are not always clearly defined. In the next sections we will enumerate items in a category system that can help distinguish different kinds of data on a clearly defined basis.

2.4 Primary and secondary data

In the previous section we established that the data generation process can be interpreted as a directed acyclic graph, with the single representations as its vertices. This concept of data will now help us to understand the different variants of the concepts primary data and secondary data. Some of their aspects can be related to properties of the representations themselves, while for others the position in the data generation graph, and the surroundings of a representation determine whether it counts as primary or secondary.

There is a particular use of the terms primary data and secondary data in the fields of economics, sociology and related areas (for instance, in Aaker, 2011 and numerous other works) which we will not adopt in this thesis. In it, data is categorized as primary or secondary depending on whether it had explicitly been created for a study, or whether it had been re-used (see the following remark).

In economics or sociology, authors regularly classify data as primary or secondary based on two criteria: A datum is considered primary for a specific investigation by a specific user if either the creator is identical to this user, or the datum was collected for the purpose of this investigation. Otherwise, data is classified as secondary. In this reading, a datum is not primary or secondary by its properties. Instead, the same data set can sometimes be primary and sometimes secondary (depending on whether creator and user are identical, or whether current and original use differ). Aaker (2011) gives definitions in which he mixes both readings, switching from the usage criterion (p. 77) to the collector criterion (p. 92), and, on the very next page, back again. Obviously, these definitions are problematic under a strict interpretation. For instance, a data set counts as secondary if the author of the study did not conduct the actual experimentation, but his uncredited assistant did. Better yet – as soon as that assistant is credited, the data set, by definition, mutates into a primary one.

The reading that is relevant for our project separates data into the categories of primary and secondary data depending on how closely related they are to the original events with respect to the mapping processes during data collection or production. There are several variants of this overarching reading, and their differences are mostly subtle. As a result, one and the same object that is categorized in one theory as a primary datum counts as a secondary (or even tertiary) datum in others. Nevertheless, for all of them it holds that primary data comes before secondary data in the mapping process (and, if tertiary data is mentioned, primary and secondary data come before tertiary data). Some of the most influential theories and definitions are:

– Lehmann (2005), a seminal publication on the nature and properties of data in linguistics,

– Brinker and Sager (2010), and other works based thereon, for instance, Kowal and O’Connell (2003), which describe the view on data sets from the perspective of Gesprächsanalyse,

– readings from the areas of corpus linguistics and its neighbouring subdisciplines,

– and a reading from the field of speech databases.

We will exemplify the differences on the basis of the fictitious data collection procedure from Figure 3, which was already used in the previous sections.

2.4.1 Lehmann’s semiotic approach

Christian Lehmann proposes “a semiotic conception of data” (Lehmann, 2005 : 175), where

[k]inds of data are distinguished by their ontological status, degree of abstractness, the type of sign representing them and their originality (Lehmann, 2005 : 175).

In particular, Lehmann defines three categories of binary distinctions of data:

– Original and derived representations.

– Raw/non-symbolic and symbolic representations.

– Primary and secondary data.

While the first two are relevant to this examination, the third one, although it actually mentions the terms to be defined, is interestingly only of marginal relevance, as we will illustrate later.

Original and derived representations. Original data have been created directly from the ultimate substrate7; they are “not based on another representation” (Lehmann, 2005 : 183). Derived representations, on the contrary, are based on at least one other representation.

In our example case, the direct video and audio recordings and the live transcript are original data in this sense because they are based on the original situation only. All other representations below them are derived representations in Lehmann’s nomenclature.

This distinction takes into account the provenance and context of a data set, but none of their actual, intrinsic characteristics. It is not possible just by looking at an audio recording (that is, by looking at the data itself, not at any descriptive metadata) to determine whether the recording had been obtained by a recording of a sound source from reality or by a conversion of another media recording.

At least in the case of this example, the reading of original and derived data is rather close to how researchers often implicitly distinguish their understandings of primary and secondary data: Primary data in this reading is strongly connected to Lehmann’s original data in the sense that it is being recorded live during experimentation, thus not being based on any other representation.

The difference, however, is also strongly related to another distinction of data: In a psycholinguistic experimental setup, typically original data have the form of raw data (mostly as video or audio recordings), while derived data often include symbolic data, such as transcriptions or annotations.

Raw and symbolic representations. Typical cases of raw data are audio or video recordings, which, according to Lehmann, are non-symbolic, no matter whether the recording is analog or digital (cf. Lehmann, 2005 : 182). In contrast to this, Lehmann locates symbolic representations at some later stage in the graph of representations: “Higher level linguistic analysis of any data commonly presupposes their symbolic representation.” (Lehmann, 2005 : 182) In our example, the video and audio recordings and the converted video file are raw (or non-symbolic) data, while all others are symbolic.

Lehmann’s two-dimensional distinction is a possible explanation of why authors regularly confuse the four related terms: because it is the default case that recordings (being raw) are used as original data, while transcriptions and annotations (being symbolic) form derived data. Then in a default situation, it is often also the case that no other raw representations apart from the original recordings are created. This results in a setup where the set of original representations O equals the set of raw representations R, while the set of derived representations D equals the set of symbolic representations S. We assume that this frequent overlap of representation types can lead to the aforementioned confusion in the usage of these terms, namely, that authors tend to put raw data on a level with original data, and symbolic with derived data, ignoring the fact that a data set could also be raw and derived, or symbolic and original. The following enumeration describes two instances of such data sets:

– In a strict interpretation of Lehmann, the generation of a video recording out of another (already existing) video recording (as in the converted video file of the example) yields a raw yet derived representation: It uses a non-symbolic coding system while being based not on the real situation, but on another representation.

– A live transcript (as in the example configuration) is a symbolic yet original representation: It is based on the real situation only, but codes its data using a symbolic coding system.

Another important characteristic is that the property of being raw or symbolic is intrinsic to the data sets themselves: An item is either raw or symbolic by its very nature, not by its position in an experimental representation structure.

Primary and secondary data. Finally, Lehmann makes a difference between primary and secondary data. To him, primary data are “representations of specific speech events with their spatio-temporal coordinates, i.e., of objects with a historical identity” (Lehmann, 2005 : 184). In contrast to this, secondary data are not as clearly defined as primary data, they are merely “more abstract in some respect” (Lehmann, 2005 : 184). Instances of secondary data are system sentences, example sentences or metadata descriptions (cf. Lehmann, 2005 : 184 f.).

This insufficient definition of what secondary data is supposed to be is problematic. In particular, it does not clarify what exactly is meant by the spatio-temporal identity. We argue that time-aligned transcriptions have this identity just as media recordings do, since the principle of time-alignment in transcriptions or annotations is, in essence, the same as that used in video and audio files. As a consequence, transcriptions (if they have time information) have a historical identity as much as video and audio recordings do. We can conclude that all time-aligned transcriptions and annotations also count as primary data according to Lehmann’s definition. However, it remains questionable whether this really is what the author intended.

TO CONCLUDE, LEHMANN proposes an inventory of highly useful categories, especially the two independent dimensions of original vs. derived, and of raw vs. symbolic representations. It is important for a thorough conception to understand that these two dimensions are freely combinable, and there are not only the two prominent combinations of original-plus-raw, and derived-plus-symbolic representations. The third dimension of primary and secondary representations (as the terms are used in his theory), though, is not quite helpful for our understanding, since it seems to be the case that all representations of experimental situations in our field would pass as primary data, since they all have a spatio-temporal identity. It is therefore advisable to look for additional theories and conceptions that can complement Lehmann’s categories in order to achieve a full and correct classification of the data sets from the example set in Figure 4.

2.4.2 Primary and secondary data in Gesprächsanalyse

Gesprächsanalyse (GA) is a linguistic approach that focuses on qualitative-inductive analyses of natural, authentic communication. While being related to ethnomethodology and Conversation Analysis (CA), it is especially prevalent among German-speaking researchers and has some distinctive characteristics of its own.

In their introduction to GA, Brinker and Sager (2010) also categorise data based on the degree of abstraction away from original events, with a clear foundation in Stachowiak’s model concept (Stachowiak, 1989), and they also make use of the terms primary and secondary data. However, a crucial difference is that they consider original dialogical events a data set in its own right: “Die ursprünglichen Daten sind in unserem Fall die realen Gespräche der Alltagswelt” (Brinker and Sager, 2010 : 34). As a consequence, events from reality are called primary data, and all other levels of abstraction are shifted by one position: Data sets that would count as primary data in Lehmann’s reading (such as written logs and tape or video recordings) are called secondary data here (Brinker and Sager, 2010 : 34 f.). The level of subsequent representations (such as transcriptions) shifts from secondary to tertiary data: “Wir müssen also noch die Herstellung von Transkriptionen als Tertiärdaten vor die eigentliche Analyse einschieben” (Brinker and Sager, 2010 : 35).

Although this model is clear and accurate, it is incompatible with the fact that the events from reality, according to Definition 1, are not data sets (because they are not tangible representations). Also, it bears the fundamental risk of confusion of terms due to the fact that all designations have been shifted by one level.

2.4.3 Approaches from text and corpus linguistics

In corpus linguistics, a narrow interpretation of primary data