Multicore DSP - Naim Dahnoun - E-Book

Description

The only book to offer special coverage of the fundamentals of multicore DSP for implementation on the TMS320C66x SoC

This unique book provides readers with an understanding of the TMS320C66x SoC as well as its constraints. It offers critical analysis of each element, which not only broadens readers’ knowledge of the subject but also helps them understand how these elements work together.

Written by Naim Dahnoun, winner of Texas Instruments’ First DSP Educator Award, the book teaches readers how to use the development tools and how to get the maximum performance and functionality from this processor. Its content spans architecture, development tools, programming models such as OpenCL and OpenMP, and debugging tools. It also covers various multicore audio and image applications in detail. Additionally, the book is supplemented with:

  • A rich set of tested laboratory exercises and solutions
  • Source code for audio and image processing applications, for use with Code Composer Studio (the integrated development environment from Texas Instruments)
  • Multiple tables and illustrations

With no other book on the market covering the subject, and with rich content spanning twenty chapters, Multicore DSP: From Algorithms to Real-time Implementation on the TMS320C66x SoC is a rare and much-needed source of information for undergraduates and postgraduates in the field, allowing them to get real-time applications working in a relatively short period of time. It is also beneficial to hardware and software engineers involved in programming real-time embedded systems.

Number of pages: 564

Year of publication: 2017




Table of Contents

Cover

Title Page

Preface

Acknowledgements

Foreword

About the Companion Website

1 Introduction to DSP

1.1 Introduction

1.2 Multicore processors

1.3 Key applications of high‐performance multicore devices

1.4 FPGAs, Multicore DSPs, GPUs and Multicore CPUs

1.5 Challenges faced for programming a multicore processor

1.6 Texas Instruments DSP roadmap

1.7 Conclusion

References

2 The TMS320C66x architecture overview

2.1 Overview

2.2 The CPU

2.3 Single instruction, multiple data (SIMD) instructions

2.4 The KeyStone memory

2.5 Peripherals

2.6 Conclusion

References

3 Software development tools and the TMS320C6678 EVM

3.1 Introduction

3.2 Software development tools

3.3 Hardware development tools

3.4 Laboratory experiments based on the C6678 EVM: introduction to Code Composer Studio (CCS)

3.5 Loading different applications to different cores

3.6 Conclusion

References

4 Numerical issues

4.1 Introduction

4.2 Fixed‐ and floating‐point representations

4.3 Dynamic range and accuracy

4.4 Laboratory exercise

4.5 Conclusion

References

5 Software optimisation

5.1 Introduction

5.2 Hindrance to software scalability for a multicore processor

5.3 Single‐core code optimisation procedure

5.4 Interfacing C with intrinsics, linear assembly and assembly

5.5 Assembly optimisation

5.6 Software pipelining

5.7 Linear assembly

5.8 Avoiding memory banks

5.9 Optimisation using the tools

5.10 Laboratory experiments

5.11 Conclusion

References

6 The TMS320C66x interrupts

6.1 Introduction

6.2 The interrupt controller

6.3 Laboratory experiment

6.4 Conclusion

References

7 Real‐time operating system: TI‐RTOS

7.1 Introduction

7.2 TI‐RTOS

7.3 Real‐time scheduling

7.4 Dynamic memory management

7.5 Laboratory experiments

7.6 Conclusion

References

8 Enhanced Direct Memory Access (EDMA3) controller

8.1 Introduction

8.2 Type of DMAs available

8.3 EDMA controllers architecture

8.4 Parameter RAM (PaRAM)

8.5 Transfer synchronisation dimensions

8.6 Simple EDMA transfer

8.7 Chaining EDMA transfers

8.8 Linked EDMAs

8.9 Laboratory experiments

8.10 Conclusion

References

9 Inter‐Processor Communication (IPC)

9.1 Introduction

9.2 Texas Instruments IPC

9.3 Notify module

9.4 MessageQ

9.5 ListMP module

9.6 GateMP module

9.7 Multi‐processor Memory Allocation: HeapBufMP, HeapMemMP and HeapMultiBufMP

9.8 Transport mechanisms for the IPC

9.9 Laboratory experiments with KeyStone I

9.10 Laboratory experiments with KeyStone II

9.11 Conclusion

References

10 Single and multicore debugging

10.1 Introduction

10.2 Software and hardware debugging

10.3 Debug architecture

10.4 Advanced Event Triggering

10.5 Unified Instrumentation Architecture

10.6 Debugging with the System Analyzer tools

10.7 Instrumentation with TI‐RTOS and CCS

10.8 Laboratory sessions

10.9 Conclusion

References

11 Bootloader for KeyStone I and KeyStone II

11.1 Introduction

11.2 How to start the boot process

11.3 The boot process

11.4 ROM Bootloader (RBL)

11.5 Boot process

11.6 Laboratory experiment 1

11.7 Laboratory experiment 2

11.8 TFTP boot with a host‐mounted Network File System (NFS) server – NFS booting

11.9 Conclusion

References

12 Introduction to OpenMP

12.1 Introduction to OpenMP

12.2 Directive formats

12.3 Forking region

12.4 Work‐sharing constructs

12.5 Environment variables and library functions

12.6 Synchronisation constructs

12.7 OpenMP accelerator model

12.8 Laboratory experiments

12.9 Conclusion

References

13 Introduction to OpenCL for the KeyStone II

13.1 Introduction

13.2 Operation of OpenCL

13.3 Command queue

13.4 Kernel declaration

13.5 How do the kernels access data?

13.6 OpenCL memory model for the KeyStone

13.7 Synchronisation

13.8 Basic debugging profiling

13.9 OpenMP dispatch from OpenCL

13.10 Building the OpenCL project

13.11 Laboratory experiments

13.12 Conclusion

References

14 Multicore Navigator

14.1 Introduction

14.2 Navigator architecture

14.3 Complete functionality of the Navigator

14.4 Laboratory experiment

14.5 Conclusion

References

15 FIR filter implementation

15.1 Introduction

15.2 Properties of an FIR filter

15.3 Design procedure

15.4 Laboratory experiments

15.5 Conclusion

References

16 IIR filter implementation

16.1 Introduction

16.2 Design procedure

16.3 Coefficients calculation

16.4 IIR filter implementation

16.5 Laboratory experiment

16.6 Conclusion

Reference

17 Adaptive filter implementation

17.1 Introduction

17.2 Mean square error

17.3 Least mean square

17.4 Implementation of an adaptive filter using the LMS algorithm

17.5 Implementation using linear assembly

17.6 Implementation in C language with compiler switches

17.7 Laboratory experiment

17.8 Conclusion

References

18 FFT implementation

18.1 Introduction

18.2 FFT algorithm

18.3 FFT implementation

18.4 Laboratory experiment

18.5 Conclusion

References

19 Hough transform

19.1 Introduction

19.2 Theory

19.3 Limits of r and θ

19.4 Hough transform implementation

19.5 Laboratory experiment

19.6 Conclusion

References

20 Stereo vision implementation

20.1 Introduction

20.2 Algorithm for performing depth calculation

20.3 Cost functions

20.4 Implementation

20.5 Conclusion

References

Index

End User License Agreement

List of Tables

Chapter 01

Table 1.1 Top 10 supercomputers, November 2016 [9]

Table 1.2 Pros and cons of multicore SoCs and FPGA SoCs [12]

Table 1.3 Main TI family of embedded processors

Chapter 02

Table 2.1 KeyStone I family

Table 2.2 KeyStone II family

Table 2.3 Possible 40‐/64‐bit register pair combinations

Table 2.4 Possible 128‐bit register pair combinations

Table 2.5 Fixed‐point multiplications per unit

Table 2.6 Floating‐point multiplications per unit

Table 2.7 SIMD examples

Table 2.8 TMS320C66x control registers [5]

Table 2.9 Local L2 memory for all TMS320C6678 cores

Table 2.10 Timer modes

Chapter 03

Table 3.1 Common compiler options

Table 3.2 Common assembler options

Table 3.3 Frequently used options for the linker

Chapter 04

Table 4.1 Examples of fixed‐ and floating‐point applications

Table 4.2 4‐bit unsigned integer numbers

Table 4.3 4‐bit signed integer numbers

Table 4.4 4‐bit fractional numbers

Table 4.5 Special numbers

Table 4.6 Numerical format used for the KeyStone

Chapter 05

Table 5.1 Optimisation levels of the optimising compiler

Table 5.2 Parser and optimiser options summary

Table 5.3 Registers use

Table 5.4 Iteration interval table for an FIR filter

Table 5.5 Iteration interval table for an FIR filter

Table 5.6 Iteration interval table for an FIR filter

Table 5.7 Different sections of the code

Table 5.8 Resource allocation

Table 5.9 Register allocation

Table 5.10 Scheduling table

Table 5.11 Scheduling table

Table 5.12 Scheduling table

Chapter 06

Table 6.1 Interrupt sources and priority

Table 6.2 CIC0 event inputs (secondary interrupts for TMS320C66x CorePacs) [1]

Table 6.3 Memory location of the CIC0 and CIC1 for the TMS320C6678 [1]

Table 6.4 CIC register offsets

Chapter 07

Table 7.1 Swi APIs

Table 7.2 Comparison of thread characteristics for the KeyStone devices [1]

Chapter 08

Table 8.1 Events associated with the EDMA3CC0 for the TMS320C6678

Table 8.2 Events associated with the EDMA3CC1 for the TMS320C6678 [2]

Table 8.3 Events associated with the EDMA3CC2 for the TMS320C6678 [3]

Chapter 09

Table 9.1 Modules used by the IPC [1]

Table 9.2 Main MessageQ functions

Table 9.3 Functions for the ListMP module

Table 9.4 Local and remote protection levels

Table 9.5 Example showing how to use GateMP

Chapter 10

Table 10.1 Trace actions description

Table 10.2 xdc.runtime package

Table 10.3 Different categories available

Table 10.4 RTOS Analyzer functions

Table 10.5 System Analyzer commands

Table 10.6 Comparison between a few debugging techniques

Chapter 11

Table 11.1 Reset types for the KeyStone I [3]

Table 11.2 Reset types for the KeyStone II [1]

Table 11.3 Reset types summary for both KeyStone I and II

Table 11.4 Boot Parameter Table Common Parameters (TMS320C6678) [3]

Table 11.5 EMIF16 boot mode parameter table (TMS320C6678) [3]

Table 11.6 SPI device configuration bit fields

Table 11.7 SPI device configuration field descriptions (TMS320C6678) [3]

Table 11.8 EMIF16 boot parameter table common parameters (66AK2H14/12/06)

Table 11.9 EMIF16 boot parameter table (TMS320C66AK2H14/12/06)

Table 11.10 SPI boot parameter table (TMS320C66AK2H14/12/06) [1]

Table 11.11 Bootloader Section in L2 SRAM

Table 11.12 Boot mode for the C6678 EVM [7]

Table 11.13 Boot mode switches

Table 11.14 Steps required for booting the EVM

Table 11.15 KeyStone II boot examples [11]

Chapter 12

Table 12.1 Clauses and descriptions

Table 12.2 Pragmas where the clauses can be used

Table 12.3 Environment variables used

Chapter 13

Table 13.1 Some device information

Table 13.2 Different levels for the callback function

Table 13.3 getProfilingInfo parameters

Chapter 14

Table 14.1 PKDMA channel map

Table 14.2 Rx DMA channel Config region registers

Table 14.3 Host packet descriptor layout

Table 14.4 Host buffer descriptor layout

Table 14.5 Description memory setup region registers [10]

Table 14.6 Peek registers

Table 14.7 Queue map for the KeyStone I

Table 14.8 Queue map for the KeyStone II

Chapter 15

Table 15.1 Conditions for linear phase

Table 15.2 Window features

Table 15.3 FIR coefficients converted into Q15 format

Chapter 18

Table 18.1 DFT calculation for every frequency bin

Chapter 20

Table 20.1 Most common cost functions

List of Illustrations

Chapter 01

Figure 1.1 The impact of the serial code that cannot be parallelised on the performance.

Figure 1.2 Amdahl’s law [7].

Figure 1.3 The inter‐processor communication effect.

Figure 1.4 Example where three cores can perform the task required by the parallel code.

Figure 1.5 Example where cores are processing different algorithms.

Figure 1.6 Example when serial code and parallel code are processed simultaneously.

Figure 1.7 Texas Instruments DSPs.

Figure 1.8 Performance comparison.

Figure 1.9 Texas Instruments DSP roadmap (courtesy Texas Instruments).

Chapter 02

Figure 2.1 Texas Instruments (TI) digital signal processor (DSP) roadmap.

Figure 2.2 KeyStone I architecture [3].

Figure 2.3 KeyStone II architecture [4].

Figure 2.4 TMS320C66x CPU block diagram.

Figure 2.5 TMS320C66x CPU data path and control.

Figure 2.6 Address cross paths.

Figure 2.7 Instructions completing in the same cycle.

Figure 2.8 Four‐way SIMD operation.

Figure 2.9 Simplified memory structure for KeyStone.

Figure 2.10 Memory structure, including the MPAX for KeyStone.

Figure 2.11 Example of cores accessing their local or other local memories.

Figure 2.12 Example showing the use of MPAX.

Figure 2.13 Memory topology for the TMS320C6678.

Chapter 03

Figure 3.1 Hardware and software development tools.

Figure 3.2 Texas Instruments’ software ecosystem [3].

Figure 3.3 Basic development tools.

Figure 3.4 Excerpt of command file for the TMS320C6678 EVM (C6678.cmd).

Figure 3.5 RTSC tools [6].

Figure 3.6 Entering a linker command file.

Figure 3.7 Platform selection.

Figure 3.8 Creating a new platform.

Figure 3.9 Selecting the device family and device name for the new platform.

Figure 3.10 Device page.

Figure 3.11 How to modify the new platform.

Figure 3.12 Output when a successful platform is generated.

Figure 3.13 Selecting the new platform for the project.

Figure 3.14 Multicore Software Development Kit (MCSDK) [9, 10].

Figure 3.15 The TMS320C6678 and the KeyStone II EVMs. (a) TMS320C6678 EVM without and with an emulator; (b) KeyStone II EVM without and with an emulator.

Figure 3.16 EVM layout. (a) TMS320C6678L [14]; (b) KeyStone II [15].

Figure 3.17 Code Composer Studio (CCS).

Figure 3.18 The TMS320C6678 EVM.

Figure 3.19 CCS download page.

Figure 3.20 Registration with myTI.

Figure 3.21 CCS download.

Figure 3.22 CCS starting window.

Figure 3.23 Selecting a workspace location.

Figure 3.24 Lab1 basic project settings.

Figure 3.25 Lab1 advanced project settings.

Figure 3.26 Default view.

Figure 3.27 Perspective selector.

Figure 3.28 Naming a target configuration.

Figure 3.29 Selecting the appropriate emulator for the target configuration.

Figure 3.30 Selecting the appropriate target configuration.

Figure 3.31 Selecting the Project Explorer.

Figure 3.32 Building a project.

Figure 3.33 Building and debugging.

Figure 3.34 Changing the configuration option.

Figure 3.35 Build types.

Figure 3.36 Launching the Debug session.

Figure 3.37 Running the project.

Figure 3.38 View functions.

Figure 3.39 dotp.c: Source code to be completed.

Figure 3.40 Clock icon and cycle count.

Figure 3.41 Clock setup.

Figure 3.42 Clock setup.

Figure 3.43 Clock setup.

Figure 3.44 Selecting Debug configurations.

Figure 3.45 Setting the Debug configuration.

Figure 3.46 Setting the device.

Figure 3.47 Setting the project location.

Figure 3.48 Setting Core 2.

Figure 3.49 Setting for the second project.

Figure 3.50 Grouping the cores.

Figure 3.51 Grouping the cores.

Figure 3.52 Console output.

Chapter 04

Figure 4.1 N‐bit fractional number representation.

Figure 4.2 Binary multiplication of two fractional numbers.

Figure 4.3 15‐bit * 15‐bit resulting in Q30 and Q15 formats.

Figure 4.4 32‐bit IEEE standard format.

Figure 4.5 64‐bit IEEE standard format.

Figure 4.6 Accuracy of the 32‐bit floating‐point number.

Figure 4.7 Variables before running the code.

Figure 4.8 Final results.

Chapter 05

Figure 5.1 Optimisation flow procedure.

Figure 5.2 Illustration of the different stages of the optimising compiler.

Figure 5.3 dotpsa2 file.

Figure 5.4 Viewing register pairs.

Figure 5.5 Interfacing C and assembly.

Figure 5.6 Automatic and manual saving of registers.

Figure 5.7 Dependency graph of an FIR filter.

Figure 5.8 Dependency graph of an FIR filter.

Figure 5.9 Dependency graph of an FIR filter.

Figure 5.10 Final dependency graph of an FIR filter.

Figure 5.11 DOTP2 instruction.

Figure 5.12 Dependency graph of the dotp function using dotp2 instructions.

Figure 5.13 dotp function implemented with dotp2 instructions.

Figure 5.14 dotp4h instruction functionality.

Figure 5.15 Dependency diagram for the dotp function.

Figure 5.16 dotp implemented with dotp4h instructions.

Figure 5.17 DDOTP4H instruction.

Figure 5.18 Dependency diagram for the dotp function using ddotp4h instruction.

Figure 5.19 dotp implemented with ddotp4h instructions.

Figure 5.20 Illustration of the DDOTP4H instruction.

Figure 5.21 Dependency for the DDTOP4h.sa algorithm.

Figure 5.22 ddotp4h2.sa.

Figure 5.23 TMS320C66x memory banks.

Figure 5.24 dotp code to optimise.

Figure 5.25 Load balancing by unrolling a loop.

Figure 5.26 Loop‐carried dependency graph.

Figure 5.27 Loop‐carried dependency graph.

Figure 5.28 Console output showing all results.

Figure 5.29 Dependency graph.

Chapter 06

Figure 6.1 Interrupt response procedure.

Figure 6.2 Various interrupts available.

Figure 6.3 Interrupts: the big picture.

Figure 6.4 CIC controllers for the TMS320C6678 [1].

Figure 6.5 CIC controller for the 66AK2H14/12 [2].

Figure 6.6 The enabler functionality.

Figure 6.7 System Interrupt Status Indexed Set Register (STATUS_SET_INDEX_REG).

Figure 6.8 Example of channel mapping.

Figure 6.9 Default mapping.

Figure 6.10 Host interrupt mapping for the CIC0 viewed with the CCS.

Figure 6.11 Host Interrupt Map Registers [3].

Figure 6.12 Configuration script.

Figure 6.13 Accessing the event combiner.

Figure 6.14 Selecting Group 2 to generate Interrupt 6.

Figure 6.15 Enabling Events 89 and 90.

Figure 6.16 Interrupt controller.

Figure 6.17 Event combiner.

Figure 6.18 Overall functionality of the interrupt mechanism.

Figure 6.19 Experimental setup.

Figure 6.20 Console output showing the functions called.

Chapter 07

Figure 7.1 TI‐RTOS components.

Figure 7.2 TI‐RTOS kernel.

Figure 7.3 Various threads available for the TI‐RTOS.

Figure 7.4 SYS/BIOS configuration file.

Figure 7.5 Hwi module settings.

Figure 7.6 Hwi instance settings.

Figure 7.7 Configuration script generated.

Figure 7.8 Timers events IDs [1].

Figure 7.9 Definition of a hook set.

Figure 7.10 Definition of a hook set with only two elements.

Figure 7.11 Configuration code setting two Hwi hook sets.

Figure 7.12 C code defining the hook functions.

Figure 7.13 Program counter in main().

Figure 7.14 Output showing myRegister1_HWI() and myRegister2_HWI() run before the application reaches main().

Figure 7.15 Breakpoint set in myHWI.

Figure 7.16 Setting the breakpoint counter to 3.

Figure 7.17 Output showing the hook functions running before and after the Hwi.

Figure 7.18 A task structure.

Figure 7.19 Setting the number of priorities for the tasks.

Figure 7.20 Task modes.

Figure 7.21 Selecting semaphores for setups.

Figure 7.22 Selecting a semaphore module.

Figure 7.23 Instance settings.

Figure 7.24 TI‐RTOS kernel.

Figure 7.25 Adding an event module.

Figure 7.26 Event instance settings.

Figure 7.27 Event instance settings generated.

Figure 7.28 A task synchronised by events.

Figure 7.29 Hwi posting two events.

Figure 7.30 Example of an external memory fragmentation.

Figure 7.31 Setting the HeapMem.

Figure 7.32 HeapMem instance settings.

Figure 7.33 Code generated from Figure 7.32.

Figure 7.34 Setting myHeap section to be in the DDR.

Figure 7.35 Setting the default heap.

Figure 7.36 HeapBuf with fixed blocks.

Figure 7.37 Selecting the HeapBuf for configuration.

Figure 7.38 Configuration of the HeapBuf.

Figure 7.39 Script obtained from Figure 7.38.

Figure 7.40 myHeapSection allocation in DDR3.

Figure 7.41 Code generated from Figure 7.40.

Figure 7.42 Actual memory allocation.

Figure 7.43 Configuration code for testing HeapMultiBuf.

Figure 7.44 C code for testing HeapMultiBuf.

Figure 7.45 Timing of the clk0Fxn function.

Figure 7.46 Console output.

Figure 7.47 Output of project.

Figure 7.48 Sequence of events required.

Figure 7.49 Code to uncomment.

Figure 7.50 Select the Swi for configuration.

Figure 7.51 Setting of swi0.

Figure 7.52 Setting of swi1.

Figure 7.53 Setting the configuration for the clock.

Figure 7.54 Setting clock0.

Figure 7.55 Setting clock1.

Figure 7.56 Setting the clock.

Figure 7.57 Setting the configuration for the semaphore.

Figure 7.58 Adding a semaphore.

Figure 7.59 Setting the semaphore.

Figure 7.60 Setting the configuration for the tasks.

Figure 7.61 Warning that the tasks are not enabled.

Figure 7.62 Enabling tasks.

Figure 7.63 Adding threads modules.

Figure 7.64 Creating task0.

Figure 7.65 Creating task1.

Figure 7.66 Setting the configuration for the timer.

Figure 7.67 Adding a timer instance.

Figure 7.68 Configuring the timer.

Figure 7.69 Setting the configuration for the Hwi.

Figure 7.70 Setting hwi1.

Figure 7.71 Console output.

Figure 7.72 Console output when both events occur.

Figure 7.73 How to post an event.

Figure 7.74 Buffer locations.

Figure 7.75 Setting a breakpoint and checking the memory allocation.

Figure 7.76 Error generated since all memory is in use.

Chapter 08

Figure 8.1 TMS320C66AK2H12 functional block diagram.

Figure 8.2 TMS320C6678 functional block diagram.

Figure 8.3 DMA and QDMA within an EDMA.

Figure 8.4 DMA channels for the TMS320C6678.

Figure 8.5 IDMA Channel 0 and Channel 1 functions.

Figure 8.6 EDMA controller.

Figure 8.7 EDMA3 channel controller (EDMA3CC).

Figure 8.8 TMS320C6678 EDMA3 events.

Figure 8.9 Transfer controllers.

Figure 8.10 The four levels of EDMA prioritisation [4].

Figure 8.11 Trigger source priority.

Figure 8.12 Parameter RAM (PaRAM).

Figure 8.13 Channel options parameter (OPT).

Figure 8.14 Transfer configuration.

Figure 8.15 A – Synchronisation.

Figure 8.16 AB – Synchronisation.

Figure 8.17 Simple EDMA transfer.

Figure 8.18 One block transfer.

Figure 8.19 Chaining two EDMAs: example.

Figure 8.20 Early and normal transfer triggers.

Figure 8.21 Linked EDMA (a) before Channel 1 completes the transfer and (b) after Channel 1 completes the transfer.

Figure 8.22 Console output showing the OPT fields.

Figure 8.23 Console showing the data have been transferred.

Figure 8.24 Channel 1 and Channel 2 configurations.

Figure 8.25 Console showing initialisation of the source arrays.

Figure 8.26 Console showing the data have been transferred.

Figure 8.27 Console output showing the OPT fields.

Figure 8.28 Console showing the data have been transferred.

Chapter 09

Figure 9.1 Shared memory model.

Figure 9.2 Example of a message queue mechanism.

Figure 9.3 Features for the shared memory and message queue IPCs.

Figure 9.4 Modules used by the IPC.

Figure 9.5 Components dependency.

Figure 9.6 Notify module functionality.

Figure 9.7 notify_loopback.c file.

Figure 9.8 Output console.

Figure 9.9 Example of a MessageQ sender/receiver topology.

Figure 9.10 Message priority settings.

Figure 9.11 Priority illustration.

Figure 9.12 Synchronisation between the writer and the reader.

Figure 9.13 Illustration of synchronisation when using Swis.

Figure 9.14 How GateMP is used.

Figure 9.15 Allocation and using a shared memory.

Figure 9.16 Shared memory transport.

Figure 9.17 SRIO transport.

Figure 9.18 Properties used for the IpcSharedMem project.

Figure 9.19 Tools used for the IpcSharedMem project.

Figure 9.20 Grouping the cores.

Figure 9.21 All cores grouped.

Figure 9.22 Output console.

Figure 9.23 IPC API references.

Figure 9.24 Grouping Core0 and Core1.

Figure 9.25 Console output.

Figure 9.26 Illustration of the IPC communication using the MessageQ.

Figure 9.27 KeyStone II EVM setup.

Figure 9.28 Device manager for identifying the COM ports.

Figure 9.29 Setting up the COM ports.

Figure 9.30 VMware.

Figure 9.31 Connecting the flash drive.

Figure 9.32 Establishing the connection between the PC and the EVM.

Figure 9.33 Edit connections.

Figure 9.34 Choose a connection type.

Figure 9.35 Select a MAC address and a connection name.

Figure 9.36 Select Shared to other computers.

Figure 9.37 Output when the connection is made.

Figure 9.38 Use ifconfig to check the IP addresses.

Figure 9.39 PuTTY terminal after booting.

Figure 9.40 Finding the IP address for EVM.

Figure 9.41 Setting up FileZilla.

Figure 9.42 Importing a file.

Figure 9.43 Select Existing Projects into Workspace.

Figure 9.44 Selecting the root directory.

Figure 9.45 Building a project.

Figure 9.46 Console output.

Figure 9.47 Importing a Code Composer Studio project.

Figure 9.48 Selecting the directory.

Figure 9.49 Building a project.

Figure 9.50 Console output.

Figure 9.51 Transferring the ARM code.

Figure 9.52 Transferring the DSP code.

Figure 9.53 Transferring the load.sh.

Figure 9.54 Changing the permission of a file.

Figure 9.55 Console output.

Figure 9.56 Accessing files using FileZilla.

Figure 9.57 Output after edge detection.

Figure 9.58 Console output.

Figure 9.59 Output after edge detection.

Chapter 10

Figure 10.1 Trace functions available with the standard trace.

Figure 10.2 How to select when to start and/or stop tracing.

Figure 10.3 Using a function name as the starting location for a capture.

Figure 10.4 Trace range specified with a function name and range (32 bytes) for this example.

Figure 10.5 Captured data when using Trace in Range.

Figure 10.6 Setting the trace variable.

Figure 10.7 Trace output after setting the trace variable.

Figure 10.8 Store sample configuration.

Figure 10.9 Store sample configuration example.

Figure 10.10 Event trace with stall.

Figure 10.11 Event trace with memory.

Figure 10.12 KeyStone CP tracer modules [8].

Figure 10.13 Using the CP tracer.

Figure 10.14 Debug sub‐system.

Figure 10.15 TMS320C66x debug architecture [8].

Figure 10.16 Bugged code example that can be detected using state sequencing.

Figure 10.17 AET logic.

Figure 10.18 System libraries.

Figure 10.19 UBM available functions.

Figure 10.20 Error generated when asking for more resources.

Figure 10.21 UIA components.

Figure 10.22 Events functions provided by the UIA.

Figure 10.23 Log_Event description [10].

Figure 10.24 Multicore Software Development Kit (MCSDK).

Figure 10.25 The xdc.runtime package and its modules [12].

Figure 10.26 RTOS Analyzer features available.

Figure 10.27 System Analyzer additional functions.

Figure 10.28 A test code with a main(), a task() and two Swis functions.

Figure 10.29 Using SysMin to display System_printf outputs.

Figure 10.30 Selecting the task knl to observe the tasks.

Figure 10.31 Selecting the Swi knl to observe the Swis.

Figure 10.32 RTSC configuration used in this project.

Figure 10.33 Test code.

Figure 10.34 (a) Opening the UIA dialogue box. (b) Configuring the UIA.

Figure 10.35 Available commands used for the RTOS Analyzer.

Figure 10.36 Analysis configuration.

Figure 10.37 Adding compiler options.

Figure 10.38 Output of the logged data.

Figure 10.39 Filtering the display message.

Figure 10.40 Configuration used in this project.

Figure 10.41 Source code for Laboratory experiment 2.

Figure 10.42 RTSC configuration used in this project.

Figure 10.43 Selecting options for user‐written software.

Figure 10.44 Analysis configuration.

Figure 10.45 Duration graph showing the time to execute each function (without optimisation).

Figure 10.46 Execution graph showing the order in which functions were executed.

Figure 10.47 Duration graph showing the time to execute each function (with -o3 optimisation).

Figure 10.48 Source code.

Figure 10.49 Configuration code.

Figure 10.50 Using the Main.common$.diags_INFO = Diags.ALWAYS_ON.

Figure 10.51 Using the Main.common$.diags_INFO = Diags.ALWAYS_OFF.

Figure 10.52 Using the Main.common$.diags_INFO = Diags.RUNTIME_OFF or Main.common$.diags_INFO = Diags.RUNTIME_ON.

Figure 10.53 Application code.

Figure 10.54 Configuration file.

Figure 10.55 Console output showing USER2 and all levels above three.

Figure 10.56 Configuration file using USER1 and LEVEL2.

Figure 10.57 Output file.

Chapter 11

Figure 11.1 Boot address of a CorePac, DSP_BOOT_ADDR, for the KeyStone I.

Figure 11.2 Boot address of a CorePac, DSP_BOOT_ADDR, for the KeyStone II.

Figure 11.3 High‐level overview of the boot process for the KeyStone I [4].

Figure 11.4 High‐level overview of the boot process for the KeyStone II [2].

Figure 11.5 Boot mode configuration switches for the TMS320C6678 EVM.

Figure 11.6 Boot mode configuration switches for the KeyStone II EVM.

Figure 11.7 Boot mode pin for the TMS320C6678 (DEVSTAT) [3].

Figure 11.8 Device status register for the TMS320C66AK2H14/12/06 [1].

Figure 11.9 Boot mode pins for the TMS320C66AK2H14/12/06 (DEVSTAT) [1].

Figure 11.10 DEVSTAT boot mode pins ROM mapping.

Figure 11.11 SPI device configuration field descriptions.

Figure 11.12 The RBL boot process.

Figure 11.13 Detailed RBL boot process for the TMS320C6678.

Figure 11.14 The boot table.

Figure 11.15 Modifying the boot configuration table.

Figure 11.16 Boot complete register, BCx. Note: BCx CorePacx boot status: 0 = CorePacx boot NOT complete, 1 = CorePacx boot complete.

Figure 11.17 IBL location. Note: In Rev 1.0 of the C6670 EVM, the FPGA is programmed to invoke the IBL in order to execute the PLL fix, and then jump right back to RBL which continues the process. See the reference for the IBL update [6].

Figure 11.18 IBL boot modes.

Figure 11.19 IBL configuration.

Figure 11.20 TMS320C6678 EVM memory layout.

Figure 11.21 File locations.

Figure 11.22 Output after running the build.bat file.

Figure 11.23 No boot mode switches.

Figure 11.24 Loading the GEL file.

Figure 11.25 Loading the image.

Figure 11.26 Loading the memory.

Figure 11.27 Entering the information for the memory block to be loaded.

Figure 11.28 Output when the NOR is flashed properly.

Figure 11.29 ROM SPI boot mode.

Figure 11.30 The boot sequence for the KeyStone II EVM.

Figure 11.31 EVM connection to the PC.

Figure 11.32 EVMK2H hardware.

Figure 11.33 Device manager for identifying the COM ports.

Figure 11.34 Setting up the COM ports.

Figure 11.35 IP addresses used in this experiment.

Figure 11.36 Accessing the network settings.

Figure 11.37 Setting the network connection to Bridged.

Figure 11.38 Setting the IP address of VMware.

Figure 11.39 Setting the host PC IP address.

Figure 11.40 Console output after booting Linux.

Figure 11.41 Setting eth0’s IP address to 192.168.2.5.

Figure 11.42 Accessing the EVM from VMware.

Figure 11.43 Using the BMC to verify the EVM boot mode selected.

Figure 11.44 Using ipconfig to check the IP addresses.

Figure 11.45 Setting the eth0 to IP address 192.168.2.105.

Figure 11.46 Monitor showing the file system.

Figure 11.47 Host‐mounted NFS server.

Figure 11.48 EVM setup.

Figure 11.49 Creating directory to hold the TI SDK Linux root file system.

Figure 11.50 Check the COM ports.

Figure 11.51 Configure the COM ports.

Figure 11.52 Power up the EVM and abort the boot.

Figure 11.53 Setting the Ubuntu Ethernet connection.

Figure 11.54 Restarting the server.

Figure 11.55 Booting the EVM.

Figure 11.56 EVM booted.

Figure 11.57 Finding the EVM IP address.

Figure 11.58 Finding the IP address for Ubuntu.

Figure 11.59 Connect to the EVM and Ubuntu using FileZilla.

Figure 11.60 Creating a directory.

Figure 11.61 Terminal showing the created directory test0.

Chapter 12

Figure 12.1 The three main components of OpenMP.

Figure 12.2 Structure of OpenMP.

Figure 12.3 Illustration of forking.

Figure 12.4 Output console.

Figure 12.5 private() and firstprivate() examples.

Figure 12.6 Console output (using private and firstprivate).

Figure 12.7 Using reduction.

Figure 12.8 Using static scheduling.

Figure 12.9 Dynamic scheduling.

Figure 12.10 Guided scheduling.

Figure 12.11 Output showing the three scheduling kinds (types) with small iteration number.

Figure 12.12 Output showing the three scheduling kinds (types) with large iteration number.

Figure 12.13 Using omp sections.

Figure 12.14 Example with and without task directives.

Figure 12.15 Output when using task directive.

Figure 12.16 Output when not using task directive.

Figure 12.17 Example with OpenMP task.

Figure 12.18 Using task and taskwait.

Figure 12.19 Output console for four runs.

Figure 12.20 Console output.

Figure 12.21 Console output.

Figure 12.22 Console output.

Figure 12.23 Console output.

Figure 12.24 Using #pragma omp target.

Figure 12.25 Using the map clause.

Figure 12.26 Example when data do need to be processed by the host.

Figure 12.27 Enabling OpenMP.

Figure 12.28 Group the cores before running the code.

Figure 12.29 Output console.

Figure 12.30 Remove the Auto run option.

Figure 12.31 Code using omp section.

Figure 12.32 Console outputs.

Figure 12.33 EVM connection to the PC.

Figure 12.34 Identifying ports used by the EVM.

Figure 12.35 Setting up the COM port 14.

Figure 12.36 Setting up the COM port 16.

Figure 12.37 Invoking VMware.

Figure 12.38 The VMware.

Figure 12.39 Edit connections.

Figure 12.40 Select an Ethernet connection or add one.

Figure 12.41 Create an Ethernet connection.

Figure 12.42 Select Ethernet.

Figure 12.43 Edit the Connection name.

Figure 12.44 Select IPv4 settings.

Figure 12.45 Add the Ethernet created.

Figure 12.46 KeyStone II EVM booting.

Figure 12.47 Login as root.

Figure 12.48 Use ifconfig to check the IP addresses.

Figure 12.49 Starting FileZilla.

Figure 12.50 Using FileZilla to transfer files from the host to the EVM.

Figure 12.51 Selecting the files to be transferred to the EVM.

Figure 12.52 Changing the file permission.

Figure 12.53 Running the dotp project on the EVM.

Figure 12.54 Output console showing time consumed when code is running on the target or host.

Figure 12.55 Code running on the host.

Figure 12.56 OpenMP code.

Figure 12.57 The makefile used (Makefile).

Figure 12.58 Console output showing the results.

Chapter 13

Figure 13.1 OpenCL platform model.

Figure 13.2 Host, compute devices and compute units.

Figure 13.3 Operation of OpenCL.

Figure 13.4 Context/platform used in this chapter.

Figure 13.5 Example of an application using OpenCL.

Figure 13.6 Example 1: Illustration of work item and workgroups.

Figure 13.7 Example 2: Illustration of work item and workgroups.

Figure 13.8 Example 3: Illustration of work item and workgroups.

Figure 13.9 How data are divided amongst work items.

Figure 13.10 KeyStone II memory map definition.

Figure 13.11 Memory not accessible by the device.

Figure 13.12 Data copy when using CL_MEM_USE_HOST_PTR and the data are allocated in the host.

Figure 13.13 No data copy when using CL_MEM_USE_HOST_PTR and the data are located by the host in the device memory.

Figure 13.14 Using CL_MEM_ALLOC_HOST_PTR.

Figure 13.15 Data copy using CL_MEM_COPY_HOST_PTR.

Figure 13.16 No barrier is used.

Figure 13.17 A barrier is used.

Figure 13.18 Hardware setup.

Figure 13.19 EVM connection to the PC.

Figure 13.20 Device manager for identifying the COM ports.

Figure 13.21 Setting up the COM ports.

Figure 13.22 VMware start‐up window.

Figure 13.23 VMware detected the USB device.

Figure 13.24 Edit connections.

Figure 13.25 Choose a connection type.

Figure 13.26 Select a MAC address and a connection name.

Figure 13.27 Select Shared to other computers.

Figure 13.28 Output when the connection is made.

Figure 13.29 Use ifconfig to check the IP addresses.

Figure 13.30 Boot process completed successfully.

Figure 13.31 Checking the IP address on the EVM.

Figure 13.32 Using FileZilla to transfer files from the host to the EVM.

Chapter 14

Figure 14.1 Functional block diagram for the C6678 [1].

Figure 14.2 Functional block diagram for the 66AK2H14 [2].

Figure 14.3 The Navigator architecture, simplified.

Figure 14.4 PKDMA within a peripheral.

Figure 14.5 PKDMA transmit side.

Figure 14.6 PKDMA receive side.

Figure 14.7 Tx Channel N Global Configuration Register A (0x000 + 32 × N) [10].

Figure 14.8 Infrastructure PKDMA.

Figure 14.9 Host packet descriptor structure: example.

Figure 14.10 Host and buffer packets linked.

Figure 14.11 Host buffer descriptor structure: example.

Figure 14.12 Monolithic descriptor.

Figure 14.13 Memory region indexing.

Figure 14.14 Configuring the memory region registers.

Figure 14.15 The base address of the QMSS configuration registers for the KeyStone I [1].

Figure 14.16 The base address of the QMSS configuration registers for the KeyStone II [2].

Figure 14.17 Code for reading the number of bytes and the number of descriptors in the queue.

Figure 14.18 QMSS architecture (KeyStone I).

Figure 14.19 QMSS architecture (KeyStone II).

Figure 14.20 Event management with the Navigator.

Figure 14.21 Accumulator interrupt generation for the KeyStone I.

Figure 14.22 Core‐to‐peripheral movement.

Figure 14.23 Core‐to‐core data movement using the Navigator.

Chapter 15

Figure 15.1 An arbitrary frequency response of an FIR filter showing the periodicity.

Figure 15.2 Filter specifications: (a) ideal, (b) practical.

Figure 15.3 Frequency response: (a) desired, (b) ideal.

Figure 15.4 (a) Ideal frequency response of a low‐pass filter and (b) its impulse response.

Figure 15.5 Frequently used windows.

Figure 15.6 MATLAB program for generating the impulse response coefficients.

Figure 15.7 Plot of the filter coefficients h(n).

Figure 15.8 Transfer function of the designed filter.

Figure 15.9 Direct form structure for an FIR filter.

Figure 15.10 Impulse response, h(n), for (a) N odd and (b) N even.

Figure 15.11 Linear phase structure for (a) N even and (b) N odd.

Figure 15.12 Cascade structure.

Figure 15.13 PC‐to‐DSP connections.

Figure 15.14 USB and UART ports of the EVM.

Figure 15.15 C implementation of an FIR filter.

Figure 15.16 UART functional block diagram [8].

Figure 15.17 Illustration of the synchronisation mechanism.

Figure 15.18 FIR filter implementation in C language.

Figure 15.19 Java synchronisation code.

Figure 15.20 Setting a breakpoint before echoing the character ‘a’.

Figure 15.21 Setting the breakpoint property to Refresh All Windows.

Figure 15.22 Icon used for displaying the graph properties.

Figure 15.23 Graph priority for the input signal (using R_in.graphProp).

Figure 15.24 Graph priority for the filtered signal (using R_out.graphProp).

Figure 15.25 Java download location (using Windows x64 Offline) [9].

Figure 15.26 Importing an Eclipse project.

Figure 15.27 Opening an existing Eclipse project.

Figure 15.28 Importing a compressed Eclipse project.

Figure 15.29 Opening the Java source code.

Figure 15.30 Java window for controlling the input signals, the COM ports and the baud rate.

Figure 15.31 Default Java application.

Figure 15.32 Java application sending data to the UART.

Figure 15.33 Code Composer Studio console output.

Figure 15.34 Java window message when no data are lost.

Figure 15.35 Java output when data are lost.

Figure 15.36 Input signal display.

Figure 15.37 Filtered signal display.

Chapter 16

Figure 16.1 s‐ to z‐plane mapping.

Figure 16.2 Relationship between the analogue and digital frequencies.

Figure 16.3 Relationship between the analogue and digital frequency responses when using the bilinear transform (BZT).

Figure 16.4 Transfer function of an IIR filter designed with the bilinear transform method.

Figure 16.5 Direct form I structure.

Figure 16.6 Direct form II canonical realisation.

Figure 16.7 Alternative to the direct form II realisation.

Figure 16.8 Cascade realisation using direct form II.

Figure 16.9 Frequency response of a second‐order filter using the impulse invariant method.

Figure 16.10 Direct form II structure.

Figure 16.11 C code for the implementation of an IIR filter.

Figure 16.12 dotp4 with dummy T0 and T00 values.

Figure 16.13 Updating d02 and d01.

Figure 16.14 C code for the implementation of an IIR filter using SIMD instructions.

Figure 16.15 Performance comparison.

Figure 16.16 Build options.

Figure 16.17 Input signal (Single time −1) and the filtered signal (Single time −0).

Chapter 17

Figure 17.1 Basic block diagram of an adaptive filter.

Figure 17.2 Steps for implementing an LMS adaptive filter.

Figure 17.3 LMS algorithm in C language.

Figure 17.4 Coefficients and data storage.

Figure 17.5 DSMPY2 operation.

Figure 17.6 Illustration of the packh2 instruction.

Figure 17.7 Data location before and after the shift.

Figure 17.8 Updating data.

Figure 17.9 Using the compiler switches to increase the performance.

Figure 17.10 Cycles consumed.

Figure 17.11 Graph properties settings.

Figure 17.12 Output signal for the C code (Single Time ‐1) and (Single Time ‐0) for the linear assembly.

Chapter 18

Figure 18.1 Twiddle factors for N = 8.

Figure 18.2 Decimation in time (DIT) Radix 2 FFT.

Figure 18.3 Bit reversal.

Figure 18.4 Decimation in frequency (DIF) Radix 2 FFT.

Figure 18.5 Diagram of DIT Radix 2 FFT used for the implementation.

Figure 18.6 Flow graph of a butterfly.

Figure 18.7 Implementation of the butterfly.

Figure 18.8 Main loops for implementing a Radix 2 FFT.

Figure 18.9 Tasks to perform.

Figure 18.10 General configuration used.

Figure 18.11 Real‐Time Software Components (RTSC) tools used.

Figure 18.12 Code for generating sinewaves.

Figure 18.13 Buffer holding the input data.

Figure 18.14 Graph properties used (Display_Time.graphProp).

Figure 18.15 Display of the input data.

Figure 18.16 Buffer holding the magnitude magg.

Figure 18.17 Graph properties.

Figure 18.18 FFT display.

Figure 18.19 EDMA ping‐pong.

Figure 18.20 Properties for displaying the input data in the array magg.

Figure 18.21 Display output.

Figure 18.22 Properties for displaying the input data in the array magg1.

Figure 18.23 FFT magnitude display of data in dstPong.

Figure 18.24 Properties for displaying the input data in the array magg2.

Figure 18.25 FFT magnitude display of data in dstPing.

Chapter 19

Figure 19.1 Cartesian representation of a line.

Figure 19.2 (a) Polar and (b) Cartesian coordinates.

Figure 19.3 Example of points on a Cartesian representation and their corresponding polar representation.

Figure 19.4 Plot of

.

Figure 19.5 The r function is antisymmetric.

Figure 19.6 Diagram showing how to calculate the index z.

Figure 19.7 System to implement.

Figure 19.8 Generated test image.

Figure 19.9 Image property of 1152 × 648.

Figure 19.10 MATLAB code for generating the image header file.

Figure 19.11 Code for implementing the accumulator.

Figure 19.12 Code to extract the maxima of the accumulator.

Figure 19.13 Selecting the graphic display.

Figure 19.14 Changing, importing or exporting the properties of a graph.

Figure 19.15 Properties for the input image (in): Image_in_properties.txt.

Figure 19.16 Input image (in): Image_in_properties.txt.

Figure 19.17 Properties for the image after the edge detection (out): Image_out_properties.txt.

Figure 19.18 Image output after edge detection: Image_out_properties.txt.

Figure 19.19 Image properties for the accumulator: accumulator_properties.txt.

Figure 19.20 Section of the accumulator output: accumulator_properties.txt.

Figure 19.21 Values and coordinates of the five maxima.

Chapter 20

Figure 20.1 Computation of depth.

Figure 20.2 Reducing the ROI.

Figure 20.3 (a) Left image, (b) right image and (c) left and right images merged showing the disparities reduced.

Figure 20.4 Reduced disparity, line by line.

Figure 20.5 Estimation of disparity range from neighbouring pixels [7].

Figure 20.6 Corresponding pixels appear at different coordinates on the left and the right images.

Figure 20.7 Indices calculation.

Figure 20.8 Selecting the image display feature.

Figure 20.9 SAD implementation in C language.

Figure 20.10 SAD output image.

Figure 20.11 NCC implementation in C language.

Figure 20.12 NCC output image.

Figure 20.13 ZNCC implemented in C language.

Figure 20.14 ZNCC output image.

Figure 20.15 Time comparison between SAD, NCC and ZNCC with no optimisation.

Figure 20.16 Time comparison between SAD, NCC and ZNCC with optimisation (−O3).

Multicore DSP

From Algorithms to Real‐time Implementation on the TMS320C66x SoC

Naim Dahnoun

University of Bristol

UK

This edition first published 2018
© 2018 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Naim Dahnoun to be identified as the author of this work has been asserted in accordance with law.

Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication data applied for

ISBN: 9781119003823

Cover design by Wiley
Cover image: © matejmo/Gettyimages

I dedicate this book to my children
Zahra, Yasmin and Riyad
and in memory of my parents

Preface

Many of today’s applications, such as medical, high‐end imaging, high‐performance computing and core networking, face increasing challenges in terms of data traffic, processing power and device‐to‐device communication. These challenges place high demands on processors and their associated software, and have led processor manufacturers to sustain Moore’s law by introducing multicore processors. Texas Instruments, with its leading‐edge technology, introduced the multicore System‐on‐Chip (SoC) family of processors to address these issues. As will be shown in this book, Texas Instruments introduced innovations at many levels, such as: powerful CPUs that support both fixed‐ and floating‐point arithmetic (instruction by instruction) and can achieve more than 40G multiplications/core; a Navigator that enables direct communication between cores and memory access that removes data movement bottlenecks; a Hyperlink interface; and advanced development tools.

The challenge lies not only in how many cores can be put on a piece of silicon, the processing power of each core and how fast the cores can communicate, but also in the programming model and its ease of use. Unfortunately, programming models are not yet sufficiently developed to handle several cores. The improvement in performance gained by using a multicore processor depends very much on the application and the software used. C and C++, which are commonly used in embedded systems, do not support partitioning and, therefore, porting sequential code to multicore is not trivial. In this book, it will be shown that this complexity is alleviated by using: OpenMP, which is an Application Programming Interface (API) that supports multiplatform shared‐memory multiprocessing programming in C, C++ and Fortran; Open Computing Language (OpenCL); or Inter‐Processor Communication (IPC).

This book will help the reader to innovate by explaining the KeyStone SoC architectures, the development tools (including debugging) and various programming models with tested examples. It will also broaden the reader’s knowledge by critically analysing each element (see Table of Contents) and showing how these elements work together. With the large number of practical examples and references provided, the reader will be able to develop applications quickly, take advantage of the maximum performance and functionality of the processors, use the tools to develop and debug applications with ease, and find references to pertinent material. Real‐time multicore audio and video applications are provided. Applications are based on TI’s Multicore Software Development Kit (MCSDK), hand‐optimised code, OpenMP, OpenCL and IPC.

Due to the sheer amount of documentation available, some information is either referred to or reproduced to avoid discontinuity and misinterpretation.

This book is divided into 20 chapters. Chapters 1 to 15 deal with the hardware and software issues, and Chapters 16 to 20 deal with applications. Most of the concepts are backed up with laboratory experiments and demos that have been thoroughly tested.

Chapter 1 Introduction: This introductory chapter provides the reader with general knowledge on multicore processors and their applications; gives a brief comparison between digital signal processor (DSP) SoCs, field‐programmable gate arrays (FPGAs), graphic processors and CPUs; illustrates the challenges associated with multicore; and provides an up‐to‐date TMS320 roadmap showing the evolution of TI’s DSP chips in terms of processing power.

Chapter 2 The TMS320C66x architecture overview: This chapter comprehensively describes the TMS320C66x architecture. It includes a detailed description of the DSP CorePacs, an overview of the peripherals, an introduction to some useful instructions and an overview of the memory organisation.

Chapter 3 Software development tools and the TMS320C6678 EVM: This chapter describes the software development tools that are required for testing the applications used in this book. It provides a step‐by‐step guide to the installation and use of the Code Composer Studio (CCS).

Chapter 4 Numerical issues: This chapter explains how fixed‐ and floating‐point numbers are represented and how to handle binary arithmetic. It provides examples showing how to display various data formats using the CCS.

Chapter 5 Software optimisation: This chapter discusses the different levels of optimisation for multicore and shows how code can be optimised for a DSP core. It also shows how to use intrinsics and how to interface C code with intrinsics and assembly code. Multiple examples showing how to optimise code by hand and using the tools are provided.

Chapter 6 The TMS320C66x interrupts: This chapter shows how the interrupt controller events and the Chip‐level Interrupt Controller work and how to program them to respond to events. The examples given use the general‐purpose input–output (GPIO) pins to provide the interrupts.

Chapter 7 Real‐time operating system: TI‐RTOS: This chapter is divided into three main sections: (1) a real‐time scheduler that is composed of the hardware and software interrupts, the task, the idle, clock and timer functions, synchronisation and events; (2) dynamic memory management; and (3) laboratory experiments.

Chapter 8 Enhanced Direct Memory Access (EDMA3) Controller: This chapter describes in detail the operation of the EDMA and provides examples of simple, chained and linked transfers.

Chapter 9 Inter‐Processor Communication (IPC): This chapter explains the need for IPC; describes the Notify module, the MessageQ module, the ListMP module, the Multi‐processor Memory Allocation and the transport mechanism; and provides laboratory examples.

Chapter 10 Single and multicore debugging: This chapter introduces the need for debugging and describes the debug architecture, which includes trace, Advanced Event Triggering and the Unified Breakpoint Manager. It also describes the Unified Instrumentation Architecture, debugging with the System Analyzer tools, and instrumentation with TI‐RTOS and CCS, and it provides laboratory experiments.

Chapter 11 Bootloader for KeyStone I and KeyStone II: This chapter introduces the boot process for both the KeyStone I and KeyStone II, and provides laboratory experiments for both devices.

Chapter 12 Introduction to OpenMP: This chapter introduces the concept behind OpenMP and divides the content into three main sections: (1) work sharing, (2) data sharing and (3) synchronisation. Various examples for both KeyStone I and II are provided. For the KeyStone II, an example is implemented using the OpenMP accelerator model.

Chapter 13 Introduction to OpenCL for the KeyStone II: In this chapter, another programming model, Open Computing Language (OpenCL), is introduced. The chapter focuses on OpenCL for the KeyStone rather than for other devices. It shows that OpenCL is easy to use since the programmer does not need to deal with the details of communication between DSP cores or between the ARM and the DSP, which may be a daunting task.

Chapter 14 Multicore Navigator: This chapter shows how the Multicore Navigator can provide high‐speed packet data transfer to enhance CorePac‐to‐accelerator/peripheral data movements, core‐to‐core data movements, inter‐core communication and synchronisation without loading the CorePacs. Examples are also provided.

Chapter 15 FIR filter implementation