Design, develop, and debug complex FPGA-based systems-on-chip
Mounir Maaref
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Rahul Nair
Publishing Product Manager: Meeta Rajani
Senior Editor: Athikho Sapuni Rishana
Technical Editor: Rajat Sharma
Copy Editor: Safis Editing
Project Coordinator: Ashwin Kharwa
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Aparna Bhagat
Marketing Coordinator: Nimisha Dua
First published: December 2022
Production reference: 1111122
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-099-9
www.packt.com
To my very dear parents, Rahma and Kouider, may they accept here the tribute of my gratitude, which, however great it may be, will never equal their tenderness and devotion. To my beloved wife, Nadia, who fills my life with joy and happiness, and without whose support this work would not have been possible. To my lovely children, Anissa and Rayan, who give me the courage to persevere in my endeavors and the chance to be a parent; they help me give meaning to my life every day. To my dear sisters and brothers, and to all my family and friends: may they find here the expression of my most sincere feelings.
Mounir Maaref lives in the UK and works as a Principal SoC Architect. He has 25 years’ experience in the microelectronics industry spanning FPGAs, ASICs, embedded processing, networking, data storage, satellite communications, Bluetooth, and Wi-Fi connectivity. He likes working on cutting-edge technologies involving both hardware and software. His main focus is on system architecture design, hardware and software interactions, performance analysis, and modeling. He has published several application notes and white papers and has been a speaker at many conferences worldwide. He holds a master’s degree in Electronics and Telecoms. He is a 2nd dan black belt in Tang Soo Do and is training to become a martial arts instructor.
I would like to thank all the friendly and professional staff at Packt, with whom I was privileged to interact at all stages of writing and publishing this work. Their guidance and help were crucial in transforming my ideas into the material we share in this book. I am immensely grateful to Martin Abrahams, my friend and ex-colleague at Micron, who took the time to review my work and provide valuable feedback that only made this book better. I would also like to thank all my friends and colleagues at Samsung Cambridge Solutions Centre, who have given me a great opportunity to learn and contribute.
Martin Abrahams has worked in the semiconductor industry for over sixteen years in medical and automotive image sensor development for Micron and Onsemi. He is Onsemi’s architect for virtual prototyping for embedded microprocessor development using SystemC transaction-level modeling and leads the development of hardware component models for digital verification using SystemVerilog and UVM. Martin received a first-class honours bachelor of science degree from the University of Bristol and was awarded a doctor of philosophy by the University of Southampton. Earlier in his career, he was a founder of the EDA startup TransEDA, which developed the first VHDL and Verilog code coverage tools, now central to modern RTL verification.
I would like to thank my family and colleagues at Onsemi for the time they have granted me to investigate, develop, and deploy new design and verification methodologies that transform the way we design digital electronics. I must also acknowledge their patience and encouragement through the many attempts it takes to get to a working solution. I would also like to thank Mounir for his kind invitation to review his book.
Modern and complex SoCs can adapt to many demanding system requirements by combining the processing power of ARM processors and the feature-rich Xilinx FPGAs. You’ll need to understand many protocols, use a variety of internal and external interfaces, pinpoint the bottlenecks, and define the architecture of an SoC in an FPGA to produce a superior solution in a timely and cost-efficient manner. This book adopts a practical approach to help you master both the hardware and software design flows, understand key interconnects and interfaces, analyze the system performance and enhance it using acceleration techniques, and finally, build an RTOS-based software application for an advanced SoC design.
Beginning with an introduction to the FPGA SoC technology fundamentals and their associated development design tools, this book will guide you in building the SoC hardware and software, starting from the architecture definition to testing on a demo board or a virtual platform. The level of complexity evolves as the book progresses and it covers advanced applications such as communications, security, and coherent hardware acceleration.
By the end of this book, you’ll have learned the concepts underlying FPGA SoC advanced features and you’ll have constructed a high-speed SoC targeting a high-end FPGA from the ground up.
This book is intended for FPGA and ASIC hardware and firmware developers, IoT engineers, SoC architects, and anyone interested in understanding the process of developing a complex SoC, including all aspects of the hardware design and the associated firmware design. Prior knowledge of digital electronics and some experience coding in VHSIC Hardware Description Language (VHDL) or Verilog and in C, or a similar language suitable for embedded systems, is required. A general understanding of FPGA and CPU architecture will be helpful but is not mandatory.
Chapter 1, Introducing FPGA Devices and SoCs, begins by describing the FPGA technology and its evolution since it was first invented by Xilinx in the 1980s. It goes over the electronics industry gap that the FPGA devices cover, their adoption, and their ease of use to implement custom digital hardware functions and systems. It then describes the high-speed SoCs and their evolution since they were introduced as a solution by the major FPGA vendors in the early 2000s. It looks at SoC classification for the targeted applications, specifically for FPGA implementations.
Chapter 2, FPGA Devices and SoC Design Tools, begins by giving an overview of the Xilinx FPGA hardware design flow in general and the tools associated with it. It then highlights the specific tools used when designing an SoC for FPGAs. It also introduces SoC design hardware verification using the available simulation tools. The chapter also covers the software design flow and its different steps and introduces the tools involved in every step of the software design for an FPGA-based SoC.
Chapter 3, Basic and Advanced On-Chip Busses and Interconnects, begins by giving an overview of the busses and interconnects used within an SoC. It introduces the concepts of data sharing and coherency and how to solve their associated challenges. It gives a good introduction to the AMBA and OCP protocols. It also covers data movement within an SoC and the use of DMA engines.
Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects, begins by giving an overview of the busses and interconnects used off-chip to connect an SoC and/or an FPGA to other high-speed devices on the electronics board. It introduces the PCIe interconnect, the Ethernet interconnect, and the emerging Gen-Z protocol. It also introduces the emerging CCIX interconnect protocol and the concept of extending data coherency off-chip by adding protocol layers to manage it.
Chapter 5, Basic and Advanced SoC Interfaces, begins by defining an SoC interface for a given function. It classifies the SoC interfaces and lists their associated controller services. Then, the chapter covers processor caches and their organizations with a focus on ARMv7 architecture. It also introduces the processor memory management unit and its role in virtual-to-physical address translation and in implementing address space management and protection. It delves into the different memory and storage interfaces for on-chip and off-chip memories, their topologies and architectural features, and the criteria for choosing a given interface (or a combination of many).
Chapter 6, What Goes Where in a High-Speed SoC Design, teaches you about the SoC architecture definition phase that precedes the design and implementation phases. This phase is very useful to system architects as it translates a certain set of product requirements into a high-level description of the SoC design to be accomplished. It details the criteria used during the functional decomposition stage, in which a trade-off is reached between what is better suited to a hardware implementation and what is a good target for a software implementation. It gives an overview of SoC system modeling using many available tools and environments.
Chapter 7, FPGA SoC Hardware Design and Verification Flow, delves into building the SoC hardware using all the tools introduced in the previous chapters. This chapter is hands-on, where you will build a simple but complete SoC for a Xilinx FPGA. You are guided through every step of the SoC hardware design phases, from the concept to the FPGA image generation. The chapter will also cover the hardware verification aspects, such as using the available Register Transfer Level (RTL) simulation tools to simulate part of the design and check for potential hardware issues.
Chapter 8, FPGA SoC Software Design Flow, focuses on the steps involved in building the software that will run on the SoC processors. You will first configure the software components needed by this phase of the design process, such as customizing the Board Support Package (BSP), configuring the libraries, and customizing the drivers for a simple application. You will revisit the SoC project built in the previous chapter to learn how to define a distributed software microarchitecture and will go through the steps of building all the project software components using bare-metal software applications targeting the SoC hardware.
Chapter 9, SoC Design Hardware and Software Integration, helps you download the FPGA binary configuration file to the device and go through the SoC CPU boot phase, or target an emulation platform if a demo board isn’t available. You will debug the software running on the target platform (real hardware or virtual models) and gain practical familiarity with the available software debugging tools. You will also learn how to evaluate software performance and understand its associated metrics using the software profiling tools, in order to highlight any areas of concern in the designed system.
Chapter 10, Building a Complex SoC Hardware Targeting an FPGA, introduces you to some of the SoC design advanced topics that present many challenges to design engineers given their multidimensional nature. It will continue with the same practical approach as previous chapters by first adding more complex elements to the hardware design. It will now be built to host an embedded operating system as well. You will be introduced to the hardware acceleration techniques to help augment the system performance and equipped with the fundamental knowledge to make this step challenge-free. You will examine the different ways they can be applied and what system aspects need to be considered at the architectural level in the shared data paradigm.
Chapter 11, Addressing the Security Aspects of an FPGA-Based SoC, introduces you to the SoC security aspects and how these aspects are addressed by the FPGA SoC hardware. You will then learn about the security paradigms available in the ARM-based processors within the SoC hardware. The chapter will then introduce the security aspects from a software perspective and how they make use of the previously mentioned hardware security features to build a secure SoC in an FPGA.
Chapter 12, Building a Complex Software with an Embedded Operating System Flow, teaches you about the flow and helps you discover the tools used to build a complex software application to run on the complex FPGA SoC. You will use the design tools available to create the SoC BSP for the targeted embedded operating system, such as FreeRTOS. You will go through the process of generating an embedded bootloader for the target application to be used at runtime when the SoC is powered up or reset.
Chapter 13, Video, Image, and DSP Processing Principles in an FPGA and SoCs, introduces some of the advanced applications implemented in modern FPGAs and SoCs and what makes these devices such powerful compute engines for these compute- and bandwidth-demanding applications. It will clarify how the parallel processing required by DSP applications can be easily implemented in the FPGA logic and how these parallel compute engines can be interfaced, internally and externally, to wide memories and to the powerful CPUs available in the SoCs.
Chapter 14, Communication and Control System Implementation in FPGAs and SoCs, continues introducing more advanced applications implemented in modern FPGAs and SoCs and explains what makes these devices such powerful compute engines for these types of I/O- and bandwidth-demanding applications. It will focus on some of the communication protocols that can make use of the FPGA multi-Gb transceivers, the logic that can perform packet inspections and filtering, and the CPU that can implement algorithms in the SW to manage the communication stack and interface to the user and other onboard devices. It will also cover control applications in the FPGA and SoCs and how they can benefit from all of their available features.
You need to be familiar with digital electronics in general and, specifically, you need some fundamental knowledge of modern logic design at the RTL using a hardware description language such as VHDL, Verilog, or SystemVerilog. You will also need some working knowledge of embedded programming using a high-level language such as C or C++ and some experience using cross-compilers to build executables for a target embedded processor. The hardware design flow and the embedded software design flow both use tools packaged within the Xilinx Vivado and Vitis environments. You will be guided through their installation on your host machine. If your host machine runs a Windows operating system, you will be guided through installing VirtualBox, an Oracle hypervisor, to host a Linux guest operating system to be used as your development host.
Software/hardware covered in the book | Operating system requirements
Xilinx Vivado 2021.2 or a higher version | Windows 10 Enterprise/Professional, or Windows 10 HE with VirtualBox, or Ubuntu 16.04.5 LTS up to 20.04.1 LTS
Xilinx Vitis | Windows 10 Enterprise/Professional, or Windows 10 HE with VirtualBox, or Ubuntu 16.04.5 LTS up to 20.04.1 LTS
VirtualBox | Windows 10
All the required installation steps are described in detail in the book using a simple, logical, step-by-step approach that will get you up and running with the tools. Simply follow the instructions and don’t skip any steps during the installation process. Also, make sure you perform the required configuration when indicated to do so, so that the tools are ready to use; otherwise, you may waste time debugging issues that are hard to track down when an installation or configuration step has been skipped.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
If you can get hold of a Xilinx Zynq-7000 SoC board, this will be great as you can download the FPGA bitstream and the executable software to the board and perform debugging and profiling on the real hardware. Nevertheless, if you don’t have access to a Xilinx Zynq-7000 SoC board, you can still use a virtual target within the Xilinx tools to debug and interact with the executable software running on the virtual target.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Architecting-and-Building-High-Speed-SoCs. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
The Code in Action videos for this book can be viewed at http://bit.ly/3NNFkZs.
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/Hjk2H.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “In PM, Vivado Design Suite uses a project file (.xpr) and directory structure to maintain the design source files.”
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “You first need to launch the Vivado IDE, then on the Vivado launch screen, choose Quick Start, then Create Project.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Architecting and Building High Speed SoCs, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781801810999
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.
This part introduces the main features and building blocks of SoCs and FPGA devices and associated design tools and provides an overview of the main on-chip and off-chip interconnects and interfaces.
This part comprises the following chapters:
Chapter 1, Introducing FPGA Devices and SoCs
Chapter 2, FPGA Devices and SoC Design Tools
Chapter 3, Basic and Advanced On-Chip Busses and Interconnects
Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects
Chapter 5, Basic and Advanced SoC Interfaces
In this chapter, we will begin by describing what the field-programmable gate array (FPGA) technology is and its evolution since it was first invented by Xilinx in the 1980s. We will cover the electronics industry gap that FPGA devices fill, their adoption, and their ease of use for implementing custom digital hardware functions and systems. Then, we will describe the high-speed FPGA-based system-on-a-chip (SoC) and its evolution since it was introduced as a solution by the major FPGA vendors in the early 2000s. Finally, we will look at how various applications classify SoCs, specifically for FPGA implementations.
In this chapter, we’re going to cover the following main topics:
Xilinx FPGA devices overview
Xilinx SoC overview and history
Xilinx Zynq-7000 SoC family hardware features
Xilinx Zynq UltraScale+ MPSoC family hardware features
SoC in ASIC technologies
An FPGA is a very large-scale integration (VLSI) integrated circuit (IC) that can contain hundreds of thousands of configurable logic blocks (CLBs), tens of thousands of predefined hardware functional blocks, hundreds of predefined external interfaces, thousands of memory blocks, thousands of input/output (I/O) pads, and even a fully predefined SoC centered around an IBM PowerPC or an ARM Cortex-A class processor in certain FPGA families. These functional elements are optimally spread around the FPGA silicon area and can be interconnected via programmable routing resources. This allows them to behave in a functional manner that’s desired by a logic designer so that they can meet certain design specifications and product requirements.
Application-specific integrated circuits (ASICs) and application-specific standard products (ASSPs) are VLSI devices that have been architected, designed, and implemented for a given product or a particular application domain. In contrast to ASICs and ASSPs, FPGA devices are generic ICs that can be programmed to be used in many applications and industries. FPGAs are usually reprogrammable as they are based on static random-access memory (SRAM) technology, but there is a type that is only programmed once: one-time programmable (OTP) FPGAs. Standard SRAM-based FPGAs can be reprogrammed as their design evolves or changes, even once they have been populated in the electronics design board and after being deployed in the field. The following diagram illustrates the concept of an FPGA IC:
Figure 1.1 – FPGA IC conceptual diagram
As we can see, the FPGA device is structured as a pool of resources that the design assembles to perform a given logical task.
Once the FPGA’s design has been finalized, a corresponding configuration binary file is generated to program the FPGA device. This is typically done directly from the host machine at development and verification time over JTAG. Alternatively, the configuration file can be stored in a non-volatile media on the electronics board and used to program the FPGA at powerup.
Xilinx shipped its first FPGA, the XC2064, in 1985; it offered 800 gates and was produced on a 2.0 µm process. The Virtex UltraScale+ FPGAs, some of the latest Xilinx devices, are produced on a 16nm process node and offer high performance and dense integration capability. Some modern FPGAs use 3D IC stacked silicon interconnect (SSI) technology to work around the limitations of Moore’s law and pack multiple dies within the same package. Consequently, they now provide an immense 9 million system logic cells in a single FPGA device, a four-order-of-magnitude increase in capacity alone compared to the first FPGA, the XC2064. Modern FPGAs have also evolved in terms of their functionality, higher external interface bandwidth, and a vast choice of supported I/O standards. Since their initial inception, the industry has seen a multitude of quantitative and qualitative advances in FPGA devices’ performance, density, and integrated functionality. Also, the adoption of the technology has seen a major evolution, aided by adequate pricing and Moore’s law advancements. These breakthroughs, combined with matching advances in software development tools, intellectual property (IP), and support technologies, have created a revolution in logic design that has also penetrated the SoC segment.
There has also been the emergence of the new Xilinx Versal devices portfolio, which targets the data center’s workload acceleration and offers a new AI-oriented architecture. This device class family is outside the scope of this book.
FPGAs were initially used as the electronics board glue logic of digital devices. They were used to implement buses, decode functions, and patch minor issues discovered in the board ASICs post-production. This was due to their limited capacities and functionalities. Today’s FPGAs can be used as the heart of smart systems, exploiting their full capacity for parallel processing and their flexible adaptability to emerging and changing standards, specifically at the higher layers, such as the link and transaction layers of new communication or interface protocols. This makes the reconfigurable FPGA the obvious choice for medium or even large deployments of these emerging systems. With the addition of ASIC-class embedded processing platforms within the FPGA for integrating a full SoC, FPGA applications have expanded even deeper into industry verticals where they saw limited use in the past. It is also very clear that, with the prohibitive cost of non-recurring engineering (NRE) and of producing ASICs at the current process nodes, FPGAs are becoming the first choice for certain applications. They also offer a very short time to market for segments where such a factor is critical for the product’s success.
FPGAs can be found across the board in the high-tech sector, from classical fields such as wired and wireless communication, networking, defense, aerospace, industrial, audio-video broadcast (AVB), ASIC prototyping, instrumentation, and medical verticals to the modern era of ADAS, data centers, cloud and edge computing, high-performance computing (HPC), and ASIC emulation. There is an appealing case for using them in almost any electronics-based application.
Xilinx provides a comprehensive portfolio of FPGA devices to address different system design requirements across a wide application spectrum. For example, Xilinx FPGA devices can help system designers construct a base platform for a high-performance networking application requiring very dense logic capacity, very wide bandwidth, and high performance. They can also be used for low-cost, small-footprint logic designs using one of the low-cost FPGA devices, for either high- or low-volume end applications.
In this large offering, there are the cost-optimized families, such as the Spartan-6 family, built on a 45nm process node, and the Spartan-7, Artix-7, and Zynq-7000 families, built on a 28nm process node.
There is also the 7-series family in a 28nm process, which includes the Artix-7, Kintex-7, and Virtex-7 families of FPGAs, in addition to the Spartan-7 family.
Additionally, there are FPGAs from the UltraScale Kintex and Virtex families in a 20nm process node.
The UltraScale+ category contains three additional families – the Artix UltraScale+, the Kintex UltraScale+, and the Virtex UltraScale+, all in a 16nm process node.
Each device family has a matrixial offering table defined by the density of logic, the number of functional hardware blocks, the capacity of the internal memory blocks, and the number of I/Os in each package. This makes the offered combinations an interesting catalog from which to pick a device that meets the requirements of the system to be built with the specific FPGA. To examine a given device offering matrix, consult the specific FPGA family product table and product selection guide. For example, for the UltraScale+ FPGAs, go to https://www.xilinx.com/content/dam/xilinx/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf.
As highlighted in the introduction to this chapter, modern Xilinx FPGA devices contain a vast list of hardware block features and external interfaces that largely define their category or family and, consequently, make them suitable for a certain application or a specific market vertical. This chapter looks at the rich list of these features to help you understand what today’s FPGAs can offer system designers. It is worth noting that not all FPGAs contain all of these elements.
For a detailed overview of these features, you are encouraged to examine the Xilinx UltraScale+ Data Sheet as a good starting point at https://www.xilinx.com/content/dam/xilinx/support/documentation/data_sheets/ds890-ultrascale-overview.pdf.
In the following subsections, we will summarize some of these features.
Modern Xilinx FPGAs have an abundance of CLBs. These CLBs are formed by lookup tables (LUTs) and registers known as flip-flops. The CLBs are the elementary ingredients from which user logic functions are built, forming the desired engine that performs a combinatorial function, coupled (or not) with sequential logic built from the flip-flop resources contained within the CLBs. Following the full design process, from design capture through synthesis and implementation to the production of a binary image that programs the FPGA device, these CLBs are configured to operate in a manner that performs the required partial task within the overall function defined by the user. A CLB can also be configured to behave as a deep shift register, a multiplexer, or a carry logic function. It can also be configured as distributed memory, from which additional SRAM memory is synthesized to complement the SRAM resources that can be built using the FPGA device’s block RAM.
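To make the LUT-plus-register idea concrete, the following C sketch models a single 4-input LUT feeding a flip-flop: the LUT is just a truth table indexed by the input bits, and the flip-flop samples the LUT output on a simulated clock edge. This is a conceptual software model only, not a description of the real Xilinx CLB circuitry (which uses wider LUTs, carry chains, and multiplexers); the 4-input width and the AND-gate configuration are choices made purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Conceptual model of one CLB element: a 4-input LUT feeding a flip-flop.
 * The LUT is a 16-entry truth table; the 4 inputs select one entry.
 * (Real Xilinx CLBs use wider LUTs, carry logic, and muxes - this is
 * only an illustration of the principle.)
 */
typedef struct {
    uint16_t lut_init;   /* truth table: bit i = output for input pattern i */
    uint8_t  ff_q;       /* registered (sequential) output                  */
} clb_element_t;

/* Combinatorial path: look the output up in the truth table. */
static uint8_t lut_eval(const clb_element_t *e, uint8_t in4)
{
    return (e->lut_init >> (in4 & 0xF)) & 0x1;
}

/* Sequential path: capture the LUT output on a (simulated) clock edge. */
static void clock_edge(clb_element_t *e, uint8_t in4)
{
    e->ff_q = lut_eval(e, in4);
}

int main(void)
{
    /* "Program" the LUT as a 4-input AND gate: only pattern 0xF returns 1. */
    clb_element_t and4 = { .lut_init = 0x8000, .ff_q = 0 };

    clock_edge(&and4, 0xF);
    printf("AND4(1,1,1,1) registered output = %u\n", (unsigned)and4.ff_q);  /* 1 */
    clock_edge(&and4, 0x7);
    printf("AND4(0,1,1,1) registered output = %u\n", (unsigned)and4.ff_q);  /* 0 */
    return 0;
}
```

Reprogramming the FPGA amounts to loading different truth-table contents and routing choices, which is why the same silicon can implement arbitrarily different logic functions.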
Xilinx FPGAs have many block RAMs with built-in FIFO. Additionally, in UltraScale+ devices, there are 4Kx72 UltraRAM blocks. As mentioned previously, the CLB can also be configured as distributed memory from which more SRAM memory can be synthesized.
The Virtex UltraScale+ HBM FPGAs can integrate up to 16 GB of high-bandwidth memory (HBM) Gen2.
Xilinx Zynq UltraScale+ MPSoC also provides many layers of SRAM memory within its ARM-based SoC, such as OCM memory and the Level 1 and Level 2 caches of the integrated CPUs and GPUs.
Xilinx FPGAs are rich in resources for digital signal processing (DSP). They have DSP slices with 27x18 multipliers and rich local interconnects. The DSP slice has many usage possibilities, as described in the FPGA datasheet.
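As a rough illustration of the kind of work a DSP slice is built for, the following C fragment expresses a FIR-style multiply-accumulate loop; when such a kernel is implemented in the FPGA fabric (hand-written RTL or high-level synthesis), each multiply-accumulate typically maps onto a DSP slice. The operand widths and the function itself are illustrative assumptions, not a Xilinx API.

```c
#include <stdint.h>

/* Illustrative multiply-accumulate (MAC) kernel: the core operation a DSP
 * slice implements in hardware. The operand widths loosely mirror the
 * slice's 27x18 multiplier; the 64-bit accumulator stands in for the wide
 * accumulator/adder that follows the multiplier.
 */
int64_t fir_dot_product(const int32_t *coeff,  /* 27-bit-style coefficients */
                        const int16_t *sample, /* 18-bit-style samples      */
                        int taps)
{
    int64_t acc = 0;
    for (int i = 0; i < taps; i++) {
        acc += (int64_t)coeff[i] * sample[i];  /* one MAC per filter tap */
    }
    return acc;
}
```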
The Xilinx FPGA’s device interconnect employs a routing infrastructure, which is a combination of configurable switches and nets. These allow the FPGA elements such as the I/O blocks, the DSP slices, the memories, and the CLBs to be interconnected.
The efficiency of using these routing resources is as important as the device’s logical hardware resources and features. This is because the routing resources represent the nervous system of the FPGA device, interconnecting its abundance of logic and functional elements, and they are crucial to meeting the design performance criteria.
Xilinx FPGA devices contain many clock management elements, including delay-locked loops (DLLs) for clock generation and synthesis, global buffers for clock signal buffering, and routing infrastructure to meet the demands of many challenging design requirements. The flexibility of the clocking network minimizes inter-signal delays and skews.
The Xilinx FPGAs can interface to many external parallel memories, including DDR4 SDRAM. Some FPGAs also support interfacing to external serial memories, such as Hybrid Memory Cube (HMC).
Xilinx FPGA devices interface to the external ICs through I/Os that support many standards and PHY protocols, including the serial multi-gigabit transceivers (MGTs), Ethernet, PCIe, and Interlaken.
The first device family that Xilinx brought to the market that integrated an ARM CPU was the Zynq-7000 SoC FPGA with its integrated ARM Cortex-A9 CPU. This family was followed by the Xilinx Zynq UltraScale+ MPSoCs and RFSoCs, which feature a processing system (PS) that includes a dual or a quad-core variant of the ARM Cortex-A53, and a dual-core ARM Cortex-R5F. Some variants have a graphics processing unit (GPU). We will delve into the Xilinx SoCs in the next chapter.
Being SRAM-based, the FPGA requires a configuration file to be loaded at power-up to define its functionality. Consequently, any error encountered in the FPGA’s configuration binary image, either at configuration time or because of a physical problem in mission mode, will alter the overall system functionality and may even cause a disastrous outcome in sensitive applications. Therefore, critical applications require system monitoring, using the FPGA’s built-in self-monitoring mechanisms, so that such an error can be detected and corrected promptly to limit any potential damage.
Modern FPGAs provide decryption blocks to address security needs and protect the device’s hardware from hacking. FPGAs with integrated SoC and PS blocks have a configuration and security unit (CSU) that allows the device to be booted and configured safely.
In the early 2000s, Xilinx introduced the concept of building embedded processors into its available FPGAs at the time, namely the Spartan-2, Virtex-II, and Virtex-II Pro families. Xilinx brought two flavors of these early SoCs to the market: a soft version and an initial hard macro-based option in the Virtex-II Pro FPGAs.
The soft flavor uses MicroBlaze, a Xilinx 32-bit RISC soft processor, initially coupled with an IBM-based bus infrastructure called CoreConnect and a rich set of peripherals, such as Gigabit Ethernet MACs, PCIe, and DDR DRAM, to name just a few. A typical MicroBlaze soft processor-based SoC looks as follows:
Figure 1.2 – Legacy FPGA MicroBlaze embedded system
The hard macro version uses a 32-bit IBM PowerPC 405 processor. It includes the CPU core, a memory management unit (MMU), 16 KB L1 data and 16 KB L1 instruction caches, timer resources, the necessary debug and trace interfaces, the CPU CoreConnect-based interfaces, and a fast memory interface known as on-chip memory (OCM). The OCM connects to a mapped region of internal SRAM that’s been built using the FPGA block RAMs for fast code and data access. The following diagram shows a PowerPC 405 embedded system in a Virtex-II Pro FPGA device:
Figure 1.3 – Virtex-II Pro PowerPC405 embedded system
Embedded processing within FPGAs has received a wide adoption from different vertical spaces and opened the path to many single-chip applications that previously required the use of an external CPU, alongside the FPGA device, as the main board processor.
The Virtex-4 FX was the next generation to include the IBM PowerPC 405 and improved its core speed.
The Virtex-5 FXT followed and integrated the IBM PowerPC 440x5 CPU, a dual-issue superscalar 32-bit embedded processor with an MMU, a 32 KB instruction cache, a 32 KB data cache, and a crossbar interconnect. To interface with the rest of the FPGA logic, it has a processor local bus (PLB) interface, an auxiliary processor unit (APU) for connecting an FPU or a custom coprocessor built in the FPGA logic, and a high-speed memory controller interface. With the tri-speed 10/100/1000 Ethernet MACs integrated as hardware functional blocks in the FPGA, we started seeing the main ingredients necessary for making an SoC in FPGAs, with most of the logic-consuming hardware functions now bundled together around the CPU block or delivered as hardware functional blocks that just need interfacing and connecting to the CPU. This was a step closer to a full SoC in FPGAs. The following diagram shows a PowerPC 440 embedded system in a Virtex-5 FXT FPGA device:
Figure 1.4 – Virtex-5 FXT PowerPC440 embedded system
The Virtex-5 FXT was the last Xilinx FPGA to include an IBM-based CPU; the future lay in switching to ARM and providing a full SoC in FPGAs, with the possibility to interface to the FPGA logic through adequate ports. This offered the industry a new kind of SoC that, within the same device, combined the power of an ASIC and the programmability of the feature-rich Xilinx FPGAs. This brings us to this book’s main topic, where we will delve into all the related Xilinx design, development, and technological aspects while taking an easy-to-follow and progressive approach.
The following diagram illustrates the approach taken by Xilinx to couple an ARM-based CPU SoC with the Xilinx FPGA logic in the same chip:
Figure 1.5 – Zynq-7000 SoC FPGA conceptual diagram
The first device family that Xilinx brought to the market integrating an ARM Cortex-A9 CPU was the Zynq-7000 FPGA. The Cortex-A9 is a 32-bit processor that implements the ARMv7-A architecture and supports the ARM and Thumb-2 instruction sets. These devices are available in two configurations: a single Cortex-A9 core in the Zynq-7000S devices and a dual Cortex-A9 cluster in the Zynq-7000 devices.
The next generation was the Zynq UltraScale+ MPSoC devices, which provide a 64-bit ARM Cortex-A53 CPU cluster coupled with a 32-bit ARM Cortex-R5 cluster in the same SoC. The Cortex-A53 CPU implements the ARMv8-A architecture, while the Cortex-R5 implements the ARMv7-R architecture. The Zynq UltraScale+ MPSoC comes in different configurations: the CG series with a dual-core Cortex-A53 cluster, the EG series with a quad-core Cortex-A53 cluster and an ARM Mali GPU, and the EV series, which adds video codecs to what is available in the EG series.
A few years ago, Xilinx also launched a version of the MPSoC with key components to help build advanced radio connectivity SoCs: the Zynq UltraScale+ RFSoC.
As mentioned previously, the Zynq FPGA SoC integrates a popular ARM CPU based on the ARMv7 architecture and the classical FPGA fabric based on the Xilinx 7 series logic with rich hardware features.
For a detailed description of the Zynq-7000 SoC FPGA and its features, please refer to the SoC Technical Reference Manual (TRM) available at https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf.
This section specifies the main Zynq-7000 SoC features and defines them to help you quickly visualize the device’s capabilities.
The SoC is mainly composed of an application processor unit (APU), a connectivity matrix, an OCM memory interface, external memory interfaces, and the I/O peripherals (IOP) block.
The following diagram provides a detailed architectural view of the Zynq-7000 SoC:
Figure 1.6 – Zynq-7000 SoC architecture – dual-core cluster example
The CPU cluster topology is built around an ARM Cortex-A9 CPU, which comes in a dual-core or a single-core MPCore. Each CPU core has an L1 instruction cache and an L1 data cache. It also has its own MMU, a floating-point unit (FPU), and a NEON SIMD engine. The CPU cluster has an L2 common cache and a snoop control unit (SCU). This SCU provides an accelerator coherency port (ACP) that extends cache coherency beyond the cluster with external masters when implemented in the FPGA logic.
Each core provides a performance figure of 2.5 DMIPS/MHz with an operating frequency ranging from 667 MHz to 1 GHz, depending on the Zynq FPGA speed grade. The FPU supports both single- and double-precision operands with a performance figure of 2.0 MFLOPS/MHz. The CPU core is TrustZone-enabled for secure operation. It supports code compression via the Thumb-2 instruction set. The Level 1 instruction and data caches are both 32 KB in size and 4-way set-associative.
The CPU cluster supports both SMP and AMP operation modes. The Level 2 cache is 512 KB in size, is eight-way set-associative, and is common to both CPU cores and to both instructions and data. The cluster also has a 256 KB OCM RAM that can be accessed by the APU and the programmable logic (PL).
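A short worked example helps visualize these cache parameters. The C sketch below computes how an address splits into offset, set index, and tag for the 32 KB, 4-way set-associative L1 data cache, assuming a 32-byte cache line (the Cortex-A9 line size); the address used is arbitrary, and the same arithmetic applies to the 512 KB, 8-way L2 with its constants swapped in.

```c
#include <stdint.h>
#include <stdio.h>

/* Worked example: how a physical address maps into the Zynq-7000 APU's
 * 32 KB, 4-way set-associative L1 data cache. A 32-byte cache line is
 * assumed (the Cortex-A9 line size).
 */
#define CACHE_BYTES  (32u * 1024u)
#define WAYS         4u
#define LINE_BYTES   32u
#define NUM_SETS     (CACHE_BYTES / (WAYS * LINE_BYTES))  /* 32768/128 = 256 */

int main(void)
{
    uint32_t addr   = 0x00102A64u;                      /* arbitrary address */
    uint32_t offset = addr % LINE_BYTES;                /* byte within line  */
    uint32_t index  = (addr / LINE_BYTES) % NUM_SETS;   /* which set         */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);   /* identifies line   */

    printf("sets=%u offset=%u index=%u tag=0x%x\n",
           (unsigned)NUM_SETS, (unsigned)offset, (unsigned)index, (unsigned)tag);
    return 0;
}
```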
The PS has 8-channel DMA engines that support memory-to-memory and memory-to-peripheral transactions, as well as scatter-gather operations. Their interfaces are based on the AXI protocol. The FPGA PL can use up to four DMA channels.
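To picture what scatter-gather means in practice, the following C sketch shows a generic descriptor chain: each descriptor names a source, a destination, and a length, and points to the next descriptor, so the engine can move a fragmented buffer without CPU intervention. This is a conceptual illustration only; the real Zynq-7000 DMA controller has its own programming interface, which is documented in the TRM.

```c
#include <stdint.h>
#include <stddef.h>

/* Generic scatter-gather descriptor chain - an illustration of the
 * technique, not the Zynq DMA controller's real programming interface
 * (see the TRM for that). Each descriptor describes one contiguous chunk;
 * the engine follows 'next' pointers until it reaches NULL.
 */
typedef struct sg_desc {
    uint32_t src_addr;        /* physical source address       */
    uint32_t dst_addr;        /* physical destination address  */
    uint32_t length;          /* bytes to move for this chunk  */
    struct sg_desc *next;     /* next descriptor, NULL = last  */
} sg_desc_t;

/* Software model of what the hardware does when it walks the chain. */
uint32_t sg_total_bytes(const sg_desc_t *d)
{
    uint32_t total = 0;
    for (; d != NULL; d = d->next) {
        total += d->length;   /* the hardware would copy d->length bytes here */
    }
    return total;
}
```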
The SoC has a generic interrupt controller (GIC) version 1.0 (GIC v1). The GIC distributes interrupts to the CPU cluster cores according to the user’s configuration and provides support for priority and preemption.
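In a bare-metal application, hooking a handler into the GIC typically follows the pattern sketched below, which uses the Xilinx standalone BSP’s XScuGic driver; the interrupt ID and handler here are placeholders, and the exact macro and function names can vary between BSP versions, so treat this as a sketch rather than a definitive recipe.

```c
#include "xparameters.h"
#include "xscugic.h"
#include "xil_exception.h"
#include "xstatus.h"

/* Hypothetical interrupt ID of a PL-side peripheral routed to the GIC. */
#define MY_IRQ_ID  61u

static XScuGic gic;

static void my_irq_handler(void *callback_ref)
{
    /* Acknowledge and service the peripheral here. */
    (void)callback_ref;
}

int setup_gic_interrupt(void)
{
    XScuGic_Config *cfg = XScuGic_LookupConfig(XPAR_SCUGIC_SINGLE_DEVICE_ID);
    if (cfg == NULL)
        return -1;

    if (XScuGic_CfgInitialize(&gic, cfg, cfg->CpuBaseAddress) != XST_SUCCESS)
        return -1;

    /* Route GIC interrupts to the CPU's IRQ exception vector. */
    Xil_ExceptionRegisterHandler(XIL_EXCEPTION_ID_INT,
                                 (Xil_ExceptionHandler)XScuGic_InterruptHandler,
                                 &gic);
    Xil_ExceptionEnable();

    /* Attach our handler to the chosen interrupt ID and enable it. */
    XScuGic_Connect(&gic, MY_IRQ_ID, my_irq_handler, NULL);
    XScuGic_Enable(&gic, MY_IRQ_ID);
    return 0;
}
```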
The PS supports debugging and tracing and is based on ARM CoreSight interface technology.
The Zynq device supports both SDRAM DDR memory and static memories. DDR3/3L/2 and LPDDR2 speeds are supported. The static memory controllers interface to QSPI flash, NAND, and parallel NOR flash.
The SDRAM DDR interface has a dedicated 1 GB of system address space. It can be configured to interface to a full-width 32-bit memory or a half-width 16-bit memory and supports many DDR protocols. The PS also includes the DDR PHY and can operate at many speeds, up to a maximum of 1,333 Mb/s. This is a multi-port controller that can share the SDRAM DDR memory bandwidth with many SoC clients within the PS or PL regions over four ports. The CPU cluster is connected to one port; two ports serve the PL, while the fourth port is exposed to the SoC central switches, making access possible for all the connected masters.
The following diagram is a memory-centric representation of the SDRAM DDR interface of the Zynq-7000 SoC:
Figure 1.7 – Zynq-7000 SoC DDR SDRAM memory controller
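To put the headline figures above in perspective, the short calculation below estimates the theoretical peak bandwidth of the full-width configuration; it is back-of-the-envelope arithmetic only, and the sustainable bandwidth is lower and shared among the four ports.

```c
#include <stdio.h>

/* Back-of-the-envelope peak bandwidth for the Zynq-7000 DDR controller:
 * a 32-bit-wide interface at up to 1,333 Mb/s per data pin. Real, sustained
 * bandwidth is lower (refresh, bank conflicts, arbitration across the
 * four ports).
 */
int main(void)
{
    const double rate_mbps_per_pin = 1333.0;  /* data rate per DQ pin     */
    const int    bus_width_bits    = 32;      /* full-width configuration */

    double peak_mbytes_per_s = rate_mbps_per_pin * bus_width_bits / 8.0;
    printf("Theoretical peak: %.0f MB/s (about %.1f GB/s)\n",
           peak_mbytes_per_s, peak_mbytes_per_s / 1000.0);
    /* Prints roughly 5332 MB/s, i.e. ~5.3 GB/s before any overheads. */
    return 0;
}
```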
The static memory controller (SMC) is based on ARM’s PL353 IP. It can interface to NAND flash, SRAM, or NOR flash memories. It can be configured through an APB interface via its operational registers. The SMC supports the following external static memories:
64 MB of SRAM in 8-bit width
64 MB of parallel NOR flash in 8-bit width
NAND flash
The following diagram provides a micro-architectural view of the Zynq-7000 SoC SMC:
Figure 1.8 – Zynq-7000 SoC static memory controller architecture
The IOP block of the Zynq-7000 SoC includes a QSPI flash interface. It supports serial flash memory devices, as well as three modes of operation: linear addressing mode, I/O mode, and legacy SPI mode.
The software implements the flash device protocol in I/O mode. It provides the commands and data to the controller using the interface registers and reads the received data from the flash memory via the flash registers.
In linear addressing mode, the controller maps the flash address space onto the AXI address space and acts as a translation block between them. Requests that are received on the AXI port of the QSPI controller are converted into the necessary commands and data phases, while read data is put on the AXI bus when it’s received from the flash memory device.
In legacy mode, the QSPI interface behaves just like an ordinary SPI controller.
To write the software drivers for a given flash device to control via the Zynq-7000 SoC QSPI controller, you should refer to both the flash device data sheet from the flash vendor and the QSPI controller operational mode settings detailed in the Zynq-7000 TRM. The URL for this was mentioned at the beginning of this section.
The QSPI controller supports multiple flash device arrangements, such as 8-bit access using two parallel devices (to double the device throughput) or a 4-bit dual-rank arrangement (to increase the memory capacity).
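The practical difference between linear addressing mode and I/O mode shows up clearly in driver code: in linear mode the flash contents appear in the AXI address map and can be read like ordinary memory, while in I/O mode the software drives the command and data phases itself through the controller registers. The following C fragment sketches both approaches; the base addresses, register offsets, and the single flash command used are illustrative assumptions, and the real values must be taken from the Zynq-7000 TRM and the flash vendor’s datasheet.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical addresses for illustration only - take the real values from
 * the Zynq-7000 TRM and the flash device datasheet.
 */
#define QSPI_LINEAR_BASE   0xFC000000u   /* assumed flash window, linear mode */
#define QSPI_REG_BASE      0xE000D000u   /* assumed controller register block */
#define QSPI_TXD_OFFSET    0x1Cu         /* assumed TX data register offset   */
#define QSPI_RXD_OFFSET    0x20u         /* assumed RX data register offset   */
#define FLASH_CMD_READ_ID  0x9Fu         /* JEDEC "read ID" command           */

#define REG32(addr) (*(volatile uint32_t *)(uintptr_t)(addr))

/* Linear addressing mode: the controller translates AXI reads into flash
 * read sequences, so the flash looks like ordinary memory.
 */
void read_flash_linear(uint32_t flash_offset, void *dst, size_t len)
{
    memcpy(dst, (const void *)(uintptr_t)(QSPI_LINEAR_BASE + flash_offset), len);
}

/* I/O mode: software issues the command and collects the response through
 * the controller's data registers (simplified - no status polling shown).
 */
uint32_t read_flash_id_io_mode(void)
{
    REG32(QSPI_REG_BASE + QSPI_TXD_OFFSET) = FLASH_CMD_READ_ID;
    return REG32(QSPI_REG_BASE + QSPI_RXD_OFFSET);
}
```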
The IOP block contains the external communication interfaces and includes two tri-mode (10/100/1000 Mb/s) Ethernet MACs, two USB 2.0 OTG peripherals, two full CAN bus interfaces, two SDIO controllers, two full-duplex SPI ports, two high-speed UARTs, and two master/slave I2C interfaces. It also includes four banks of 32-bit GPIO. The IOP can interface externally through 54 flexible multiplexed I/Os (MIOs).
The interconnect is ARM AMBA AXI-based with QoS support. It groups masters and slaves from the PS and extends the connectivity to PL-implemented masters and slaves. Multiple outstanding transactions are supported. Through the Cortex-A9 ACP ports, I/O coherency is possible so that external masters and the CPU cores can coherently share data, minimizing the CPU core cache management operations. The interconnect topology is formed by many switches based on ARM NIC-301 interconnect and AMBA-3 ports. The following diagram provides an overview of the Zynq-7000 SoC interconnect:
Figure 1.9 – Zynq-7000 SoC interconnect topology
The Zynq UltraScale+ MPSoC is the second generation of the Xilinx SoC FPGAs based on the ARM CPU architecture. Like its predecessor, the Zynq-7000 SoC, it is based on the approach of combining the FPGA logic HW configurability and the SW programmability of its ARM CPUs but with improvements in both the FPGA logic and the ARM processor CPUs, as well as its PS features. The UltraScale+ MPSoC offers a heterogeneous topology that couples a powerful 64-bit application processor (implementing the ARMv8-A architecture) and a 32-bit real-time R-profile processor.
The PS includes many types of processing elements: an APU with a dual-core or quad-core Cortex-A53 cluster, the dual-core Cortex-R5F real-time processing unit (RPU), the Mali GPU, a PMU, and a video codec unit (VCU) in the EV series. The PS has an efficient power management scheme due to its granular power domain control and gated power islands. The Zynq UltraScale+ MPSoC has a configurable system interconnect and offers the user overall flexibility to meet many application requirements. The following diagram provides an architectural view of the Zynq UltraScale+ MPSoC:
Figure 1.10 – Zynq UltraScale+ MPSoC architecture – quad-core cluster
The following section provides a brief description of the main features of the Zynq UltraScale+ MPSoC. For a detailed technical description, please read the Zynq UltraScale+ MPSoC TRM at https://www.xilinx.com/support/documentation/user_guides/ug1085-zynq-ultrascale-trm.pdf.
The CPU cluster topology is built around an ARM Cortex-A53 CPU, which comes in a quad-core or a dual-core MPCore. The CPU cores implement the ARMv8-A architecture with support for the A64 instruction set in AArch64 or the A32/T32 instruction sets in AArch32. Each CPU core comes with an L1 instruction cache with parity protection and an L1 data cache with ECC protection. The L1 instruction cache is 2-way set-associative, while the L1 data cache is 4-way set-associative. Each core also has its own MMU, an FPU, and a NEON SIMD engine. The CPU cluster has a 16-way set-associative L2 common cache and an SCU with an ACP port that extends cache coherency beyond the cluster to external masters in the PL. Each CPU core provides a performance figure of 2.3 DMIPS/MHz with an operating frequency of up to 1.5 GHz. The CPU core is also TrustZone-enabled for secure operations.
The CPU cluster can operate in symmetric SMP and asymmetric AMP modes with the power island gating for each processor core. Its unified Level 2 cache is ECC protected, is 1 MB in size, and is common to all CPU cores and both instructions and data.
The APU has a 128-bit AXI coherent extension (ACE) port that connects to the PS cache coherent interconnect (CCI), which is associated with the system memory management unit (SMMU). The APU has an ACP slave port that allows the PL master to coherently access the APU caches.
The APU has a GICv2 generic interrupt controller (GIC). The GIC acts as a distributor of interrupts to the CPU cluster cores according to the user’s configuration, with support for priority, preemption, virtualization, and security. Each CPU core contains four of the ARM generic timers. The cluster has a watchdog timer (WDT), one global timer, and two triple timers/counters (TTCs).
The RPU contains a dual-core ARM Cortex-R5F cluster. The CPU cores are 32-bit real-time profile CPUs based on the ARMv7-R architecture. Each CPU core is associated with tightly coupled memory (TCM). TCM access is deterministic, making it well suited to hosting real-time, latency-sensitive application code and data. The CPU cores have 32 KB L1 instruction and data caches. The RPU has an interrupt controller and interfaces to the PS elements and the PL via two AXI4 ports connected to the low-power domain switch. Software debugging and tracing are done via the ARM CoreSight debug subsystem.
The PS includes an ARM Mali-400 GPU. The GPU includes a geometry processor (GP) and has an MMU and a Level 2 cache that’s 64 KB in size. The GPU supports OpenGL ES 1.1 and 2.0, as well as OpenVG 1.1 standards.
The video codec unit (VCU) supports H.265 and H.264 video encoding and decoding standards. The VCU can concurrently encode/decode up to 4Kx2K at 60 frames per second (FPS).
The PMU augments the PS with many functionalities for startup and low power modes, some of which are as follows:
System boot and initialization
Manages the wakeup events and low processing power tasks when the APU and RPU are in low-power states
Controls the power-up and restarts on wakeup
Sequences the low-level events needed for power-up, power-down, and reset
Manages the clock gating and power domains
Handles system errors and their associated reporting
Performs memory scrubbing for error detection at runtime
The PS has 8-channel DMA engines that support memory-to-memory and memory-to-peripheral transactions, as well as scatter-gather operations. Their interfaces are based on the AXI protocol. They are split into two categories: the low power domain (LPD) DMA and the full power domain (FPD) DMA. The LPD DMA is I/O coherent with the CCI, whereas the FPD DMA is not.
In this section, we will look at the various Zynq UltraScale+ MPSoC memory interfaces.
The PS has a multiport DDR SDRAM memory controller. Its internal interface consists of six AXI data ports and an AXI control interface. There is a port dedicated to the RPU, while two ports are connected to the CCI; the remaining ports are shared between the DisplayPort controller, the FPD DMA, and the PL. Different types of SDRAM DDR memories are supported, namely DDR3, DDR3L, LPDDR3, DDR4, and LPDDR4.
The external SMC supports managed NAND flash (eMMC 4.51) and NAND flash (24-bit ECC). Serial NOR flash is also supported via 1-bit, 2-bit, Quad-SPI, and dual Quad-SPI (8-bit).
The PS also has an on-chip RAM that’s 256 KB in size, which provides low latency storage for the CPU cores. The OCM controller provides eight exclusive access monitors to help implement inter-cluster atomic primitives for access to shared memory regions within the MPSoC.
The OCM memory is implemented as a 32-bit wide memory for achieving a high read/write throughput and uses read-modify-write operations for accesses that are smaller in size. It also has a protection unit and divides the OCM address space into 64 regions, where each region can have separate security and access attributes.
There are two Quad-SPI controllers in the IOP block of the PS, as follows:
A legacy Quad-SPI (LQSPI) controller that