Embedded Systems

Krzysztof Iniewski
Description

Covers the significant embedded computing technologies, highlighting their applications in wireless communication and computing power.

An embedded system is a computer system designed for specific control functions within a larger system, often with real-time computing constraints. It is embedded as part of a complete device, often including hardware and mechanical parts. Presented in three parts, Embedded Systems: Hardware, Design, and Implementation provides readers with an immersive introduction to this rapidly growing segment of the computer industry. Acknowledging the fact that embedded systems control many of today's most common devices, such as smartphones and PC tablets, as well as hardware embedded in cars, TVs, and even refrigerators and heating systems, the book starts with a basic introduction to embedded computing systems. It homes in on system-on-a-chip (SoC), multiprocessor system-on-chip (MPSoC), and network-on-chip (NoC). It then covers on-chip integration of software and custom hardware accelerators, as well as fabric flexibility, custom architectures, and the multiple I/O standards that facilitate PCB integration.

Next, it focuses on the technologies associated with embedded computing systems, going over the basics of field-programmable gate array (FPGA), digital signal processing (DSP), and application-specific integrated circuit (ASIC) technology; architectural support for on-chip integration of custom accelerators with processors; and OS support for these systems. Finally, it offers full details on architecture, testability, and computer-aided design (CAD) support for embedded systems, soft processors, heterogeneous resources, and on-chip storage, before concluding with coverage of software support, in particular the Linux OS.

Embedded Systems: Hardware, Design, and Implementation is an ideal book for design engineers looking to optimize and reduce the size and cost of embedded system products and increase their reliability and performance.


Page count: 577

Publication year: 2012




Table of Contents

Cover

Title page

Copyright page

PREFACE

CONTRIBUTORS

1 Low Power Multicore Processors for Embedded Systems

1.1 MULTICORE CHIP WITH HIGHLY EFFICIENT CORES

1.2 SUPERH™ RISC ENGINE FAMILY (SH) PROCESSOR CORES

1.3 SH-X: A HIGHLY EFFICIENT CPU CORE

1.4 SH-X FPU: A HIGHLY EFFICIENT FPU

1.5 SH-X2: FREQUENCY AND EFFICIENCY ENHANCED CORE

1.6 SH-X3: MULTICORE ARCHITECTURE EXTENSION

1.7 SH-X4: ISA AND ADDRESS SPACE EXTENSION

2 Special-Purpose Hardware for Computational Biology

2.1 MOLECULAR DYNAMICS SIMULATIONS ON GRAPHICS PROCESSING UNITS

2.2 SPECIAL-PURPOSE HARDWARE AND NETWORK TOPOLOGIES FOR MD SIMULATIONS

2.3 QUANTUM MC APPLICATIONS ON FIELD-PROGRAMMABLE GATE ARRAYS

2.4 CONCLUSIONS AND FUTURE DIRECTIONS

3 Embedded GPU Design

3.1 INTRODUCTION

3.2 SYSTEM ARCHITECTURE

3.3 GRAPHICS MODULES DESIGN

3.4 SYSTEM POWER MANAGEMENT

3.5 IMPLEMENTATION RESULTS

3.6 CONCLUSION

4 Low-Cost VLSI Architecture for Random Block-Based Access of Pixels in Modern Image Sensors

4.1 INTRODUCTION

4.2 THE DVP INTERFACE

4.3 THE IBRIDGE-BB ARCHITECTURE

4.4 HARDWARE IMPLEMENTATION

4.5 CONCLUSION

ACKNOWLEDGMENTS

5 Embedded Computing Systems on FPGAs

5.1 FPGA ARCHITECTURE

5.2 FPGA CONFIGURATION TECHNOLOGY

5.3 SOFTWARE SUPPORT

5.4 FINAL SUMMARY OF CHALLENGES AND OPPORTUNITIES FOR EMBEDDED COMPUTING DESIGN ON FPGAS

6 FPGA-Based Emulation Support for Design Space Exploration

6.1 INTRODUCTION

6.2 STATE OF THE ART

6.3 A TOOL FOR ENERGY-AWARE FPGA-BASED EMULATION: THE MADNESS PROJECT EXPERIENCE

6.4 ENABLING FPGA-BASED DSE: RUNTIME-RECONFIGURABLE EMULATORS

6.5 USE CASES

7 FPGA Coprocessing Solution for Real-Time Protein Identification Using Tandem Mass Spectrometry

7.1 INTRODUCTION

7.2 PROTEIN IDENTIFICATION BY SEQUENCE DATABASE SEARCHING USING MS/MS DATA

7.3 RECONFIGURABLE COMPUTING PLATFORM

7.4 FPGA IMPLEMENTATION OF THE MS/MS SEARCH ENGINE

7.5 SUMMARY

ACKNOWLEDGMENTS

8 Real-Time Configurable Phase-Coherent Pipelines

8.1 INTRODUCTION AND PURPOSE

8.2 HISTORY AND RELATED METHODS

8.3 IMPLEMENTATION FRAMEWORK

8.4 PROTOTYPE IMPLEMENTATION

8.5 ASSESSMENT COMPARED WITH RELATED METHODS

9 Low Overhead Radiation Hardening Techniques for Embedded Architectures

9.1 INTRODUCTION

9.2 RECENTLY PROPOSED SEU TOLERANCE TECHNIQUES

9.3 RADIATION-HARDENED RECONFIGURABLE ARRAY WITH INSTRUCTION ROLLBACK

9.4 CONCLUSION

10 Hybrid Partially Adaptive Fault-Tolerant Routing for 3D Networks-on-Chip

10.1 INTRODUCTION

10.2 RELATED WORK

10.3 PROPOSED 4NP-FIRST ROUTING SCHEME

10.4 EXPERIMENTS

10.5 CONCLUSION

11 Interoperability in Electronic Systems

11.1 INTEROPERABILITY

11.2 THE BASIS FOR INTEROPERABILITY: THE OSI MODEL

11.3 HARDWARE

11.4 FIRMWARE

11.5 PARTITIONING THE SYSTEM

11.6 EXAMPLES OF INTEROPERABLE SYSTEMS

12 Software Modeling Approaches for Presilicon System Performance Analysis

12.1 INTRODUCTION

12.2 METHODOLOGIES

12.3 RESULTS

12.4 CONCLUSION

13 Advanced Encryption Standard (AES) Implementation in Embedded Systems

13.1 INTRODUCTION

13.2 FINITE FIELD

13.3 THE AES

13.4 HARDWARE IMPLEMENTATIONS FOR AES

13.5 HIGH-SPEED AES ENCRYPTOR WITH EFFICIENT MERGING TECHNIQUES

13.6 CONCLUSION

14 Reconfigurable Architecture for Cryptography over Binary Finite Fields

14.1 INTRODUCTION

14.2 BACKGROUND

14.3 RECONFIGURABLE PROCESSOR

14.4 RESULTS

14.5 CONCLUSIONS

Index

Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Iniewski, Krzysztof.

 Embedded systems : hardware, design, and implementation / by Krzysztof Iniewski.

pages cm

Includes bibliographical references and index.

 ISBN 978-1-118-35215-1 (hardback)

 1. Embedded computer systems. I. Title.

 TK7895.E42I526 2012

 006.2'2–dc23

2012034412

PREFACE

Embedded computer systems surround us: smartphones, PC (personal computer) tablets, hardware embedded in cars, TVs, and even refrigerators or heating systems. In fact, embedded systems are one of the most rapidly growing segments of the computer industry today. This book presents the fundamentals of these embedded systems and will benefit anyone who has to build, evaluate, or apply them.

Embedded systems have become more and more prevalent over the years. Devices that we use every day have become intelligent and incorporate electronics. Data acquisition products no longer act independently; they are part of an ecosystem of interoperable communication devices. One device acquires some information, another device acquires other information, and that information is sent to a central unit for analysis. This ecosystem idea is powerful and flexible, and it will shape the coming generation of products. Today, we have many examples of interoperable systems whose ecosystems address everyday problems. The ever-growing ecosystem of interoperable devices brings more convenience to the user and more information with which to solve problems. This book explains the elements of such systems and ecosystems, with their use cases and applications described in detail.

Embedded systems are composed of hardware and software computations that are subject to physical real-time constraints. A key challenge to embedded systems design is the development of efficient methods to test and prototype realistic application configurations. This involves ultra high-speed input generation, accurate measurement and output logging, reliable environment modeling, complex timing and synchronization, hardware-in-the-loop simulation, and sophisticated analysis and visualization.

Ever since their introduction decades ago, embedded processors have undergone extremely significant transformations, ranging from relatively simple microcontrollers to tremendously complex systems-on-chip (SoCs). The best example is the iPhone, an embedded system platform that surpasses older personal computers and laptops. This trend will clearly continue, with embedded systems virtually taking over our lives. Whoever can put together the best embedded system on the market (for a given application) will clearly dominate worldwide markets. Apple is already the most valuable company in the world, surpassing Microsoft, Exxon, and Cisco.

With progress in computing power, wireless communication capabilities, and integration of various sensors and actuators, the sky is the limit for the embedded systems applications. With everyone having a smartphone in their pocket, life will be quite different from what it was 5 years ago, and the first signs are clearly visible today.

The book contains 14 carefully selected chapters. They cover areas of multicore processors, embedded graphics processing unit (GPU) and field-programmable gate array (FPGA) designs used in computing, communications, biology, industrial, and space applications. The authors are well-recognized experts in their fields and come from both academia and industry.

With such a wide variety of topics covered, I am hoping that the reader will find something stimulating to read, and discover the field of embedded systems to be both exciting and useful in science and everyday life. Books like this one would not be possible without many creative individuals meeting together in one place to exchange thoughts and ideas in a relaxed atmosphere. I would like to invite you to attend the CMOS Emerging Technologies events that are held annually in beautiful British Columbia, Canada, where many topics covered in this book are discussed. See http://www.cmoset.com for presentation slides from the previous meeting and announcements about future ones.

I would love to hear from you about this book. Please email me at [email protected].

Let the embedded systems of the world prosper and benefit us all!

KRIS INIEWSKI

Vancouver, 2012

CONTRIBUTORS

SAMUEL ANTÃO, INESC-ID/IST, Universidade Técnica de Lisboa, Lisbon, Portugal

FUMIO ARAKAWA, Renesas Electronics Corporation, Tokyo, Japan

ROBERT J. BEYNON, Protein Function Group, Institute of Integrative Biology, University of Liverpool, Liverpool, UK

ISTVÁN BOGDÁN, Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield, UK

SAI RAHUL CHALAMALASETTI, Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA

RICARDO CHAVES, INESC-ID/IST, Universidade Técnica de Lisboa, Lisbon, Portugal

DANIEL COCA, Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield, UK

EZZ EL-MASRY, Dalhousie University, Halifax, Nova Scotia, Canada

KAMAL EL-SANKARY, Dalhousie University, Halifax, Nova Scotia, Canada

ISSAM HAMMAD, Dalhousie University, Halifax, Nova Scotia, Canada

TAREQ HASAN KHAN, Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatchewan, Canada

ANDREW LEONE, STMicroelectronics, Geneva, Switzerland

MARTIN MARGALA, Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA

PAOLO MELONI, Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy

BYEONG-GYU NAM, Chungnam National University, Daejeon, South Korea

SUDEEP PASRICHA, Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO

SOHAN PUROHIT, Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA

LUIGI RAFFO, Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy

FREDERIC RISACHER, Platform Development – Modeling, Research In Motion Limited, Waterloo, Ontario, Canada

DAVID K. RUTISHAUSER, NASA Johnson Space Center, Houston, TX

KENNETH J. SCHULTZ, Platform Development – Modeling, Research In Motion Limited, Waterloo, Ontario, Canada

SIMONE SECCHI, Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy

LESLEY SHANNON, Simon Fraser University, Burnaby, British Columbia, Canada

ROBERT L. SHULER, JR., NASA Johnson Space Center, Houston, TX

LEONEL SOUSA, INESC-ID/IST, Universidade Técnica de Lisboa, Lisbon, Portugal

SIDDHARTH SRINIVASAN, Zymeworks, Vancouver, British Columbia, Canada

KHAN WAHID, Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatchewan, Canada

HOI-JUN YOO, Korea Advanced Institute of Science and Technology, Daejeon, South Korea

YONG ZOU, Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO

1

Low Power Multicore Processors for Embedded Systems

FUMIO ARAKAWA

1.1 MULTICORE CHIP WITH HIGHLY EFFICIENT CORES

A multicore chip is one of the most promising approaches to achieving high performance. Formerly, frequency scaling was the best approach; however, the scaling has hit the power wall, and frequency enhancement is slowing down. Further, the performance of a single processor core is proportional to the square root of its area, a relationship known as Pollack's rule [1], while the power is roughly proportional to the area. This means that lower-performance processors can achieve higher power efficiency. Therefore, we should build multicore chips from relatively low-performance processors.
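The arithmetic behind this argument can be sketched as follows (a simplified model, assuming performance P, core area A, power W, and ideal parallel scaling across cores):

\[ P \propto \sqrt{A}, \qquad W \propto A \;\Rightarrow\; \frac{P}{W} \propto \frac{1}{\sqrt{A}} \]

Replacing one core of area A with n cores of area A/n keeps the total area and power, while the total performance becomes n × sqrt(A/n) = sqrt(n) × sqrt(A), that is, sqrt(n) times that of the single large core.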

The power wall is not a problem only for high-end server systems; embedded systems also face it as they pursue further performance improvements [2]. MIPS, the abbreviation of million instructions per second, is a popular integer-performance measure for embedded processors. Processors of the same performance should take the same time for the same program, but raw MIPS varies with the number of instructions executed for a program. Therefore, the performance of a Dhrystone benchmark relative to that of a VAX 11/780 minicomputer is broadly used [3, 4], because that machine achieved 1 MIPS; the relative performance value is called VAX MIPS or DMIPS, or simply MIPS. GIPS (giga-instructions per second) is then used instead of MIPS to represent higher performance.
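As a concrete illustration, assuming the commonly quoted VAX 11/780 baseline of 1,757 Dhrystones per second:

\[ \mathrm{DMIPS} = \frac{\text{Dhrystone score (Dhrystones/s)}}{1757} \]

A processor scoring 1,757,000 Dhrystones per second would thus be rated at 1,000 DMIPS (1 DGIPS), regardless of how many instructions its ISA needs to run the benchmark.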

Figure 1.1 roughly illustrates the power budgets of chips for various application categories. The horizontal and vertical axes represent performance (DGIPS) and efficiency (DGIPS/W), respectively, in logarithmic scale. The oblique lines represent constant-power (W) lines and constant product lines of the power–performance ratio and the performance (DGIPS²/W). The product roughly indicates the attained degree of the design. There is a trade-off relationship between power efficiency and performance. The power of chips in the server/personal computer (PC) category is limited to around 100 W, and chips above the 100-W oblique line must be used. Similarly, chips roughly above the 10- or 1-W oblique lines must be used for equipped devices/mobile PCs and controllers/mobile devices, respectively. Further, some sensors must use chips above the 0.1-W oblique line, and new categories may grow from this region. Consequently, we must develop high-DGIPS²/W chips to achieve high performance under the power limitations.

FIGURE 1.1. Power budgets of chips for various application categories.

Figure 1.2 maps various processors on a graph whose horizontal and vertical axes represent operating frequency (MHz) and power–frequency ratio (MHz/W), respectively, in logarithmic scale. Figure 1.2 uses MHz or GHz instead of the DGIPS of Figure 1.1, because DGIPS values are disclosed for few server/PC processors. Some power values include leakage current, whereas others do not; some are measured under worst-case conditions, while others are not. Although the MHz value does not directly represent performance, and the power measurement conditions are not identical, the values roughly represent the order of performance and power. The triangles and circles represent embedded and server/PC processors, respectively. The dark gray, light gray, and white plots represent the periods up to 1998, after 2003, and in between, respectively. The GHz²/W figure improved roughly ten times from 1998 to 2003, but only three times from 2003 to 2008. The enhancement of single cores is clearly slowing down; instead, processor chips now typically adopt a multicore architecture.

FIGURE 1.2. Performance and efficiency of various processors.

Figure 1.3 summarizes the multicore chips presented at the International Solid-State Circuits Conference (ISSCC) from 2005 to 2008. All the processor chips presented at ISSCC since 2005 have been multicore ones. The axes are similar to those of Figure 1.2, although the horizontal axis reflects the number of cores. The start and end points of each arrow represent the single-core and multicore configurations, respectively.

FIGURE 1.3. Some multicore chips presented at ISSCC.

The performance of multicore chips has continued to improve, which has compensated for the slowdown in the performance gains of single cores in both the embedded and server/PC processor categories. There are two types of multicore chips. One type integrates multiple chips' functions into a single chip, resulting in a multicore SoC. This integration type has been popular for more than 10 years. Cell phone SoCs have integrated various types of hardware intellectual properties (HW-IPs) that were formerly implemented as separate chips. For example, the SH-Mobile G1 integrated the functions of both the application and baseband processor chips [5], followed by the SH-Mobile G2 [6] and G3 [7, 8], which enhanced both the application and baseband functionality and performance. The other type increases the number of cores to meet requirements for enhanced performance and functionality. The RP-1, RP-2, and RP-X are prototype SoCs, and the SH2A-DUAL [9] and SH-Navi3 [10] are multicore products of this enhancement type. The transition from single-core chips to multicore ones seems to have been successful on the hardware side, and various multicore products are already on the market. However, various issues still need to be addressed for future multicore systems.

The first issue concerns memories and interconnects. Flat memory and interconnect structures are best for software but hardly possible in hardware, so some hierarchical structure is necessary. The power consumed by on-chip interconnects for communications and data transfers degrades power efficiency, and a more effective scheme must be established. Maintaining the external input/output (I/O) performance per core is more difficult than increasing the number of cores, because the number of pins per transistor decreases with finer processes. Therefore, a breakthrough is needed in order to maintain I/O performance.

The second issue concerns runtime environments. Performance scalability was supported by the operating frequency in single-core systems, but it must be supported by the number of cores in multicore systems. Therefore, the number of cores must be made invisible, or virtualized with small overhead, by the runtime environment. A multicore system will integrate different subsystems called domains. Domain separation improves system reliability by preventing interference between domains; on the other hand, well-controlled domain interoperation results in an efficient integrated system.

The third issue relates to software development environments. Multicore systems will not be efficient unless the software can extract application parallelism and utilize parallel hardware resources. We have already accumulated a huge amount of legacy software for single cores. Some legacy software can successfully be ported, especially for the integration type of multicore SoCs, like the SH-Mobile G series; it is more difficult for the enhancement type. We must either make a single program run on multiple cores, or distribute functions now running on a single core across multiple cores. Therefore, we must improve the portability of legacy software to multicore systems. Developing new, highly parallel software is another issue. An application or parallelization specialist could do this, although it might be necessary to have specialists in both areas. Further, we need a paradigm shift in development, for example, a higher level of abstraction, new parallel languages, and assistant tools for effective parallelization.

1.2 SUPERH™ RISC ENGINE FAMILY (SH) PROCESSOR CORES

As mentioned above, a multicore chip is one of the most promising approaches to realizing high efficiency, which is the key factor in achieving high performance under fixed power and cost budgets. Therefore, embedded systems are employing multicore architectures more and more. A multicore chip is good at multiplying single-core performance while maintaining core efficiency, but it does not enhance the efficiency of the core itself. Therefore, we must use highly efficient cores. The SuperH™ (Renesas Electronics, Tokyo) reduced instruction set computer (RISC) engine family (SH) processor cores are typical highly efficient embedded central processing unit (CPU) cores for both single- and multicore chips.

1.2.1 History of SH Processor Cores

Since the beginning of microprocessor history, processors for PCs and servers have continuously advanced their performance while maintaining a price range from hundreds to thousands of dollars [11, 12]. On the other hand, single-chip microcontrollers have continuously reduced their prices, to a range from dozens of cents to several dollars, while maintaining their performance, and have been built into various products [13]. As a result, there was no demand for processors in the middle price range of tens to hundreds of dollars.

However, with the introduction of home game consoles in the late 1980s and the digitization of home electronic appliances from the 1990s, demand arose in this price range for processors suitable for multimedia processing. Instead of seeking the highest performance, such processors attached great importance to high efficiency: for example, the performance is 1/10 of a PC processor's but the price is 1/100, or the performance equals a PC processor's for the product's key function but the price is 1/10. Improving area efficiency thus became the important issue for such processors.

In the late 1990s, high-performance processors consumed too much power for mobile devices, such as cellular phones and digital cameras, and demand increased for processors with higher performance and lower power for multimedia processing. Therefore, improving power efficiency became the important issue. Furthermore, as the 2000s began, more functions were integrated using ever finer processes, but the increase in initial and development costs became a serious problem. As a result, flexible specifications and cost reduction became important issues. In addition, the finer processes suffered from more leakage current.

Under the above background, embedded processors were introduced to meet the requirements, and have improved the area, power, and development efficiencies. The SH processor cores are one of such highly efficient CPU cores.

The first SH processor was developed in 1993, based on the SuperH architecture, as an embedded processor. Since then, the SH processors have been developed to provide suitable performance for multimedia processing together with area and power efficiency. In general, performance improvement degrades efficiency, as Pollack's rule indicates [1]. However, we can find ways to improve both performance and efficiency. Although each method is individually a small improvement, together they can still make a difference.

The first-generation product, SH-1, was manufactured using a 0.8-µm process, operated at 20 MHz, and achieved performance of 16 MIPS in 500 mW. It was a high performance single-chip microcontroller, and integrated a read-only memory (ROM), a random access memory (RAM), a direct memory access controller (DMAC), and an interrupt controller.

The second-generation product, SH-2, was manufactured in 1994 using the same 0.8-µm process as the SH-1 [14]. It operated at 28.5 MHz and achieved 25 MIPS in 500 mW through optimizations made in the redesign from the SH-1. The SH-2 integrated a cache memory and an SDRAM controller instead of the ROM and RAM of the SH-1, as it was designed for systems using external memories. An integrated SDRAM controller was not popular at that time, but it eliminated external circuitry and contributed to system cost reduction. In addition, the SH-2 integrated a 32-bit multiplier and a divider to accelerate multimedia processing. It was built into a home game console, one of the most popular digital appliances of the time. The SH-2 extended the application field of the SH processors to digital appliances with multimedia processing.

The third-generation product, SH-3, was manufactured using a 0.5-µm process in 1995 [15]. It operated at 60 MHz and achieved 60 MIPS in 500 mW. Its power efficiency was improved for mobile devices; for example, the clock power was reduced by dividing the chip into plural clock regions and operating each region at the most suitable clock frequency. In addition, the SH-3 integrated a memory management unit (MMU) for devices such as personal organizers and handheld PCs. The MMU is necessary for a general-purpose operating system (OS) that enables various application programs to run on the system.

The fourth-generation product, SH-4, was manufactured using a 0.25-µm process in 1997 [16–18]. It operated at 200 MHz and achieved 360 MIPS in 900 mW. The SH-4 was later ported to a 0.18-µm process, and its power efficiency was further improved. The power efficiency and the product of performance and efficiency reached 400 MIPS/W and 0.14 GIPS²/W, respectively, which were among the best values at that time. The product roughly indicates the attained degree of the design, because there is a trade-off relationship between performance and efficiency.

The fifth-generation processor, SH-5, was developed with a newly defined instruction set architecture (ISA) in 2001 [19–21], and the SH-4A, an advanced version of the SH-4, was developed in 2003 while keeping ISA compatibility. The compatibility was important, and the SH-4A was used in various products. The SH-5 and the SH-4A were developed as CPU cores connected to various other HW-IPs on the same chip via a SuperHyway standard internal bus. This approach became practical with the fine 0.13-µm process and made it possible to integrate more functions on a chip, such as a video codec, 3D graphics, and a global positioning system (GPS).

The SH-X, the first generation of the SH-4A processor core series, achieved a performance of 720 MIPS at 250 mW using a 0.13-µm process [22–26]. The power efficiency and the product of performance and efficiency reached 2,880 MIPS/W and 2.1 GIPS²/W, respectively, which were among the best values at that time. The low-power version achieved a performance of 360 MIPS and a power efficiency of 4,500 MIPS/W [27–29].

The SH-X2, the second-generation core, achieved 1,440 MIPS using a 90-nm process, and its low-power version achieved a power efficiency of 6,000 MIPS/W in 2005 [30–32]. It was then integrated into product chips [5–8].

The SH-X3, the third-generation core, supported multicore features for both symmetric and asymmetric multiprocessing (SMP and AMP) [33, 34]. It was developed using a 90-nm generic process in 2006, and achieved 600 MHz and 1,080 MIPS at 360 mW, resulting in 3,000 MIPS/W and 3.2 GIPS²/W. The first prototype chip based on the SH-X3 was the RP-1, which integrated four SH-X3 cores [35–38]; the second was the RP-2, which integrated eight [39–41]. The core was then ported to a 65-nm low-power process and used in product chips [10].

The SH-X4, the latest, fourth-generation core, was developed using a 45-nm low-power process in 2009, and achieved 648 MHz and 1,717 MIPS at 106 mW, resulting in 16,240 MIPS/W and 28 GIPS²/W [42–44].
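These efficiency figures follow directly from the stated performance and power. For the SH-X4:

\[ \frac{1{,}717\ \mathrm{MIPS}}{0.106\ \mathrm{W}} \approx 16{,}200\ \mathrm{MIPS/W}, \qquad \frac{(1.717\ \mathrm{GIPS})^2}{0.106\ \mathrm{W}} \approx 28\ \mathrm{GIPS^2/W}, \]

where the small difference from the quoted 16,240 MIPS/W reflects rounding of the 106-mW power value.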

1.2.2 Highly Efficient ISA

From the beginning of RISC architecture, all RISC processors adopted a 32-bit fixed-length ISA. However, such a RISC ISA produces larger code than a conventional complex instruction set computer (CISC) ISA, and requires a larger capacity of program memory, including the instruction cache. A CISC ISA, on the other hand, is variable-length, defining instructions of various complexities, from simple to complicated. The variable length is good for achieving compact code, but it requires complex decoding and is not suitable for the parallel decoding of plural instructions needed for superscalar issue.

The SH architecture, with its 16-bit fixed-length ISA, was defined in this situation to achieve compact code and simple decoding. The 16-bit fixed-length approach later spread to other processor ISAs, such as ARM Thumb and MIPS16.

As always, a selection has its pros and cons, and the 16-bit fixed-length ISA has some drawbacks: the restricted number of operands and the short literal lengths in the code. For example, a binary-operation instruction modifies one of its operands, so an extra data transfer instruction is necessary if the original value of the modified operand must be kept. A literal load instruction is necessary to use a literal longer than the one an instruction can hold. Further, some instructions use an implicitly defined register, which increases the number of operands with no extra operand field, but requires special treatment to identify and spoils the orthogonality of register-number decoding. Therefore, careful implementation is necessary to handle such special features; the sketch below illustrates both workarounds.
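In SH assembly (a hedged sketch assuming GNU-style assembler syntax; register choices and the constant are illustrative):

    ! ADD Rm,Rn computes Rn = Rn + Rm, destroying Rn.
    ! An extra transfer preserves the original value of R1.
    MOV     R1,R3        ! copy R1 to R3
    ADD     R2,R3        ! R3 = R1 + R2; R1 is unchanged
    ! MOV #imm,Rn holds only an 8-bit signed immediate.
    MOV     #100,R4      ! fits in the instruction
    ! A 32-bit constant needs a PC-relative literal load.
    MOV.L   .Lconst,R5   ! R5 = 0x12345678 from a literal pool
    ...
    .align  2
    .Lconst: .long 0x12345678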

1.2.3 Asymmetric In-Order Dual-Issue Superscalar Architecture

Because conventional superscalar processors gave priority to performance, superscalar architecture was considered inefficient, and scalar architecture remained popular for embedded processors. However, this is not always true. Since the SH-4 design, SH processors have adopted a superscalar architecture by selecting an appropriate microarchitecture with serious consideration of efficiency for an embedded processor.

The asymmetric in-order dual-issue superscalar architecture is the base microarchitecture of the SH processors. This is because it is difficult for a general-purpose program to effectively utilize the simultaneous issue of more than two instructions; the performance enhancement of out-of-order issue is not enough to compensate for its hardware increase; and symmetric superscalar issue requires resource duplication. The selected architecture maintains the efficiency of a conventional scalar-issue design by avoiding these inefficient choices.

The asymmetric superscalar architecture is sensitive to instruction categorization, because instructions in the same category cannot be issued simultaneously. For example, if we placed all floating-point instructions in one category, we could reduce the number of floating-point register ports, but we could not issue a floating-point arithmetic instruction and a floating-point load/store/transfer instruction at the same time, which degrades performance. Therefore, the categorization requires careful trade-off consideration between performance and hardware cost.

First of all, the integer and load/store instructions are used most frequently, and they are categorized into the different groups of integer (INT) and load/store (LS), respectively. This categorization requires an address calculation unit in addition to the conventional arithmetic logic unit (ALU). Branch instructions make up about one-fifth of a program on average. However, it is difficult to use the ALU or the address calculation unit to implement the early-stage branch, which calculates branch addresses one stage earlier than the other types of operations. Therefore, branch instructions are categorized into another group, branch (BR), with a dedicated branch address calculation unit. Even a RISC processor has special instructions that cannot fit superscalar issue. For example, an instruction that changes the processor state is categorized into a nonsuperscalar (NS) group, because most instructions cannot be issued with it.

Programs built on the 16-bit fixed-length ISA frequently use an instruction that transfers a literal or register value to a register. Therefore, the transfer instruction is categorized into the BO group, executable on both the integer and load/store (INT and LS) pipelines, which were originally for the INT and LS groups; a transfer instruction can then be issued with no resource conflict. A usual program cannot utilize all the instruction issue slots of a conventional RISC architecture, which has three-operand instructions and uses transfer instructions less frequently. The extra transfer instructions of the 16-bit fixed-length ISA can be inserted easily, with no resource conflict, into issue slots that would be empty on a conventional RISC, as the pairing sketch below illustrates.
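For example, a BO-group transfer can fill the issue slot next to an INT-group instruction (a hypothetical instruction pair, with registers chosen for illustration):

    ADD     R1,R2        ! INT group, issued on the INT pipeline
    MOV     R4,R5        ! BO group, issued simultaneously on the LS pipeline

Two such MOVs can likewise be paired with each other, one occupying each pipeline.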

The floating-point load/store/transfer instructions and the floating-point arithmetic instructions are categorized into the LS group and a floating-point execution (FE) group, respectively. This categorization increases the number of ports of the floating-point register file, but the performance enhancement justifies the increase. The floating-point transfer instructions are not categorized into the BO group, because neither the INT nor the FE group fits them: the INT pipeline cannot use the floating-point register file, and the FE pipeline is too complicated to handle a simple transfer operation. Further, a transfer instruction is often issued together with an FE-group instruction, so categorizing it outside the FE group is a sufficient condition for performance.

The SH ISA supports floating-point sign negation and absolute value (FNEG and FABS) instructions. Although these instructions seem to fit the FE group, they are categorized into the LS group: their operations are simple enough to execute in the LS pipeline, and combining one with another arithmetic instruction yields a useful operation. For example, the FNEG and floating-point multiply–accumulate (FMAC) instructions together form a multiply-and-subtract operation, as sketched below.
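A sketch of this combination (illustrative registers; FMAC implicitly uses FR0 as one multiplicand):

    FNEG    FR0           ! LS group: FR0 = -FR0
    FMAC    FR0,FR4,FR8   ! FE group: FR8 = FR8 + FR0*FR4
                          !         = FR8 - (original FR0)*FR4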

Table 1.1 summarizes the instruction categories for the asymmetric superscalar architecture, and Table 1.2 shows which pairs of instructions can be issued simultaneously. As an asymmetric superscalar processor, there is one pipeline for each of the INT, LS, BR, and FE groups, and simultaneous issue is limited to a pair of instructions from different groups, except that a pair of BO-group instructions can be issued simultaneously using both the INT and LS pipelines. An NS-group instruction cannot be issued with any other instruction.

TABLE 1.1. Instruction Categories for Asymmetric Superscalar Architecture

INT: ADD; ADDC; ADDV; SUB; SUBC; SUBV; MUL; MULU; MULS; DMULU; DMULS; DIV0U; DIV0S; DIV1; CMP; NEG; NEGC; NOT; DT; MOVT; CLRT; SETT; CLRMAC; CLRS; SETS; TST Rm, Rn; TST imm, R0; AND Rm, Rn; AND imm, R0; OR Rm, Rn; OR imm, R0; XOR Rm, Rn; XOR imm, R0; ROTL; ROTR; ROTCL; ROTCR; SHAL; SHAR; SHAD; SHLD; SHLL; SHLL2; SHLL8; SHLL16; SHLR; SHLR2; SHLR8; SHLR16; EXTU; EXTS; SWAP; XTRCT

LS: MOV (load/store); MOVA; MOVCA; FMOV; FLDI0; FLDI1; FABS; FNEG; FLDS; FSTS; LDS; STS; LDC (except SR/SGR/DBR); STC (except SR); OCBI; OCBP; OCBWB; PREF

BO: MOV imm, Rn; MOV Rm, Rn; NOP

BR: BRA; BSR; BRAF; BSRF; BT; BF; BT/S; BF/S; JMP; JSR; RTS

FE: FADD; FSUB; FMUL; FDIV; FSQRT; FCMP; FLOAT; FTRC; FCNVSD; FCNVDS; FMAC; FIPR; FTRV; FSRRA; FSCA; FRCHG; FSCHG; FPCHG

NS: AND imm, @(R0,GBR); OR imm, @(R0,GBR); XOR imm, @(R0,GBR); TST imm, @(R0,GBR); MAC; SYNCO; MOVLI; MOVCO; LDC (SR/SGR/DBR); STC (SR); RTE; LDTLB; ICBI; PREFI; TAS; TRAPA; SLEEP

TABLE 1.2. Simultaneous Issue of Instructions
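Based on the rules described above, the table can be reconstructed as follows (O: the pair can be issued simultaneously; X: it cannot):

             2nd instruction
1st    INT  LS  BO  BR  FE  NS
INT     X   O   O   O   O   X
LS      O   X   O   O   O   X
BO      O   O   O   O   O   X
BR      O   O   O   X   O   X
FE      O   O   O   O   X   X
NS      X   X   X   X   X   X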

1.3 SH-X: A HIGHLY EFFICIENT CPU CORE

The SH-X enhanced its performance by applying superpipeline architecture to its base microarchitecture, the asymmetric in-order dual-issue superscalar architecture. Without a fundamental change of the architecture or microarchitecture, the operating frequency would be limited by the applied process. Although conventional superpipeline architecture was thought inefficient, just as conventional superscalar architecture was before it was applied to the SH-4, the SH-X core enhanced the operating frequency while maintaining high efficiency.

1.3.1 Microarchitecture Selections

The SH-X has a seven-stage superpipeline, a depth chosen to maintain efficiency; various processors have applied various pipeline depths, up to a highly superpipelined 20 stages [45]. A conventional seven-stage pipeline has lower cycle performance than the five-stage pipelines popular in efficient embedded processors. Therefore, appropriate methods were chosen to enhance and recover the cycle performance with careful trade-off judgments between performance and efficiency. Table 1.3 summarizes the selected microarchitecture.

TABLE 1.3. Microarchitecture Selections of SH-X

Out-of-order issue is a popular method used in high-end processors to enhance cycle performance. However, it requires considerable hardware and is too inefficient, especially for general-purpose register handling. The SH-X adopts in-order issue, except for branch instructions that use no general-purpose register.

The branch penalty is a serious problem for superpipeline architectures. The SH-X adopts branch prediction and out-of-order branch issue, but it adopts neither the more expensive approach of a branch target buffer (BTB) nor ISA-incompatible approaches. Branch prediction is categorized into static and dynamic methods, and static methods require an architecture change to embed the static prediction result in the instruction. Therefore, the SH-X adopts a dynamic method with a branch history table (BHT) and a global history.

The load/store latencies are also a serious problem. Out-of-order issue is effective at hiding the latencies but, as mentioned above, too inefficient to adopt. The SH-X instead adopts delayed execution and a store buffer as more efficient methods.

The selected methods are effective at reducing the pipeline hazards caused by the superpipeline architecture, but not at avoiding the long stall caused by a cache miss to external memory. Such a stall could be avoided by an out-of-order architecture with large-scale buffers, but it is not a serious problem for embedded systems.

1.3.2 Improved Superpipeline Structure

Figure 1.4 illustrates a conventional seven-stage superpipeline structure. The seven stages consist of first and second instruction fetch (I1 and I2) stages and an instruction decoding (ID) stage for all pipelines, plus first to fourth execution (E1, E2, E3, and E4) stages for the INT, LS, and FE pipelines. The FE pipeline has nine stages, with two extra execution stages, E5 and E6.

FIGURE 1.4. Conventional seven-stage superpipeline structure.

A conventional seven-stage pipeline delivers 20% less cycle performance than a five-stage one. This means the performance gain of the superpipeline architecture is only 1.4 × 0.8 = 1.12 times, which would not compensate for the hardware increase. The branch and load-use-conflict penalties grow with the added instruction-fetch and data-load cycles, respectively, and they are the main reason for the 20% performance degradation.

Figure 1.5 illustrates the seven-stage superpipeline structure of the SH-X, with delayed execution, a store buffer, out-of-order branch issue, and flexible forwarding. Compared with the conventional pipeline shown in Figure 1.4, the INT pipeline starts its execution one cycle later, at the E2 stage; store data are buffered in the store buffer at the E4 stage and written to the data cache at the E5 stage; and the data transfers of the floating-point unit (FPU) support flexible forwarding. The BR pipeline starts at the ID stage, but it is not synchronized with the other pipelines, enabling out-of-order branch issue.

FIGURE 1.5. Seven-stage superpipeline structure of SH-X.

The delayed execution is effective at reducing load-use conflicts, as Figure 1.6 illustrates. It also stretches decoding over two stages, except for the address calculation, and relaxes the decoding time. With the conventional architecture shown in Figure 1.4, a load instruction, MOV.L, sets up an R0 value at the ID stage, calculates the load address at the E1 stage, and loads the data from the data cache at the E2 and E3 stages; the load data are available at the end of the E3 stage. An ALU instruction, ADD, sets up the R1 and R2 values at the ID stage and adds them at the E1 stage. The load data must then be forwarded from the E3 stage to the ID stage, and the pipeline stalls for two cycles. With delayed execution, the load instruction executes identically, but the add instruction sets up the R1 and R2 values at the E1 stage and adds them at the E2 stage. The load data are then forwarded from the E3 stage to the E1 stage, and the pipeline stalls for only one cycle, the same as in a conventional five-stage pipeline.

FIGURE 1.6. Load-use conflict reduction by delayed execution.
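Written out as code, the sequence discussed above is (a sketch consistent with Figure 1.6; registers as in the text):

    MOV.L   @R0,R1       ! load: address from R0 at E1, data ready at end of E3
    ADD     R1,R2        ! uses R1; with delayed execution the add runs at E2,
                         ! so E3-to-E1 forwarding costs only a one-cycle stall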

As illustrated in Figure 1.5, a store instruction performs its address calculation, TLB (translation lookaside buffer) and cache-tag accesses, store-data latch, and data store to the cache at the E1, E2, E4, and E5 stages, respectively, whereas a load instruction accesses the cache at the E2 stage. There is thus a three-stage gap in cache-access timing between the E2 stage of a load and the E5 stage of a store. However, loads and stores use the same cache port. Therefore, a load gets priority over a store when their accesses conflict, and the store must wait for a conflict-free cycle. With an N-stage gap, N store-buffer entries are necessary to handle the worst case, a sequence of N consecutive store issues followed by N consecutive load issues; the SH-X implements three entries.

1.3.3 Branch Prediction and Out-of-Order Branch Issue

Figure 1.7 illustrates a branch execution sequence of the SH-X before branch acceleration with a program sequence consisting of compare, conditional-branch, delay-slot, and branch-target instructions.

FIGURE 1.7. Branch execution sequence before branch acceleration.

The conditional-branch and delay-slot instructions are issued three cycles after the compare instruction, and the branch-target instruction is issued three cycles after the branch. The compare operation starts at the E2 stage, owing to the delayed execution, and the result is available in the middle of the E3 stage. The conditional-branch instruction then checks the result in the latter half of the ID stage and generates the target address in the same ID stage, followed by the I1 and I2 stages of the target instruction. As a result, eight issue slots are left empty, that is, four stall cycles occur, as illustrated. This means only one-third of the issue slots are used for the sequence.

Figure 1.8 illustrates the execution sequence of the SH-X after branch acceleration. The branch operation can start with no pipeline stall thanks to branch prediction, which predicts whether the branch is taken. However, this alone is not early enough to reduce the number of empty issue slots to zero. Therefore, the SH-X adopted out-of-order issue for branches that use no general-purpose register.

FIGURE 1.8. Branch execution sequence of SH-X.

The SH-X fetches four instructions per cycle and issues at most two. Therefore, instructions are buffered in an instruction queue (IQ), as illustrated. A branch instruction is searched for in the IQ or in the instruction-cache output at the I2 stage and is provided to the ID stage of the branch pipeline for out-of-order issue, earlier than the other instructions, which are provided to the ID stage in order. The conditional branch instruction is then issued right after it is fetched, while the preceding instructions are still in the IQ, and the issue becomes early enough to reduce the number of empty issue slots to zero. As a result, the target instruction is fetched and decoded at the ID stage right after the delay-slot instruction. This means no branch penalty occurs in the sequence when the preceding or delay-slot instructions stay two or more cycles in the IQ.

The compare result is available at the E3 stage, and the prediction is then checked for a hit or miss. On a miss, the instruction of the correct flow is decoded at the ID stage right after the E3 stage, and a two-cycle stall occurs. If the correct flow is not held in the IQ, the misprediction recovery starts from the I1 stage and takes two more cycles.

Historically, dynamic branch prediction started with a BHT holding one history bit per entry, which recorded whether the branch was taken the last time and predicted the same direction. Later, BHTs with two history bits per entry became popular, using the four direction states of strongly taken, weakly taken, weakly not taken, and strongly not taken to reflect the history of several executions; several types of state transitions were used, including a simple up-down transition. Since each entry held only one or two bits, it was too expensive to attach a tag consisting of part of the branch-instruction address, which was usually about 20 bits for 32-bit addressing. Omitting the tag allowed roughly ten to twenty times more entries. Although different branch instructions could not be distinguished without the tag, so false hits occurred, the merit of the increased entries exceeded the demerit of the false hits. A global history method was also popular for prediction, and it was usually combined with a 2-bit/entry BHT.

The SH-X stalled only two cycles on a prediction miss, so its performance was not very sensitive to the hit ratio. Further, the 1-bit method requires a state change only on a prediction miss, and the change can be made during the stall. Therefore, the SH-X adopted a dynamic branch prediction method with a 4K-entry, 1-bit/entry BHT and a global history. The size was much smaller than the instruction and data caches of 32 kB each, as the arithmetic below shows.
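The hardware cost of this choice is easy to check:

\[ 4096\ \mathrm{entries} \times 1\ \mathrm{bit} = 4\ \mathrm{kbit} = 512\ \mathrm{B}, \]

only 1/64 the size of one 32-kB cache.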

1.3.4 Low Power Technologies

The SH-X achieved excellent power efficiency by using various low-power technologies. Among them, hierarchical clock gating and the pointer-controlled pipeline are explained in this section. Figure 1.9 illustrates a conventional clock-gating method. In this example, the clock tree has four levels, with A-, B-, C-, and D-drivers. The A-driver receives the clock from the clock generator and distributes it to each module in the processor. The B-driver of each module then receives the clock and distributes it within various submodules to 128–256 flip-flops (F/Fs). The B-driver gates the clock with a signal from the clock control register, whose value is statically written by software to stop and start the modules. Next, the C- and D-drivers distribute the clock hierarchically to the leaf F/Fs, using a control clock pin (CCP). The leaf F/Fs are gated by hardware with the CCP to avoid activating them unnecessarily. However, the clock tree in a module is always active while the module is activated by software.

FIGURE 1.9. Conventional clock-gating method. CCP, control clock pin; GCKD, gated clock driver cell.

Figure 1.10 illustrates the clock-gating method of the SH-X. In addition to the clock gating at the B-driver, the C-drivers gate the clock with the signals dynamically generated by hardware to reduce the clock tree activity. As a result, the clock power is 30% less than that of the conventional method.

FIGURE 1.10. Clock-gating method of SH-X. CCP, control clock pin; GCKD, gated clock driver cell.

The superpipeline architecture improved the operating frequency, but it increased the number of F/Fs and the power. Therefore, one of the key design considerations was to reduce the activity ratio of the F/Fs. To address this issue, a pointer-controlled pipeline was developed. It realizes a pseudo-pipeline operation with pointer control. As shown in Figure 1.11a, three pipeline F/Fs are connected in parallel, and the pointer indicates which F/F corresponds to which stage. Only one set of F/Fs is then updated in the pointer-controlled pipeline, whereas all pipeline F/Fs are updated every cycle in the conventional pipeline, as shown in Figure 1.11b.

FIGURE 1.11. F/Fs of (a) pointer-controlled and (b) conventional pipelines.

Table 1.4 shows the relationship between the F/Fs FF0–FF2 and the pipeline stages E2–E4 for each pointer value. For example, when the pointer indexes zero, FF0 latches an input value at E2 and keeps it for three cycles, serving as the E2, E3, and E4 latch, until the pointer indexes zero again and FF0 latches a new input value. This method is good for short-latency operations in a long pipeline. The power of the pipeline F/Fs decreases to one-third for transfer instructions, and decreases by an average of 25% as measured using Dhrystone 2.1.

TABLE 1.4. Relationship of F/Fs and Pipeline Stages
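Reconstructed from the description above, the assignment rotates with the pointer:

Pointer = 0: E2 → FF0, E3 → FF2, E4 → FF1
Pointer = 1: E2 → FF1, E3 → FF0, E4 → FF2
Pointer = 2: E2 → FF2, E3 → FF1, E4 → FF0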

1.3.5 Performance and Efficiency Evaluations

The SH-X performance was measured using the Dhrystone 2.1 benchmark, as were those of the SH-3 and the SH-4. Dhrystone is a popular benchmark for evaluating the integer performance of embedded processors. It is small enough to fit all of its program and data into the caches, and small enough to use at the beginning of processor development. Therefore, the processor core architecture can be evaluated without influence from the system-level architecture, and the evaluation result can be fed back to the architecture design. On the other hand, system-level performance cannot be measured this way, since cache miss rates, external memory access throughput and latencies, and so on are not exercised. The evaluation result includes compiler performance, because the Dhrystone benchmark is written in C.

Figure 1.12 shows the evaluated cycle performance, architectural performance, and actual performance. Starting from the SH-3, five major enhancements were adopted to construct the SH-4 microarchitecture. The SH-3 achieved 1.0 MIPS/MHz when it was released, and the SH-4 compiler raised this to 1.1. The cycle performance of the SH-4 was enhanced to 1.81 MIPS/MHz by the Harvard architecture, superscalar architecture, the addition of the BO group, the early-stage branch, and the zero-cycle MOV operation. The SH-4 thus enhanced cycle performance by 1.65 times over the SH-3, excluding the compiler contribution. The SH-3 was a 60-MHz processor in a 0.5-µm process, estimated to scale to 133 MHz in a 0.25-µm process, while the SH-4 achieved 200 MHz in the same 0.25-µm process; the SH-4 therefore enhanced the frequency by 1.5 times over the SH-3. As a result, the architectural performance of the SH-4 is 1.65 × 1.5 = 2.47 times that of the SH-3.

FIGURE 1.12. Performance improvement of SH-4 and SH-X.

Adopting a conventional seven-stage superpipeline decreased the performance by 18%, to 1.47 MIPS/MHz. The branch prediction, out-of-order branch issue, store buffer, and delayed execution of the SH-X improved the cycle performance by 23% and recovered 1.8 MIPS/MHz. Since the superpipeline architecture achieved a 1.4-times-higher operating frequency, the architectural performance of the SH-X was also 1.4 times that of the SH-4. The actual performance of the SH-X was 720 MIPS at 400 MHz in a 0.13-µm process, twice that of the SH-4 in a 0.25-µm process.

Figures 1.13 and 1.14 show the area and power efficiency improvements, respectively. The upper three graphs of each figure show the architectural performance, relative area/power, and architectural area–/power–performance ratio. The lower three graphs show the actual performance, area/power, and area–/power–performance ratio.

FIGURE 1.13. Area efficiency improvement of SH-4 and SH-X.

FIGURE 1.14. Power efficiency improvement of SH-4 and SH-X.

The area of the SH-X core was 1.8 mm² in a 0.13-µm process, and the area of the SH-4 was estimated at 1.3 mm² if ported to a 0.13-µm process. Therefore, the relative area of the SH-X was 1.4 times that of the SH-4, and 2.26 times that of the SH-3. The architectural area efficiency of the SH-X was then nearly equal to that of the SH-4, and 1.53 times that of the SH-3. The actual area efficiency of the SH-X reached 400 MIPS/mm², 5.4 times the 74 MIPS/mm² of the SH-4.

The SH-4 was estimated to achieve 200 MHz and 360 MIPS at 140 mW at 1.15 V, and 280 MHz and 504 MIPS at 240 mW at 1.25 V, for power efficiencies of 2,500 and 2,100 MIPS/W, respectively. The SH-X, on the other hand, achieved 200 MHz and 360 MIPS at 80 mW at 1.0 V, and 400 MHz and 720 MIPS at 250 mW at 1.25 V, for power efficiencies of 4,500 and 2,880 MIPS/W, respectively. As a result, the power efficiency of the SH-X improved by 1.8 times over that of the SH-4 at the same 200-MHz frequency, and by 1.4 times at the same supply voltage while also enhancing performance by 1.4 times. These were architectural improvements; the actual improvements were multiplied by the process porting.

1.4 SH-X FPU: A HIGHLY EFFICIENT FPU

The floating-point architecture and microarchitecture of the SH processors achieve high multimedia performance and efficiency. The FPU of an SH processor is highly parallel, while keeping the efficiency required for embedded systems, in order to compensate for the insufficient parallelism of the dual-issue superscalar architecture in highly parallel applications like 3D graphics.

In the late 1990s, it became difficult to support the higher resolutions and advanced features of 3D graphics. In particular, it was difficult to avoid overflow and underflow with fixed-point data and their small dynamic range, and there was demand to use floating-point data. Since a four-way parallel operation was easy to implement with fixed-point data, equivalent performance had to be achieved, at reasonable cost, when changing the data type to the floating-point format.

Since an FPU is about three times as large as a fixed-point unit, and a four-way SIMD requires a four-times-larger datapath, it was too expensive to integrate a four-way SIMD FPU. Furthermore, the latency of floating-point operations is long and requires more registers than fixed-point operations. Therefore, efficient parallelization and latency-reduction methods had to be developed.

1.4.1 FPU Architecture of SH Processors

Sixteen is the limit of the number of registers directly specifiable by the 16-bit fixed-length ISA, but the SH FPU architecture defines 32 registers as two banks of 16 registers. The two banks are the front and back banks, named FR0-FR15 and XF0-XF15, respectively, and they are switched by changing a control bit FPSCR.FR in the floating-point status and control register (FPSCR). Most instructions use only the front bank, but some instructions use both the front and back banks. The front-bank registers can be used as eight pairs or four length-4 vectors as well as 16 individual registers, and the back-bank registers can be used as eight pairs or a four-by-four matrix. They are defined as follows:

    DRn = (FR[n], FR[n+1])                        (n = 0, 2, 4, …, 14)
    FVn = (FR[4n], FR[4n+1], FR[4n+2], FR[4n+3])  (n = 0, 1, 2, 3)
    XDn = (XF[n], XF[n+1])                        (n = 0, 2, 4, …, 14)

            ( XF0  XF4  XF8   XF12 )
    XMTRX = ( XF1  XF5  XF9   XF13 )
            ( XF2  XF6  XF10  XF14 )
            ( XF3  XF7  XF11  XF15 )

Since an ordinary SIMD FPU architecture is too expensive for an embedded processor, as described above, another kind of parallelism is applied to the SH processors. Most of an FPU's hardware is devoted to mantissa alignment before the calculation and to normalization and rounding after it. Further, a popular FPU instruction, FMAC, requires three read ports and one write port. Consecutive FMAC operations are a popular sequence for accumulating multiple products; for example, the inner product of two length-4 vectors is such a sequence, and is common in 3D graphics programs. Therefore, a floating-point inner-product instruction (FIPR) is defined to accelerate this sequence with smaller hardware than a SIMD implementation would need. It uses two of the four length-4 vectors as input operands, and modifies the last register of one input vector to store the result. The defining formula is as follows:

    FIPR FVm, FVn: FR[4n+3] = FR[4m] × FR[4n] + FR[4m+1] × FR[4n+1]
                            + FR[4m+2] × FR[4n+2] + FR[4m+3] × FR[4n+3]   (m, n = 0, 1, 2, 3)

This modifying-type definition is similar to those of the other instructions. However, for a length-3 vector operation, which is also popular, the result can be obtained without destroying the inputs by setting one element of the input vectors to zero.

The FIPR produces only one result, one-fourth as many as a four-way SIMD, and so saves normalization and rounding hardware. It requires eight input registers and one output register, fewer than the 12 input and four output registers of a four-way SIMD FMAC. Further, the FIPR takes much less time than the equivalent sequence of one FMUL and three FMACs, and requires only a small number of registers to sustain the peak performance. As a result, the hardware is about half that of the four-way SIMD.
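To make the contrast concrete, the following C sketch (an illustrative model, not the hardware algorithm) compares the serial one-FMUL/three-FMAC accumulation with an FIPR-style fused inner product that sums all four products before a single final rounding:

    /* Serial accumulation: one FMUL followed by three FMACs.
       Each operation rounds its result, so four rounding errors
       can accumulate across the chain. */
    float inner4_fmac(const float a[4], const float b[4]) {
        float acc = a[0] * b[0];   /* FMUL */
        acc += a[1] * b[1];        /* FMAC */
        acc += a[2] * b[2];        /* FMAC */
        acc += a[3] * b[3];        /* FMAC */
        return acc;
    }

    /* FIPR-style: all four products are summed as one operation
       and rounded once at the end (modeled here with a wider
       double accumulator). */
    float inner4_fipr(const float a[4], const float b[4]) {
        double sum = (double)a[0] * b[0] + (double)a[1] * b[1]
                   + (double)a[2] * b[2] + (double)a[3] * b[3];
        return (float)sum;         /* single final rounding */
    }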

The rounding rules of conventional floating-point operations are strictly defined by the American National Standards Institute/Institute of Electrical and Electronics Engineers (ANSI/IEEE) 754 floating-point standard. The rule is to keep the accurate value until rounding. However, each instruction performs its own rounding, and the accumulated rounding error can become very serious. Therefore, a program must avoid such serious rounding errors without relying on hardware if necessary. The sequence of one FMUL and three FMACs can also cause a serious rounding error. For example, the following formula results in zero if we add the terms in the order of the formula by FADD instructions:

    1.FFFFFE × 2^103 + 1.000000 × 2^127 − 1.000000 × 2^127

However, the exact value is 1.FFFFFE × 2^103, so the error of the formula is also 1.FFFFFE × 2^103, which corresponds to the worst-case error of 2^−23 times the maximum term. We can get the exact value if we change the operation order properly. The floating-point standard defines the rule for each operation, but does not define the result of the whole formula, and either result conforms to the standard. Since the FIPR operation is not defined by the standard, we defined its maximum error as “2^(E−25) + rounding error of the result” to make it better than or equal to the average and worst-case errors of the equivalent sequence that conforms to the standard, where E is the maximum exponent of the four products.
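This order dependence is easy to reproduce on any IEEE 754 machine. The C sketch below uses small decimal values instead of the hexadecimal terms above (an illustrative substitution), but demonstrates the same phenomenon: in an unfavorable order the small term is rounded away entirely, while a reordering preserves it exactly:

    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f;  /* ulp of 1.0e8f is 8, so 1.0f is below half an ulp */
        float small = 1.0f;

        float lost = (small + big) - big;  /* small is rounded away: prints 0 */
        float kept = small + (big - big);  /* reordered: prints 1 */

        printf("unfavorable order: %g\n", lost);
        printf("favorable order:   %g\n", kept);
        return 0;
    }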

A length-4 vector transformation is also a popular operation in 3D graphics, and a floating-point transform-vector instruction (FTRV) is defined for it. It would require 20 registers to specify the operands in a modification-type definition. Therefore, the defining formula is as follows, using the four-by-four matrix XMTRX of all the back-bank registers and one of the four front-bank vector registers, FV0-FV3:

    FTRV XMTRX, FVn: FVn = XMTRX × FVn   (n = 0, 1, 2, 3)

Since a 3D object consists of many polygons expressed by length-4 vectors, and one XMTRX is applied to all the vectors of a 3D object, the XMTRX changes infrequently and is well suited to the back bank. The FTRV operation is implemented as four inner-product operations by dividing the XMTRX into four vectors appropriately, and its maximum error is the same as that of the FIPR.
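A minimal sketch of that decomposition follows, reusing the inner4_fipr model from the earlier fragment; the row-gathering loop stands in for the hardware's division of the XMTRX into four vectors:

    /* FTRV modeled as four FIPR-style inner products:
       out[i] = (row i of the 4-by-4 matrix) . v, and the
       result overwrites the input vector, as FTRV does. */
    void ftrv_ref(const float m[4][4], float v[4]) {
        float row[4], out[4];
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                row[j] = m[i][j];          /* gather row i */
            out[i] = inner4_fipr(row, v);  /* one inner product */
        }
        for (int i = 0; i < 4; i++)
            v[i] = out[i];
    }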

The newly defined FIPR and FTRV can enhance the performance, but data-transfer ability then becomes the bottleneck. Therefore, a pair load/store/transfer mode is defined to double the data-move ability. In the pair mode, floating-point move instructions (FMOVs) treat the 32 front- and back-bank floating-point registers as 16 pairs, and directly access all the pairs without the bank switch controlled by the FPSCR.FR bit. The mode switch between the pair and normal modes is controlled by a move-size bit FPSCR.SZ in the FPSCR.

3D graphics requires high performance but uses only single precision. On the other hand, the double-precision format is popular in the server/PC market, and supporting it would ease porting PC applications to a handheld PC. Although the performance requirement is not as high as that of 3D graphics, software emulation is too slow compared with a hardware implementation. Therefore, the SH architecture has single- and double-precision modes, controlled by a precision bit FPSCR.PR of the FPSCR. Further, register-bank, move-size, and precision change instructions (FRCHG, FSCHG, and FPCHG) were defined for fast changes of the modes defined above. This definition conserves the small code space of the 16-bit fixed-length ISA. Some conversion operations between the precisions are necessary but do not fit the mode separation; therefore, the SH architecture defines two conversion instructions in the double-precision mode. An FCNVSD converts single-precision data to double precision, and an FCNVDS converts vice versa. In the double-precision mode, eight pairs of the front-bank registers are used for double-precision data, and one 32-bit register, FPUL, is used for single-precision or integer data, mainly for the conversions.
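As a sketch of how software might drive these mode bits, the fragment below assumes the SH-4-family bit positions (PR = bit 19, SZ = bit 20, FR = bit 21; verify against the target manual) and hypothetical read_fpscr/write_fpscr helpers that would wrap the STS/LDS FPSCR instructions; the dedicated FRCHG, FSCHG, and FPCHG instructions are the fast way to flip these bits in real code:

    #include <stdint.h>

    /* FPSCR mode bits (assumed SH-4-family positions). */
    #define FPSCR_PR (1u << 19)  /* precision: 0 = single, 1 = double  */
    #define FPSCR_SZ (1u << 20)  /* move size: 0 = 32 bit, 1 = pair    */
    #define FPSCR_FR (1u << 21)  /* front/back register-bank selection */

    /* Hypothetical helpers wrapping STS FPSCR,Rn / LDS Rn,FPSCR. */
    extern uint32_t read_fpscr(void);
    extern void write_fpscr(uint32_t v);

    static void enter_pair_move_mode(void) {
        write_fpscr(read_fpscr() | FPSCR_SZ);  /* FMOV now moves pairs */
    }

    static void enter_double_precision(void) {
        write_fpscr(read_fpscr() | FPSCR_PR);  /* arithmetic is double */
    }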

The FDIV and the floating-point square-root instruction (FSQRT) are long-latency instructions that can cause serious performance degradation. The long latencies come mainly from the strict operation definitions of the ANSI/IEEE 754 floating-point standard, which require keeping the accurate value until rounding. However, there is another way if appropriate inaccuracies are allowed.

A floating-point square-root reciprocal approximate (FSRRA) is defined as an elementary function instruction to replace the FDIV, the FSQRT, or their combination, so that these long-latency instructions need not be used. 3D graphics applications require especially many reciprocal and square-root reciprocal values, so the FSRRA is highly effective. Further, 3D graphics requires less accuracy, and single precision without strict rounding is accurate enough. The maximum error of the FSRRA is ±2^(E−21), where E is the exponent value of the FSRRA result. The FSRRA definition is as follows:

    FSRRA FRn: FR[n] = 1 / √FR[n]

A floating-point sine and cosine approximate (FSCA) is defined as another popular elementary function instruction; once the FSRRA is introduced, the extra hardware for the FSCA is not large. The most popular definition of the trigonometric functions uses radians for the angular unit. However, the period in radians is 2π, which cannot be expressed as a simple binary number. Therefore, the FSCA uses a fixed-point number of rotations as the angular expression, consisting of a 16-bit integer part and a 16-bit fraction part. The integer part is then unnecessary for calculating the sine and cosine values, by their periodicity, and the 16-bit fraction part expresses a sufficient resolution of 360/65,536 = 0.0055°. The angular source operand is set in the CPU–FPU communication register FPUL because the angular value is a fixed-point number. The maximum error of the FSCA is ±2^−22, which is an absolute value independent of the result value. The FSCA definition is as follows:

    FSCA FPUL, DRn: FR[n] = sin(2π × FPUL), FR[n+1] = cos(2π × FPUL)

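For reference, the following C fragment models what the two approximate instructions compute; the library calls are only a behavioral model of the results the hardware approximates to within the error bounds stated above:

    #include <math.h>
    #include <stdint.h>

    /* FSRRA model: the square-root reciprocal of x. */
    float fsrra_ref(float x) {
        return 1.0f / sqrtf(x);
    }

    /* FSCA model: sine and cosine of an angle given in FPUL as a
       16.16 fixed-point rotation count; only the 16-bit fraction
       matters because of the functions' periodicity. */
    void fsca_ref(uint32_t fpul, float *s, float *c) {
        float turns = (float)(fpul & 0xFFFFu) / 65536.0f;
        *s = sinf(6.28318530718f * turns);
        *c = cosf(6.28318530718f * turns);
    }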
1.4.2 Implementation of SH-X FPU

Table 1.5 shows the pitches and latencies of the FE-category instructions of the SH-3E, SH-4, and SH-X. As for the SH-X, the simple single-precision instructions FADD, FSUB, FLOAT, and FTRC have three-cycle latencies. Both the single- and double-precision FCMPs have two-cycle latencies. The other single-precision instructions, FMUL, FMAC, and FIPR, as well as the double-precision instructions except FMUL, FCMP, FDIV, and FSQRT, have five-cycle latencies. All of the above instructions have one-cycle pitches.

TABLE 1.5. Pitch/Latency of FE-Category Instructions

The FTRV consists of four FIPR-like operations, resulting in a four-cycle pitch and an eight-cycle latency. The FDIV and FSQRT are out-of-order completion instructions with two-cycle pitches: the first and last cycles initiate the operation of a special resource and postprocess the result, respectively. Their pitches on the special resource, shown in parentheses, are about half the mantissa width, and their latencies are four cycles longer than the special-resource pitches. The FSRRA has a one-cycle pitch, a three-cycle special-resource pitch, and a five-cycle latency. The FSCA has a three-cycle pitch, a five-cycle special-resource pitch, and a seven-cycle latency. The double-precision FMUL has a three-cycle pitch and a seven-cycle latency.

Multiply–accumulate (MAC) is one of the most frequent operations in computation-intensive applications. A four-way SIMD can achieve the same throughput as the FIPR, but the latency is longer and the register file must be larger. Figure 1.15 illustrates the difference according to the pitches and latencies of the FE-category SH-X instructions shown in Table 1.5. In this example, each box shows an operation issue slot. Since FMUL and FMAC have five-cycle latencies, 20 independent operations must be issued for peak throughput in the four-way SIMD case, and the result is available 20 cycles after the FMUL issue. In contrast, five independent operations are enough to reach the peak throughput of a program using FIPRs. Therefore, the FIPR requires only one-quarter of the program parallelism and registers.

FIGURE 1.15. Four-way SIMD versus FIPR.
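As a software view of the same scheduling argument, the sketch below (reusing the inner4_fipr model) keeps five independent inner products in flight, which is enough to cover a five-cycle latency at a one-cycle pitch; a four-way SIMD MAC chain would instead need 20 independent operations in flight:

    /* Five independent dot-product streams hide FIPR's 5-cycle
       latency at a 1-cycle pitch; the scheduler interleaves them. */
    void dot_products(const float (*a)[4], const float (*b)[4],
                      float *out, int n) {
        int i;
        for (i = 0; i + 5 <= n; i += 5) {
            out[i]     = inner4_fipr(a[i],     b[i]);      /* slot 0 */
            out[i + 1] = inner4_fipr(a[i + 1], b[i + 1]);  /* slot 1 */
            out[i + 2] = inner4_fipr(a[i + 2], b[i + 2]);  /* slot 2 */
            out[i + 3] = inner4_fipr(a[i + 3], b[i + 3]);  /* slot 3 */
            out[i + 4] = inner4_fipr(a[i + 4], b[i + 4]);  /* slot 4 */
        }
        for (; i < n; i++)                                 /* remainder */
            out[i] = inner4_fipr(a[i], b[i]);
    }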

Figure 1.16 compares the pitch and latency of an FSRRA with those of the equivalent sequence of an FSQRT and an FDIV, according to Table 1.5. Each of the FSQRT and FDIV occupies 2 and 13 cycles of the MAIN FPU and special resources, respectively, and takes 17 cycles to get its result, so the final result is available 34 cycles after the issue of the FSQRT. In contrast, the pitch and latency of the FSRRA are one and five cycles, which are only one-quarter and roughly one-seventh of those of the equivalent sequence (4 and 34 cycles), respectively. The FSRRA is much faster while using a similar amount of hardware resource.

FIGURE 1.16. FSRRA versus equivalent sequence of FSQRT and FDIV.

The FSRRA can also compute a reciprocal, as shown in Figure 1.17. The FDIV occupies 2 and 13 cycles of the MAIN FPU and special resources, respectively, and takes 17 cycles to get the result. On the other hand, the FSRRA-and-FMUL sequence occupies 2 and 3 cycles of the MAIN FPU and special resources, respectively, and takes 10 cycles to get the result. Therefore, the FSRRA-and-FMUL sequence is better than using the FDIV when an application does not require results conforming to the IEEE standard, and 3D graphics is one such application.

FIGURE 1.17. FDIV versus equivalent sequence of FSRRA and FMUL.
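The identity behind this sequence is 1/x = (1/√x)², so one FSRRA followed by one FMUL replaces an FDIV; a minimal sketch using the fsrra_ref model from above:

    /* Reciprocal without FDIV: square the square-root reciprocal.
       Valid for x > 0; accuracy follows the FSRRA error bound
       rather than IEEE-correct rounding. */
    float recip_fast(float x) {
        float r = fsrra_ref(x);  /* FSRRA */
        return r * r;            /* FMUL  */
    }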

Figure 1.18