Build Your Own Programming Language - Clinton  L. Jeffery - E-Book

Build Your Own Programming Language E-Book

Clinton L. Jeffery

0,0
35,99 €

-100%
Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.
Beschreibung

There are many reasons to build a programming language: out of necessity, as a learning exercise, or just for fun. Whatever your reasons, this book gives you the tools to succeed.
You’ll build the frontend of a compiler for your language and generate a lexical analyzer and parser using Lex and YACC tools. Then you’ll explore a series of syntax tree traversals before looking at code generation for a bytecode virtual machine or native code. In this edition, a new chapter has been added to assist you in comprehending the nuances and distinctions between preprocessors and transpilers. Code examples have been modernized, expanded, and rigorously tested, and all content has undergone thorough refreshing. You’ll learn to implement code generation techniques using practical examples, including the Unicon Preprocessor and transpiling Jzero code to Unicon. You'll move to domain-specific language features and learn to create them as built-in operators and functions. You’ll also cover garbage collection.
Dr. Jeffery’s experiences building the Unicon language are used to add context to the concepts, and relevant examples are provided in both Unicon and Java so that you can follow along in your language of choice.
By the end of this book, you'll be able to build and deploy your own domain-specific language.

Das E-Book können Sie in Legimi-Apps oder einer beliebigen App lesen, die das folgende Format unterstützen:

EPUB
MOBI

Seitenzahl: 760

Veröffentlichungsjahr: 2024

Bewertungen
0,0
0
0
0
0
0
Mehr Informationen
Mehr Informationen
Legimi prüft nicht, ob Rezensionen von Nutzern stammen, die den betreffenden Titel tatsächlich gekauft oder gelesen/gehört haben. Wir entfernen aber gefälschte Rezensionen.



Build Your Own Programming Language

Second Edition

A programmer’s guide to designing compilers, interpreters, and DSLs for modern computing problems

Clinton L. Jeffery

BIRMINGHAM—MUMBAI

Build Your Own Programming Language

Second Edition

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Senior Publishing Product Manager: Denim Pinto

Acquisition Editor: Peer Reviews: Gaurav Gavas

Project Editor: Parvathy Nair

Content Development Editor: Elliot Dallow

Copy Editor: Safis Editing

Technical Editor: Aneri Patel

Proofreader: Safis Editing

Indexer: Hemangini Bari

Presentation Designer: Ajay Patule

Developer Relations Marketing Executive: Vidhi Vashisth

First published: December 2021

Second edition: January 2024

Production reference: 3150524

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80461-802-8

www.packt.com

Foreword

In the dynamic world of computer science, the creation of a programming language stands as a testament to ingenuity and a deep understanding of computational principles. Build Your Own Programming Language is not just a guide; it is an invitation to delve into the complexity and beauty of programming language creation.

At the helm of this voyage is Clinton L. Jeffery, a distinguished professor and Chair of the Department of Computer Science and Engineering at the New Mexico Institute of Mining and Technology. His academic journey, marked by degrees from the University of Washington and the University of Arizona, has been a path of relentless exploration in the realms of programming languages, program monitoring, and visualization, among others. His work culminates in the creation of the Unicon programming language, a testament to his expertise and vision.

This book is structured to guide the reader through the nuanced process of developing a programming language. Beginning with motivations and types of language implementations, Jeffery sets the stage for understanding the fundamental “why” behind language design. He intricately discusses organizing a bytecode language and differentiates between programming languages and libraries, laying a solid foundation for both novices and experienced programmers.

The detailed chapters delve into the heart of language design, parsing, and the construction of syntax trees, with practical examples and case studies like the development of Unicon and the Jzero language. Jeffery’s approach is meticulous, ensuring that readers grasp the essentials of technical requirements, lexical categories, context-free grammar, and symbol tables. This comprehensive coverage ensures that readers are not just following instructions but are truly understanding the principles at play.

What makes this book exceptional is its blend of theoretical knowledge and practical application. Jeffery does not shy away from the complexities of designing graphics facilities or tackling syntax trees and symbol tables. Instead, he embraces these challenges, guiding the reader with clarity and insight. The inclusion of questions at the end of each chapter prompts critical thinking and reflection, reinforcing the overall learning experience.

As you progress through Build Your Own Programming Language, you will find yourself not just acquiring knowledge, but also developing a new perspective on programming languages. They are not merely tools for tasks but are expressive mediums that reflect human creativity and problem-solving skills.

Clinton L. Jeffery, with his extensive experience and pioneering work in Unicon, provides a comprehensive and enlightening guide for anyone interested in the art and science of programming language development. Whether you are a student, a professional programmer, or an enthusiast of computer science, this book is a beacon, illuminating the path to understanding and creating your own programming language.

Welcome to a journey of discovery, creativity, and technical mastery in the world of programming languages!

Imran Ahmad, PhD

Senior Data Scientist, Canadian Federal Government

Contributors

About the author

Clinton L. Jeffery is Professor and Chair of the Department of Computer Science and Engineering at the New Mexico Institute of Mining and Technology. He received his B.S. from the University of Washington, and M.S. and Ph.D. degrees from the University of Arizona, all in computer science. He has conducted research and written many books and papers on programming languages, program monitoring, debugging, graphics, virtual environments, and visualization. With colleagues, he invented the Unicon programming language, hosted at unicon.org.

Steve Wampler, Sana Algaraibeh, and Phillip Thomas provided valuable feedback and suggestions for improving this book.

About the reviewers

Steve Wampler was awarded a Ph.D. in Computer Science by the University of Arizona. He has worked as an Associate Professor of Computer Science as well as a software designer for several major telescope projects, including the Gemini 8m telescopes project and the Daniel K Inouye solar telescope. He has been a software reviewer for a number of other major telescope systems and a technical reviewer for several other programming books.

Sana Algaraibeh was awarded her Ph.D. in Computer Science from the University of Idaho. She joined the faculty at the New Mexico Institute of Mining and Technology as a Computer Science instructor in 2022. Prior to that, she worked in academia for 14+ years as a lecturer, trainer, team leader, instructional designer, and Computer Science department chair at universities in Jordan and Saudi Arabia. She teaches Internet and Web Programming, Object-Oriented Programming, Python for Data Science, Algorithms and Data Structures, and Introduction to Programming. Her area of scholarship is computer science education and compiler error messages. She is interested in developing computational solutions integrated with modern pedagogy.

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://discord.com/invite/zGVbWaxqbw

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Section I: Programming Language Frontends

Why Build Another Programming Language?

Motivations for writing your own programming language

Types of programming language implementations

Organizing a bytecode language implementation

Languages used in the examples

The difference between programming languages and libraries

Applicability to other software engineering tasks

Establishing the requirements for your language

Case study – requirements that inspired the Unicon language

Unicon requirement #1 – preserve what people love about Icon

Unicon requirement #2 – support large-scale programs working on big data

Unicon requirement #3 – high-level input/output for modern applications

Unicon requirement #4 – provide universally implementable system interfaces

Summary

Questions

Programming Language Design

Determining the kinds of words and punctuation to provide in your language

Specifying the control flow

Deciding on what kinds of data to support

Atomic types

Composite types

Domain-specific types

Overall program structure

Completing the Jzero language definition

Case study – designing graphics facilities in Unicon

Language support for 2D graphics

Adding support for 3D graphics

Summary

Questions

Scanning Source Code

Technical requirements

Lexemes, lexical categories, and tokens

Regular expressions

Regular expression rules

Regular expression examples

Using UFlex and JFlex

Header section

Regular expressions section

Writing a simple source code scanner

Running your scanner

Tokens and lexical attributes

Expanding our example to construct tokens

Writing a scanner for Jzero

The Jzero flex specification

Unicon Jzero code

Java Jzero code

Running the Jzero scanner

Regular expressions are not always enough

Summary

Questions

Parsing

Technical requirements

Syntax analysis

Context-free grammars

Writing context-free grammar rules

Writing rules for programming constructs

Using iyacc and BYACC/J

Declaring symbols in the header section

Advanced yacc declarations

Putting together the yacc context-free grammar section

Understanding yacc parsers

Fixing conflicts in yacc parsers

Syntax error recovery

Putting together a toy example

Writing a parser for Jzero

The Jzero lex specification

The Jzero yacc specification

Unicon Jzero code

Java Jzero parser code

Running the Jzero parser

Improving syntax error messages

Adding detail to Unicon syntax error messages

Adding detail to Java syntax error messages

Using Merr to generate better syntax error messages

Summary

Questions

Syntax Trees

Technical requirements

Using GNU Make

Learning about trees

Defining a syntax tree type

Parse trees versus syntax trees

Creating leaves from terminal symbols

Wrapping tokens in leaves

Working with YACC’s value stack

Wrapping leaves for the parser’s value stack

Determining which leaves you need

Building internal nodes from production rules

Accessing tree nodes on the value stack

Using the tree node factory method

Forming syntax trees for the Jzero language

Debugging and testing your syntax tree

Avoiding common syntax tree bugs

Printing your tree in a text format

Printing your tree using dot

Summary

Questions

Section II: Syntax Tree Traversals

Symbol Tables

Technical requirements

Establishing the groundwork for symbol tables

Declarations and scopes

Assigning and dereferencing variables

Choosing the right tree traversal for the job

Creating and populating symbol tables for each scope

Adding semantic attributes to syntax trees

Defining classes for symbol tables and symbol table entries

Creating symbol tables

Populating symbol tables

Synthesizing the isConst attribute

Checking for undeclared variables

Identifying the bodies of methods

Spotting uses of variables within method bodies

Finding redeclared variables

Inserting symbols into the symbol table

Reporting semantic errors

Handling package and class scopes in Unicon

Mangling names

Inserting self for member variable references

Inserting self as the first parameter in method calls

Testing and debugging symbol tables

Summary

Questions

Checking Base Types

Technical requirements

Type representation in the compiler

Defining a base class for representing types

Subclassing the base class for complex types

Assigning type information to declared variables

Synthesizing types from reserved words

Inheriting types into a list of variables

Determining the type at each syntax tree node

Determining the type at the leaves

Calculating and checking the types at internal nodes

Runtime type checks and type inference in Unicon

Summary

Questions

Checking Types on Arrays, Method Calls, and Structure Accesses

Technical requirements

Checking operations on array types

Handling array variable declarations

Checking types during array creation

Checking types during array accesses

Checking method calls

Calculating the parameters and return type information

Checking the types at each method call site

Checking the type at return statements

Checking structured type accesses

Handling instance variable declarations

Checking types at instance creation

Checking types of instance accesses

Summary

Questions

Intermediate Code Generation

Technical requirements

What is intermediate code?

Why generate intermediate code?

Learning about the memory regions in the generated program

Introducing data types for intermediate code

Adding the intermediate code attributes to the tree

Generating labels and temporary variables

An intermediate code instruction set

Instructions

Declarations

Annotating syntax trees with labels for control flow

Generating code for expressions

Generating code for control flow

Generating label targets for condition expressions

Generating code for loops

Generating intermediate code for method calls

Reviewing the generated intermediate code

Summary

Questions

Syntax Coloring in an IDE

Writing your own IDE versus supporting an existing one

Downloading the software used in this chapter

Adding support for your language to Visual Studio Code

Configuring Visual Studio Code to do Syntax Highlighting for Jzero

Visual Studio Code extensions using the JSON format

JSON atomic types

JSON collections

File organization for Visual Studio Code extensions

The extensions file

The extension manifest

Writing IDE tokenization rules using TextMate grammars

Integrating a compiler into a programmer’s editor

Analyzing source code from within the IDE

Sending compiler output to the IDE

Avoiding reparsing the entire file on every change

Using lexical information to colorize tokens

Extending the EditableTextList component to support color

Coloring individual tokens as they are drawn

Highlighting errors using parse results

Summary

Questions

Section III: Code Generation and Runtime Systems

Preprocessors and Transpilers

Understanding preprocessors

A preprocessing example

Identity preprocessors and pretty printers

The preprocessor within the Unicon preprocessor

Code generation in the Unicon preprocessor

Transforming objects into classes

Generating source code from the syntax tree

Closure-based inheritance in Unicon

The difference between preprocessors and transpilers

Transpiling Jzero code to Unicon

Semantic attributes for transpiling to Unicon

A code generation model for Jzero

The Jzero to Unicon transpiler code generation method

Transpiling the base cases: names and literals

Handling the dot operator

Mapping Java expressions to Unicon

Transpiler code for method calls

Assignments

Transpiler code for control structures

Transpiling Jzero declarations

Transpiling Jzero block statements

Transpiling a Jzero class into a Unicon package that contains a class

Summary

Questions

Bytecode Interpreters

Technical requirements

Understanding what bytecode is

Comparing bytecode with intermediate code

Building a bytecode instruction set for Jzero

Defining the Jzero bytecode file format

Understanding the basics of stack machine operation

Implementing a bytecode interpreter

Loading bytecode into memory

Initializing the interpreter state

Fetching instructions and advancing the instruction pointer

Instruction decoding

Executing instructions

Starting up the Jzero interpreter

Writing a runtime system for Jzero

Running a Jzero program

Examining iconx, the Unicon bytecode interpreter

Understanding goal-directed bytecode

Leaving type information in at runtime

Fetching, decoding, and executing instructions

Crafting the rest of the runtime system

Summary

Questions

Generating Bytecode

Technical requirements

Converting intermediate code to Jzero bytecode

Adding a class for bytecode instructions

Mapping intermediate code addresses to bytecode addresses

Implementing the bytecode generator method

Generating bytecode for simple expressions

Generating code for pointer manipulation

Generating bytecode for branches and conditional branches

Generating code for method calls and returns

Handling labels and other pseudo-instructions in intermediate code

Comparing bytecode assembler with binary formats

Printing bytecode in assembler format

Printing bytecode in binary format

Linking, loading, and including the runtime system

Unicon example – bytecode generation in icont

Summary

Questions

Native Code Generation

Technical requirements

Deciding whether to generate native code

Introducing the x64 instruction set

Adding a class for x64 instructions

Mapping memory regions to x64 register-based address modes

Using registers

Starting from a null strategy

Assigning registers to speed up the local region

Converting intermediate code to x64 code

Mapping intermediate code addresses to x64 locations

Implementing the x64 code generator method

Generating x64 code for simple expressions

Generating code for pointer manipulation

Generating native code for branches and conditional branches

Generating code for method calls and returns

Handling labels and pseudo-instructions

Generating x64 output

Writing the x64 code in assembly language format

Going from native assembler to an object file

Linking, loading, and including the runtime system

Summary

Questions

Leave a review!

Implementing Operators and Built-In Functions

Implementing operators

Comparing adding operators to adding new hardware

Implementing string concatenation in intermediate code

Adding String concatenation to the bytecode interpreter

Adding String concatenation to the native runtime system

Writing built-in functions

Adding built-in functions to the bytecode interpreter

Writing built-in functions for use with the native code implementation

Integrating built-ins with control structures

Developing operators and functions for Unicon

Writing operators in Unicon

Developing Unicon’s built-in functions

Summary

Questions

Domain Control Structures

Knowing when a new control structure is needed

Scanning strings in Icon and Unicon

Scanning environments and their primitive operations

Eliminating excessive parameters via a control structure

Rendering regions in Unicon

Rendering 3D graphics from a display list

Specifying rendering regions using built-in functions

Varying levels of detail using nested rendering regions

Creating a rendering region control structure

Adding a reserved word for rendering regions

Adding a grammar rule

Checking wsection for semantic errors

Generating code for a wsection control structure

Summary

Questions

Garbage Collection

Grasping the importance of garbage collection

Counting references to objects

Adding reference counting to Jzero

Reducing the number of heap allocations for strings

Modifying the generated code for the assignment operator

Modifying the generated code for method call and return

The drawbacks and limitations of reference counting

Marking live data and sweeping the rest

Organizing heap memory regions

Traversing the basis to mark live data

Marking the block region

Reclaiming live memory and placing it into contiguous chunks

Summary

Questions

Final Thoughts

Reflecting on what was learned from writing this book

Deciding where to go from here

Studying programming language design

Learning about implementing interpreters and bytecode machines

Acquiring expertise in code optimization

Monitoring and debugging program executions

Designing and implementing IDEs and GUI builders

Exploring references for further reading

Studying programming language design

Learning about implementing interpreters and bytecode machines

Acquiring expertise in native code and code optimization

Monitoring and debugging program executions

Designing and implementing IDEs and GUI builders

Summary

Section IV: Appendix

Appendix: Unicon Essentials

Syntactic shorthand

Running Unicon

Using Unicon’s declarations and data types

Declaring program components

Using atomic data types

Numeric

Textual

Aggregating multiple values using structure types

Classes

Lists

Tables

Sets

Files

Other types

Evaluating expressions

Forming expressions using operators

Invoking procedures, functions, and methods

Iterating and selecting what and how to execute

Generators

Debugging and environmental issues

Learning the basics of the UDB debugger

Environment variables

Preprocessor

Preprocessor commands

Built-in macro definitions

Function mini-reference

Selected keywords

Answers

Other Books You May Enjoy

Index

Landmarks

Cover

Index

Share your thoughts

Once you’ve read Build Your Own Programming Language, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781804618028

Submit your proof of purchaseThat’s it! We’ll send your free PDF and other benefits to your email directly

Section I

Programming Language Frontends

In this section, you will create a basic language design and implement the frontend of a compiler for it, including a lexical analyzer and a parser that builds a syntax tree from an input source file.

This section comprises the following chapters:

Chapter 1, Why Build Another Programming Language?Chapter 2, Programming Language DesignChapter 3, Scanning Source CodeChapter 4, ParsingChapter 5, Syntax Trees

1

Why Build Another Programming Language?

This book will show you how to build your own programming language, but first, you should ask yourself, why would I want to do this? For a few of you, the answer will be simple: because it is so much fun. However, for the rest of us, it is a lot of work to build a programming language, and we need to be sure about it before we make that kind of effort. Do you have the patience and persistence that it takes?

This chapter points out a few good reasons to build your own programming language, as well as some circumstances in which you don’t need to build your contemplated language. After all, designing a class library for your application domain is often simpler and just as effective. However, libraries have their limitations, and sometimes, only a new language will do.

After this chapter, the rest of this book will take for granted that, having considered things carefully, you have decided to build a language. But first, we’re going to consider our initial options by covering the following main topics in this chapter:

Motivations for writing your own programming languageTypes of programming language implementationsOrganizing a bytecode language implementationLanguages used in the examplesThe difference between programming languages and librariesApplicability to other software engineering tasksEstablishing the requirements for your languageCase study – requirements that inspired the Unicon language

Let’s start by looking at motivations.

Motivations for writing your own programming language

Sure, some programming language inventors are rock stars of computer science, such as Dennis Ritchie or Guido van Rossum! Becoming a rock star in computer science was easier back in the previous century. In 1993, I heard the following report from an attendee of the second ACM History of Programming Languages Conference: “The consensus was that the field of programming languages is dead. All the important languages have been invented already.” This was proven wildly wrong a year or two later when Java hit the scene, and perhaps a dozen times since then when important languages such as Go emerged. After a mere six decades, it would be unwise to claim our field is mature and that there’s nothing new to invent that might make you famous.

In any case, celebrity is a bad reason to build a programming language. The chances of acquiring fame or fortune from your programming language invention are slim. Curiosity and a desire to know how things work are valid reasons, so long as you’ve got the time and inclination, but perhaps the best reason to build your own programming language is necessity.

Some folks need to build a new language, or a new implementation of an existing programming language, to target a new processor or compete with a rival company. If that’s not you, then perhaps you’ve looked at the best languages (and compilers or interpreters) available for some domain that you are developing programs for, and they are missing some key features for what you are doing, and those missing features are causing you pain. This is the stuff Master’s theses and PhD dissertations are made of. Every once in a blue moon, someone comes up with a whole new style of computing for which a new programming paradigm requires a new language.

While we are discussing your motivations for building a language, let’s also talk about the different kinds of languages, how they are organized, and the examples this book will use to guide you.

Types of programming language implementations

Whatever your reasons, before you build a programming language, you should pick the best tools and technologies you can find to do the job. In our case, this book will pick them for you. First, there is a question of the implementation language, which is to say, the language that you are building your language in.

Programming language academics like to brag about writing their language in that language itself, but this is usually only a half-truth (or someone was being very impractical and showing off at the same time). There is also the question of just what kind of programming language implementation to build:

A pure interpreter that executes the source code itselfA native compiler and a runtime system, such as in CA transpiler that translates your language into some other high-level languageA bytecode compiler with an accompanying bytecode machine, such as in Java

The first option is fun, but the resulting language is usually too slow to satisfy real-world project requirements. The second option is often optimal, but may be too labor-intensive; a good native compiler may take years of effort.

The third option is by far the easiest and probably the most fun, and I have used it before with good success. Don’t discount a transpiler implementation as a kind of cheating, but do be aware that it has its problems. The first version of C++, AT&T’s cfront tool, was a transpiler, but that gave way to compilers, and not just because cfront was buggy. Strangely, generating high-level code seems to make your language even more dependent on the underlying language than the other options, and languages are moving targets. Good languages have died because their underlying dependencies disappeared or broke irreparably on them. It can be the death of a thousand cuts.

For the most part, this book focuses on the fourth option; over the course of several chapters, we will build a bytecode compiler with an accompanying bytecode machine because that is a sweet spot that gives a lot of flexibility, while still offering decent performance. A chapter on transpilers and preprocessors is provided for those of you who may prefer to implement your language by generating code for another high-level language. A chapter on native code compilation is also included, for those of you who require the fastest possible execution.

The notion of a bytecode machine is very old; it was made famous by UCSD’s Pascal implementation and the classic SmallTalk-80 implementation, among others. It became ubiquitous to the point of entering lay English with the promulgation of Java’s JVM. Bytecode machines are abstract processors interpreted by software; they are often called virtual machines (as in Java Virtual Machine), although I will not use that terminology because it is also used to refer to software tools that implement real hardware instruction sets, such as IBM’s classic platforms, or more modern tools such as Virtual Box.

A bytecode machine is typically quite a bit higher level than a piece of hardware, so a bytecode implementation affords much flexibility. Let’s have a quick look at what it will take to get there…

Organizing a bytecode language implementation

To a large extent, the organization of this book follows the classic organization of a bytecode compiler and its corresponding virtual machine. These components are defined here, followed by a diagram to summarize them:

A lexical analyzer reads in source code characters and figures out how they are grouped into a sequence of words or tokens.A syntax analyzer reads in a sequence of tokens and determines whether that sequence is legal, according to the grammar of the language. If the tokens are in a legal order, it produces a syntax tree.A semantic analyzer checks to ensure that all the names being used are legal for the operations in which they are being used. It checks their types to determine exactly what operations are being performed. All this checking makes the syntax tree heavy, laden with extra information about where variables are declared and what their types are.An intermediate code generator figures out memory locations for all the variables and all the places where a program may abruptly change execution flow, such as loops and function calls. It adds them to the syntax tree and then walks this even fatter tree, before building a list of machine-independent intermediate code instructions.A final code generator turns the list of intermediate code instructions into the actual bytecode, in a file format that will be efficient to load and execute.

In addition to the steps of this bytecode virtual machine compiler, a bytecode interpreter is written to load and execute programs. It is a giant loop with a switch statement in it. For very high-level programming languages, the compiler might be no big deal, and all the magic may be in the bytecode interpreter. The whole organization can be summarized by the following diagram:

Figure 1.1: Phases and dataflow in a simple programming language

It will take a lot of code to illustrate how to build a bytecode machine implementation of a programming language. How that code is presented is important and will tell you what you need to know going in, as well as what you may learn from going through this book.

Languages used in the examples

This book provides code examples in two languages using a parallel translations model. The first language is Java because that language is ubiquitous. Hopefully, you know Java (or C++, or C#) and will be able to read the examples with intermediate proficiency. The second example language is the author’s own language, Unicon. While reading this book, you can judge for yourself which language is better suited to building programming languages. As many examples as possible are provided in both languages, and the examples in the two languages are written as similarly as possible. Sometimes, this will be to the advantage of Java, which is a bit lower level than Unicon. There are sometimes fancier or shorter ways to write things in Unicon, but our Unicon examples will stick as close to Java as possible. The differences between Java and Unicon will be obvious, but they are somewhat lessened in importance by the compiler construction tools we will use.

This book uses modern descendants of the venerable Lex and YACC tools to generate our scanner and parser. Lex and YACC are declarative programming languages that solve some of our hard problems at a higher level than Java or Unicon. It would have been nice if a modern descendant of Lex and YACC (such as ANTLR) supported both Java and Unicon, but such is not the case. One of the very cool parts of this book is this: by choosing tools for Java and Unicon that are very compatible with the original Lex and YACC and extending them a bit, we have managed to use the same lexical and syntax specifications of our compiler in both Java and Unicon!

While Java and Unicon are our implementation languages, we need to talk about one more language: the example language we are building. It is a stand-in for whatever language you decide to build. Somewhat arbitrarily, this book introduces a language called Jzero for this purpose. Niklaus Wirth invented a toy language called PL/0 (programming language zero; the name is a riff on the language name PL/1) that was used in compiler construction courses. Jzero is a tiny subset of Java that serves a similar purpose. I looked pretty hard (that is, I googled Jzero and then Jzero compiler) to see whether someone had already posted a Jzero definition we could use and did not spot one by that name, so we will just make it up as we go along.

The Java examples in this book will be tested using Java 21; maybe other recent versions of Java will work. You can get OpenJDK from http://openjdk.org, or if you are on Linux, your operating system probably has an OpenJDK package that you can install. Additional programming language construction tools (Jflex and byacc/j) that are required for the Java examples will be introduced in subsequent chapters as they are used. The Java implementations we will support might be more constrained by which versions will run these language construction tools than anything else.

The Unicon examples in this book work with Unicon version 13.3, which can be obtained from http://unicon.org. To install Unicon on Windows, you must download a .msi file and run the installer. To install on Linux, you should follow the instructions found on the unicon.org site.

Having gone through the basic organization of a programming language and the implementation that this book will use, perhaps we should take another look at when a programming language is called for, and when building one can be avoided by developing a library instead.

The difference between programming languages and libraries

Unless you are in it for the “fun” or the intellectual experience, building a programming language is a lot of work that might not be necessary. If your motives are strictly utilitarian, you don’t have to make a programming language when a library will do the job. Libraries are by far the most common way to extend an existing programming language to perform a new task. A library is a set of functions or classes that can be used together to write applications for some hardware or software technology. Many languages, including C and Java, are designed almost completely to revolve around a rich set of libraries. The language itself is very simple and general, while much of what a developer must learn to develop applications consists of how to use the various libraries.

The following is what libraries can do:

Introduce new data types (classes) and provide public functions (an API) to manipulate themProvide a layer of abstraction on top of a set of hardware or operating system calls

The following is what libraries cannot do:

Introduce new control structures and syntax in support of new application domainsEmbed/support new semantics within the existing language runtime system

Libraries do some things badly, so you might end up preferring to make a new language:

Libraries often get larger and more complex than necessary.Libraries can have even steeper learning curves and poorer documentation than languages.Every so often, libraries have conflicts with other libraries.Applications that use libraries can become broken if the library changes incompatibly in a later version.

There is a natural evolutionary path from a library to a language. A reasonable approach to building a new language to support an application domain is to start by making or buying the best library available for that application domain. If the result does not meet your requirements in terms of supporting the domain and simplifying the task of writing programs for that domain, then you have a strong argument for a new language.

This book is about building your own language, not just building your own library. It turns out that learning about tools and techniques to implement programming languages is useful in many other contexts.

Applicability to other software engineering tasks

The tools and technologies you learn about from building your own programming language can be applied to a range of other software engineering tasks. For example, you can sort almost any file or network input processing task into three categories:

Reading XML data with an XML libraryReading JSON data with a JSON libraryReading anything else by writing code to parse it in its native format

The technologies in this book are useful in a wide array of software engineering tasks, which is where the third of these categories is encountered. Frequently, structured data must be read in a custom file format.

For some of you, the experience of building your own programming language might be the single largest program you have written thus far. If you persist and finish it, it will teach you lots of practical software engineering skills, besides whatever you learn about compilers, interpreters, and the such. This will include working with large dynamic data structures, software testing, and debugging complex problems, among other skills.

That’s enough of the inspirational motivation. Let’s talk about what you should do first: figure out your requirements.

Establishing the requirements for your language

After you are sure you need a new programming language for what you are doing, take a few minutes to establish the requirements. This is open-ended. It is you defining what success for your project will look like. Wise language inventors do not create a whole new syntax from scratch. Instead, they define it in terms of a set of modifications to make to a popular existing language.

Many great programming languages (Lisp, Forth, Smalltalk, and many others) had their success significantly limited by the degree to which their syntax was unnecessarily different from mainstream languages. Still, your language requirements include what it will look like, and that includes syntax.

More importantly, you must define a set of control structures or semantics where your programming language needs to go beyond existing language(s). This will sometimes include special support for an application domain that is not well served by existing languages and their libraries. Such domain-specific languages (DSLs) are common enough that whole books are focused on that topic. Our goal for this book will be to focus on the nuts and bolts of building the compiler and runtime system for such a language, independent of whatever domain you may be working in.

In a normal software engineering process, requirements analysis would start with brainstorming lists of functional and non-functional requirements. Functional requirements for a programming language involve the specifics of how the end user developer will interact with it. You might not anticipate all the command-line options for your language up front, but you probably know whether interactivity is required, or whether a separate compile step is OK. The discussion of interpreters and compilers in the previous section, and this book’s presentation of a compiler, might seem to make that choice for you, but Python is an example of a language that provides a fully interactive interface, even though the source code you type into Python gets compiled into bytecode and executed by a bytecode machine, rather than being interpreted directly.

Non-functional requirements are properties that your programming language must achieve that are not directly tied to the end user developer’s interactions. They include things such as what operating system(s) your language must run on, how fast execution must be, or how little space the programs written in your language must run within.

The non-functional requirement regarding how fast execution must be usually determines the answer as to whether you can target a software (bytecode) machine or need to target native code. Native code is not just faster; it is also considerably more difficult to generate, and it might make your language considerably less flexible in terms of runtime system features. You might choose to target bytecode first, and then work on a native code generator afterward.

The first language I learned to program on was a BASIC interpreter in which the programs had to run within 4 KB of RAM. BASIC at the time had a low memory footprint requirement. But even in modern times, it is not uncommon to find yourself on a platform where Java won’t run by default! For example, on virtual machines with configured memory limits for user processes, you may have to learn some awkward command-line options to compile or run even simple Java programs.

In addition to identifying functional and non-functional requirements, many requirements analysis approaches also define a set of use cases and ask the developer to write descriptions for them. Inventing a programming language is different from your average software engineering project, but before you are finished, you may want to go there and perform such a use case analysis. A use case is a task that someone performs using a software application. When the software application is a programming language, if you are not careful, the use cases may be too general to be useful, such as write my application and run my program. While those two might not be very useful, you might want to think about whether your programming language implementation must support program development, debugging, separate compilation and linking, integration with external languages and libraries, and so forth. Most of those topics are beyond the scope of this book, but we will consider some of them.

Since this book presents the implementation of a language called Jzero, here are some requirements for Jzero. Some of these requirements may appear arbitrary. You could certainly add your own requirements and produce your own Java dialect, but this list describes what we are aiming for in this book. If it is not clear to you where one of the following requirements came from, it either came from our source inspiration language (plzero) or previous experience teaching compiler construction:

Jzero should be a strict subset of Java. All legal Jzero programs should be legal Java programs. This requirement allows us to check the behavior of our test programs when we are debugging our language implementation.Jzero should provide enough features to allow interesting computations. This includes if statements, while loops, and multiple functions, along with parameters.Jzero should support a few data types, including Booleans, integers, arrays, and the String type. However, it only needs to support a subset of their functionality, (as you’ll see later). These types are enough to allow input and output of interesting values into a computation.Jzero should emit decent error messages, showing the filename and line number, including messages for attempts to use Java features not in Jzero. We will need reasonable error messages to debug the implementation.Jzero should run fast enough to be practical. This requirement is vague, but it implies that we won’t be doing a pure interpreter. Pure interpreters that execute source code directly without any internal code generation step are a very retro thing, evocative of the 1960s and 1970s. They tend to execute unacceptably slowly by modern standards. On the other hand, you might very well decide that your language should provide the highly interactive look and feel of a pure interpreter, like Python does. Anyhow, that is not in Jzero’s requirements.Jzero should be as simple as possible so that I can explain it. Sadly, this rules out writing a full description of a native code generator or even an implementation that targets JVM bytecode; we will provide our own simple bytecode machine.

Perhaps more requirements will emerge as we go along, but this is a start. Since we are constrained for time and space, perhaps this requirements list is more important for what it does not say, rather than for what it does say. By way of comparison, here are some of the requirements that led to the creation of the Unicon programming language.

Case study – requirements that inspired the Unicon language

This book will use the Unicon programming language, located at http://unicon.org, for a running case study. We can start with reasonable questions such as, why build Unicon, and what are its requirements? To answer the first question, we will work backward from the second one.

Unicon exists because of an earlier programming language called Icon, from the University of Arizona (http://www.cs.arizona.edu/icon/). Icon has particularly good string and list processing facilities and is used to write many scripts and utilities, as well as both programming language and natural language processing projects. Icon’s fantastic built-in data types, including structure types such as lists and (hash) tables, have influenced several languages, including Python and Unicon. Icon’s signature research contribution is its integration of goal-directed evaluation, including backtracking and automatic resumption of generators, into a familiar mainstream syntax. This leads us to Unicon’s first requirement.

Unicon requirement #1 – preserve what people love about Icon

One of the things that people love about Icon is its expression semantics, including its generators and goal-directed evaluation. A generator is an expression that is capable of computing more than one result; several popular languages feature generators. Goal-directed evaluation is a semantic to execute code in which expressions either succeed or fail, and when they fail, generators within the expression can be resumed to try alternative results that might make the whole expression succeed. This is a big topic beyond the scope of this section, but if you want to learn more, you can check out The Icon Programming Language, Third Edition, by Ralph and Madge Griswold, at www.cs.arizona.edu/icon.

Icon also provides a rich set of built-in functions and data types so that many or most programs can be understood directly from the source code. Unicon’s preservation goal is 100% compatibility with Icon. In the end, we achieved more like 99% compatibility.

It is a bit of a leap from preserving the best bits to the immortality goal of ensuring old source code will run forever, but for Unicon, we include that as part of requirement #1. We have placed a much firmer requirement on backward compatibility than most modern languages. While C is very backward compatible, C++, Java, Python, and Perl are examples of languages that have wandered away, in some cases far away, from being compatible with the programs written in them back in their glory days. In the case of Unicon, perhaps 99% of Icon programs run unmodified as Unicon programs. Unicon requirement #2 was to support programming in large-scale projects.

Unicon requirement #2 – support large-scale programs working on big data

Icon was designed for maximum programmer productivity on small-sized projects; a typical Icon program is less than 1,000 lines of code, but Icon is very high level, and you can do a lot of computing in a few hundred lines of code! Still, computers keep getting more capable, and modern programmers are often required to write much larger programs than Icon was designed to handle.

For this reason of scalability, Unicon adds classes and packages to Icon, much like C++ adds them to C. Unicon also improved the bytecode object file format and made numerous scalability improvements to the compiler and runtime system. It also refines Icon’s existing implementation to be more scalable in many specific items, such as adopting a much more sophisticated hash function. Unicon requirement #3 is to support ubiquitous input/output capabilities at the same high level as the built-in types.

Unicon requirement #3 – high-level input/output for modern applications