Platform Engineering for Architects - Max Körbächer - E-Book

Platform Engineering for Architects E-Book

Max Körbächer

Description

As technology evolves, IT talent shortages and system complexity make it essential to have structured guidance for building scalable, user-focused platforms. This book provides platform engineers and architects with practical strategies to develop internal development platforms that enhance software delivery and operations.
You’ll learn how to identify end users, understand their needs, and define platform goals with a focus on self-service solutions for cloud-native environments. Using real-world examples, the book demonstrates how to build platforms within and for the cloud, leveraging Kubernetes. It also explores the benefits of a product-centric approach to platform engineering, emphasizing early end-user involvement and flexible design principles that adapt to future requirements.
Additionally, the book covers techniques for maintaining a sustainable platform while minimizing technical debt. By the end, you’ll have the knowledge to design, define, and implement platform capabilities that align with your organization’s goals.

Page count: 639

Year of publication: 2024

Platform Engineering for Architects

Crafting modern platforms as a product

Max Körbächer

Andreas Grabner

Hilliary Lipsig

Platform Engineering for Architects

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

The authors acknowledge the use of cutting-edge AI spell checkers, such as Grammarly, with the sole aim of enhancing the language and clarity of the book, thereby ensuring a smooth reading experience for readers. It's important to note that the content itself has been crafted by the authors and edited by a professional publishing team.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Preet Ahuja

Publishing Product Manager: Suwarna Patil

Book Project Manager: Uma Devi

Senior Editor: Adrija Mitra

Technical Editor: Nithik Cheruvakodan

Copy Editor: Safis Editing

Proofreader: Adrija Mitra

Indexer: Tejal Soni

Production Designer: Alishon Mendonca

Senior DevRel Marketing Executive: Rohan Dobhal

First published: October 2024

Production reference: 1270924

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83620-359-9

www.packtpub.com

To my wife, best friend, partner in crime, and North Star, Lesya, for being my loving partner and biggest supporter, who always has my back and calls my creativity.

- Max Körbächer

To my parents, who allowed and supported me to follow a different career path than they initially had in mind. Thanks for being the best role models one could wish for.

- Andreas Grabner

To my husband, Scott, without whom I’d never have become an engineer. To my parents, Annette and Calvin, who have loved me and believed in me – your encouragement guides me. To my parents-in-law, Joe and Mary Ann, who have been there every step, and to my kids, who support my dreams and still think I’m fun to hang around. Thank you for everything, all of you.

- Hilliary Lipsig

Contributors

About the authors

Max Körbächer is a technology advisor and platform architect who focuses on utilizing cloud-native technologies and open source to simplify the challenges of complex systems. He is the founder and managing director of Liquid Reply, a cloud-native engineering and consulting company. His work history includes roles as an enterprise architect in the media and power utility industry and as a demand manager planning medium and large IT projects. Max is also the founder and, currently, co-chair of the CNCF Environmental Sustainability Technical Advisory Group, CNCF Ambassador, Linux Foundation Europe Advisory Board member, and initiator and organizer of the Kubernetes Community Days Munich and Ukraine.

I want to thank my family – especially my wife, Lesya; my best writing supporter, Felix, who stayed for all the late-night writing sessions; and all the open source community people for discussing ideas, solutions, and the future of cloud native.

Andreas Grabner is a technical advocate for making distributed systems observable and making automated data-driven decisions across the software development lifecycle. In his capacity as a CNCF ambassador and a DevRel at Dynatrace, he connects and educates global software engineering communities on building and continuously validating digital services for resiliency, high availability, and security.

Since his early days, he has been passionate about software quality and performance engineering as it results in building excellent digital products. Andi uses his advocacy platforms to share best practices on topics such as observability, progressive delivery, DevOps, site reliability engineering, platform engineering, and digital business operations!

I want to thank my wife, Gabi, for being supportive throughout the whole process, especially while writing during our vacation. Also, a big thanks to my employer and all the global tech communities that supported me along the way!

Hilliary Lipsig is an autodidact and start-up veteran who has frequently learned and applied technologies to get a job done. She’s had her hand in every part of the application delivery process, honing her skills originally as a quality engineer. Hilliary is an IT polyglot, able to talk the lingo of both the Operations and Development teams. She’s currently a Principal Site Reliability Engineer at Red Hat Inc., working on Kubernetes-based platforms. She’s passionate about GitOps, continuous integration, scalable processes, consistency in tooling, and good developer documentation. Her open source activities include contributions to the CNCF Glossary and she’s a member of the Code of Conduct Committee for Kubernetes.

A big thank you to my family, friends, and colleagues who supported me, let me spitball ideas at them, and even volunteered to proofread for me. Your belief in me helped me believe in me.

About the reviewers

Peter Portante has been a Red Hat, Inc. employee since late 2011, working as a senior principal software engineer in the Performance and Scale team for the first 12 years and as an Enterprise Account Solution Architect since the spring of 2023. Prior to Red Hat, he worked at HP/Compaq/Digital on HP ePrintCenter, Tabblo, TruClusters, gWLM, POSIX Threads (two-level scheduling in Tru64 Unix and OpenVMS), and the OpenVMS Print Symbiont.

Fábio Falcão has been a technology professional for over 20 years, with solid training as a SysAdmin and in-depth knowledge of DevOps methodologies and tools. Having graduated in computer science and being passionate about technology and how it helps improve people’s lives, he decided to pursue this career as a child and, today, really loves what he does, helping teams achieve their goals. From Brazil and living in Portugal since 2022, he joined IBM to work with the data lineage team, in addition to supporting several other products linked to artificial intelligence.

I would first like to thank my parents (in memoriam) for all their dedication and encouragement throughout my life. I would also like to thank my wife and children, who support me when I don’t have enough time for them, and my friends, who always help me grow professionally and personally with their advice and chats.

Lukasz Bielinski is a seasoned senior DevOps leader with 15 years of experience in IT, specializing in containerization and digital transformation for large-scale systems across banking, telecommunications, pharmaceuticals, and software industries. He has successfully led and executed numerous Kubernetes projects, significantly enhancing operational efficiency and system reliability. Lukasz co-founded and formalized a start-up where he played a crucial role in introducing Kubernetes and OpenShift into organizations, driving successful digital transformations. His technical acumen spans a wide range of DevOps tools and practices, making him a trusted advisor and leader in the field.

I would like to express my gratitude to my family for their patience and support as I dedicated time to reviewing this book. Special thanks to the authors for entrusting me with this important task and to my colleagues for their ongoing encouragement. I’m grateful to be part of a community that values continuous learning and collaboration.

Thomas Schuetz is a cloud native architect with a keen interest in cloud-native application delivery. He teaches at an Austrian University of Applied Sciences, focusing on cloud-native technologies, where he shares his industry insights and practical experiences with students. Involved in the cloud-native community as a CNCF Ambassador, Thomas is enthusiastic about open source projects, contributing as a Keptn GC member. His approach to software delivery and troubleshooting combines practical know-how with a passion for education, aiming to make cloud-native technologies accessible to all.

I am deeply honored to have been invited to review this book by thought leaders in the cloud-native community. Contributing to this work has been both satisfying and rewarding, and I thoroughly enjoyed reading it. I would also like to express my gratitude to the entire cloud-native community for their continuous innovation and for making advancements such as platform engineering possible.

Biswajyoti Chowdhury has over 13 years of experience in cloud-native platform engineering, has led the development of multiple internal development platform initiatives, and has successfully migrated several large enterprises to the cloud and established their development platforms. He has worked with prominent technology consulting firms such as Accenture, Wipro, and Cognizant. Currently, he serves as an architect at Financial Software & Systems, where he is focused on building a payments-centric development platform. Biswajyoti is passionate about open source, cloud-native infrastructure, and tools that enhance developer productivity. Biswajyoti holds industry certifications such as GCP PCA, ACE, and is a Red Hat Accredited Professional in Cloud-Native development.

I would like to thank my family, my mentor, and my colleagues, who have put up with me and shaped me into who I am today.

Table of Contents

Preface

Part 1 – An Introduction to Platform Engineering and Architecture

1

Platform Engineering and the Art of Crafting Platforms

The demand for platforms as a product

Companies and developers benefit from platforms in an equal manner

Platform case studies and success stories

Projects versus products

Platform as a product

Do you need a platform?

Do we need yet another abstraction layer?

Declutter the abstraction layers

The cognitive load for software engineers and other IT professionals

Implementing developer- and product-focused solutions

The pervasive cloud

Focusing on developer experience

Attributes of platforms

Understanding the socio-technical aspects

Understand user needs in platform design

Foster and enhance collaboration

Cultivating an open, platform-centric culture

Summary

Further reading

2

Understanding Platform Architecture to Build Platform as a Product

Understanding platform principles and defining the purpose of your platform and team

Introducing principles as guardrails for decision-making

Developing the purpose of your platform as a product

Exploring platform architecture – layers, components, and meta-dependencies

Platform component model

Platform composability

Dependencies and the hidden glue

Reference architectures

Opinionated platforms and the cost of quality

Creating your own architecture

Exploring platform as a product – use cases and implementations

Finding the experts and the bottlenecks they cause

Centralizing expertise as a self-service use case

Understanding TVPs

Finding your TVP use case

Good enough versus perfectly done!

TVP – validating our hypothesis

Build, measure, and learn

Looking at the relevant KPIs to make adoption transparent

Defining platform adoption KPIs

Using performance metrics

Summary

Further readings

3

Building the Foundation for Supporting Platform Capabilities

Financial One ACME – our fictitious company

Overcoming platform complexity by finding the right perspective

Applying basic product management – “Don’t give your users a faster horse”

Avoiding the “sunk cost fallacy”

Steps to building the thing that users need – a real-life example

Considering existing processes and integrating a new implementation

Understanding the existing SDLC – “the life cycle of an artifact”

Introducing life cycle events – measuring and improving the efficiency of the SDLC

Presenting the value proposition for improving the existing SDLC/DORA

Designing the infrastructure architecture

Avoid the ivory tower approach – we own the platform!

Organizational constraints – existing infrastructure requirements?

Connectivity constraints – interoperability requirements?

Resiliency constraints – SLAs and other non-functional requirements?

Exploring multi-cloud, multi-SaaS, and the fragmentation of capabilities

Multi-tenancy and ownership as a capability of our platform

The considerations for running on multi-X

Centralized and decentralized platform capabilities

Exploring a reference architecture for our platform

The purpose – self-service for your end users

User interface/dev experience

Core platform components

A platform that’s available, resilient, and secure

Success KPIs and optimization

Summary

Further reading

Part 2 – Designing and Crafting Platforms

4

Architecting the Platform Core – Kubernetes as a Unified Layer

Why Kubernetes plays a vital role, and why it is (not) for everyone

Kubernetes – a place to start, but not the endgame!

Would Financial One ACME pick Kubernetes?

Benefits of picking Kubernetes as the core platform

Global community and CNCF

Leveraging and managing Kubernetes Infrastructure Capabilities

Integrating infrastructure resources

Enable cluster scalability

Network capabilities and extensions

Kubernetes as part of the platform control plane

The problem of external versus internally defined resources

Designing for flexibility, reliability, and robustness

Optimize consumption versus leaving enough head space

Summary

Further Reading

5

Integration, Delivery, and Deployment – Automation is Ubiquitous

An introduction to Continuous X

High-level definition of Continuous X

Continuous X for infrastructure

Continuous X as a system-critical component in our platform

GitOps – Moving from pushing to pulling the desired state

Phase 1 – from source code to container image

Phase 2 – from container image to metadata-enriched deployment artifact

Phase 3 – GitOps – keeping your desired deployment state

Understanding the importance of container and artifact registries as entry points

From container to artifact registry

Building and pushing artifacts to the registry

Managing uploaded artifacts

Vulnerability scanning

Subscribing to the life cycle events of an artifact in the registry

Retention and immutability

Monitoring our registries

Defining the release process and management

Updating deployment to a new version

Batching changes to combat dependencies

Pre- and post-deployment checks

Deployment notifications

Promotions between stages

Blue/green, canary, and feature flagging

Release inventory

Release management – from launch to mission control

Achieving sustainable CI/CD for DevOps – application life cycle orchestration

Artifact life cycle event observability

Working with events

Subscribing to events to orchestrate

Analyzing events

IDPs – the automation Kraken in the platform

Providing templates as Golden Paths for easier starts!

Abstractions through Crossplane

Everything Git-flow-driven

Software catalog

Summary

Further reading

6

Build for Developers and Their Self-Service

Technical requirements

Software versus platform development – avoiding a mix

The platform life cycle versus the software life cycle

Reliability versus serviceability

The conclusion

Reducing cognitive load

Utilizing a platform while balancing cognitive load

Pre-production versus production

Authentication and tenancy

RBAC

Noisy neighbor prevention

Rate limiting and network health

Cluster scaling and other policies

Enabling self-service developer portals

Enforcing quota

Simple repeatable workflows

Landing, expanding, and integrating your IDP

Enforcement of platform-specific standards

Maturity models

Expanding a platform with common platform integrations

Architectural considerations for observability in a platform

Observability in a platform

Centralized observability – when and why you need it

Important metrics

Observability in service for developers

Opening your platform for community and collaboration

Planning in the open

Accepting contributions

Summary

Part 3 – Platforms as a Product Best Practices

7

Building Secure and Compliant Products

Reconciling security to the left and Zero Trust

Understanding platform security – how to build a secure yet flexible and open system

Breaking down the problem into consumable chunks

Common security standards and frameworks

Asset protection

Secret and token management

Secure access

Audit logs

Looking at SBOM practices

How to use an SBOM

Keeping on top of vulnerabilities

Understanding pipeline security – what you have to consider to secure your CI/CD pipelines

Securing your repo

Securing GitOps

Understanding application security – setting and enforcing policies

Foundational application security

FOSS for platform security and how to use it

Patterns and tools for managing security

What would our fictitious company do?

Summary

8

Cost Management and Best Practices

Understanding the cost landscape – is the cloud the way to go?

To cloud or not to cloud – that’s the question

When we opt for the cloud – we have to consider its hidden costs

Where to find transparency

FinOps and cost management

Implementing a tagging strategy to uncover hidden costs

Using tags for a purpose

Tag and label limitations

Defining a tagging strategy

Tagging automation

Consolidated versus separated cost and billing reports

Looking at cost optimization strategies

Streamlining processes

Finding the best deals for the best prices

Designing for the highest utilization and lowest demands

Autoscaling, cold storage, and other tricks for cost optimization

Many shades of autoscaling

Cost-aware engineering

Summary

Further reading

9

Choosing Technical Debt to Unbreak Platforms

Taking technical debts consciously

Moving beyond the thinnest viable platform (TVP) sustainably

Avoid over-engineering

Build versus buy – building a decision tree

The criticality of team buy-in

Using data to drive design decisions

Observability is key

Data retention is technical debt

Maintaining and reworking technical debt

Own your technical debt

Platform composability

Dependencies

Security

Not all technical debt has equal weight

Rewriting versus refactoring data – a practical guide

Determining whether a rewrite is necessary

Examining the external influences on refactoring with an example

Examining a famous rewrite

Transitioning after rewrite

Architectural decision records – document for the Afterworld

Why document software architecture?

What does good technical documentation look like?

Our fictitious company – a final look

Summary

10

Crafting Platform Products for the Future

Continuous changes – learning to age and adapt

The imperative of change

Fostering a culture of change

Iterative evolution and an evolving strategy

Considering sustainable and lightweight architectures and approaches

Enable lightweight architectures

Providing support for the users to adapt

The Golden Path for changes

Providing feedback channels

Golden Paths are the features of our platforms

Start small, expect growth, and don’t become a bottleneck

Golden Paths to Build Golden Paths

A glimpse into the future

The replacement of hypervisors

AI for platforms

OCI Registry as Storage and RegistryOps

Containerized pipelines as code

Platforms – a better future with them?

Summary

Further reading

Index

Other Books You May Enjoy

Preface

Hey, welcome to Platform Engineering for Architects! Platform engineering is the practice of creating environments that can build, test, validate, deploy, and operate software in a secure and cost-efficient way. Platform engineering is about automation, and about enabling the platform’s users, developers, and operations to focus on value creation. A platform, often defined as an Internal Development Platform or Internal Developer Portal (IDP), abstracts away the complexity of the underlying infrastructure and all the moving components required to support the software life cycle from its setup until it goes live in production. But platform engineering is more than just technologies that have to play well together. It requires an open mindset and a holistic approach to define the purpose of the platform, along with following principles in the decision-making process and fostering a culture of change and innovation.

In Platform Engineering for Architects, we will engage you in building up a product mindset to build a solution that ages and matures with time but stays young at heart. Step by step, we will create our strategic direction and define our target architecture for the platform. This will become a living artifact for you and everyone working with you. Throughout the book, we will cover the four different pillars of a platform and the relevant decisions you have to take: the infrastructure representation by Kubernetes, the automation, the self-service capabilities, and the built-in observability and security. By the end, you will be equipped with tools to handle costs including actual infrastructure costs and technical debts. If both are managed well, they can even become your best allies in overcoming organizational obstacles and politics.

For every aspect and topic we write about in this book, we also provide additional sources that cover full technical details. Our goal is to provide you with a framework of references and an approach for defining platform architectures that can mature over time without being tied to a specific technology or version. We are well aware that the only constant in life is change, and this holds true for platform engineering. That’s why we also encourage you as a platform engineer and architect to keep up to date with those changes and bring the product mindset to your users and stakeholders.

Who this book is for

The book is for platform engineers and architects, DevOps engineers, and cloud architects who want to transform their way of implementing cloud-native platforms to use a platform as a product.

This book is also for IT leaders, decision-makers, and IT strategists who are searching for new approaches to improve their systems landscape and software delivery, covered by a holistic approach that goes beyond the simple “you built it, you run it.”

What this book covers

Chapter 1, Platform Engineering and the Art of Crafting Platforms, provides an introduction to platforms and IDPs. It covers the relevance of a product mindset and the ambition to build a system desired by users.

Chapter 2, Understanding Platform Architecture to Build Platforms as a Product, will guide you through all the relevant groundwork and approaches to creating your platform architecture. You will discover the value of the platform as a product, the first implementation of the thinnest viable platform, and how to observe and measure its success and adoption.

Chapter 3, Building the Foundation for Supporting Platform Capabilities, will walk you through the mandatory steps and processes of defining a solid foundation of a platform that can grow from an initial set of features toward key enterprise-supporting platform capabilities.

Chapter 4, Architecting the Platform Core – Kubernetes as a Unified Layer, provides insights into what makes Kubernetes the preferred platform for platform engineers. You will learn about the core integrations and relevant decisions we have to make before we can focus on extra enhancements.

Chapter 5, Integration, Delivery, and Deployment – Automation is Ubiquitous, provides you with a stable understanding of the complexity around building, deploying, testing, validating, securing, operating, releasing, and scaling software and how we can centralize and automate this experience with self-service capabilities.

Chapter 6, Build for Developers and Their Self-Service, reviews concepts around IDP integrations and shares best practices for building resilient, flexible, and user-oriented platforms.

Chapter 7, Building Secure and Compliant Products, elaborates on security standards, frameworks, and trends; how to leverage the software bill of materials; and how to define the right actions to secure the platform without limiting your capabilities. Furthermore, we show you how to ensure the app delivery process will provide hardened and secure software/container packages and how to use policy engine technologies.

Chapter 8, Cost Management and Best Practices, explains the concept of cost-increasing elements of a platform and how to optimize those costs. You will learn about tagging strategies, general cost optimization scenarios, how to use observability to identify optimization potential, and best practices to put them into practice.

Chapter 9, Choosing Technical Debt to Unbreak Platforms, provides you with tools, frameworks, and methods to actively manage your technical debts. Like with costs, technical debts can grow and will have a negative impact on your platform if untreated.

Chapter 10, Crafting Platform Products for the Future, emphasizes the imperative of change and our role as platform engineers in fostering change in a controlled way, balancing reliability and innovation.

To get the most out of this book

You should understand the basics of cloud computing, Kubernetes, the ideas around platform engineering, and how to define those architecture-wise.

Software covered in the book

Kubernetes

Backstage

CI/CD solutions such as GitHub Actions

Keptn

Argo CD

Crossplane

Prometheus

OpenTelemetry

Harbor

OpenFeature

Renovate Bot

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Platform-Engineering-for-Architects. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter/X handles. Here is an example: “The directory path is nested within the chart, and the use of this directory called crds/ allows Helm to pause while the CRDs are added to a cluster before continuing with the chart execution.”

A block of code is set as follows:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  version: v1
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct

Any command-line input or output is written as follows:

$ kubectl label nodes platform-worker2 reserved=reserved
node/platform-worker2 labeled

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “These platform logs should also, as much as possible, be clear of any Personally Identifiable Information (PII).”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Platform Engineering for Architects, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/978-1-83620-359-9

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly

Part 1 – An Introduction to Platform Engineering and Architecture

In the first part, you will learn more about the foundation of platform engineering and platform architecture. We will sharpen your understanding of a platform product mindset, showing you that platform engineering is more than just building systems. Then, we will guide you through the process of creating a platform architecture, from defining the purpose of a platform to the thinnest viable platform, as well as how to measure a platform’s success and acceptance. In Chapter 3, we will focus on designing the foundation of a platform to enhance the user experience and avoid technical complexity.

This part has the following chapters:

Chapter 1, Platform Engineering and the Art of Crafting Platforms
Chapter 2, Understanding Platform Architecture to Build Platforms as a Product
Chapter 3, Building the Foundation for Supporting Platform Capabilities

1

Platform Engineering and the Art of Crafting Platforms

In this first chapter, we will learn how to identify when our organization is in the right state to plan a platform. For this, we will clarify why platforms have become such a relevant topic, how a product mindset fits into this, and what the checkpoints are to find out whether we are ready for a platform or not. We will learn about the platform differences and which platform types are most commonly built.

Next, we will delve into the three core elements of a platform: the pervasive cloud, the developer experience, and the main attributes of a platform. Overall, we will see recurring elements of cloud-native engineering. This leads us to the question of whether we really need yet another abstraction layer. We will also consider whether a platform will help us to overcome the problem of a high cognitive load caused by overengineered complex systems and development processes or just end up being yet another layer. We will reflect on some of those layers to find an answer for ourselves.

Finally, we will go into aspects that go beyond the technology and the implementation of platforms. It is crucial to understand the sociotechnical aspects and put the human, our actual stakeholders, at the center. This allows us to define a better platform product and find approaches for a close collaboration.

In this chapter, we’re going to cover the following main topics:

The demand for platforms as a product
Implementing developer- and product-focused solutions
Do we need yet another abstraction layer?
Sociotechnical aspects

The demand for platforms as a product

In the cloud-native environment, hardly any other topic has built up such a myth in recent years as the term platform and the associated role of the platform engineer. As with the introduction of the first usable CI/CD pipelines, this gold rush led to rapid adoption, often without sense or reason. Now that we have arrived in the valley of knowledge, we can deal extensively with the question: do you need a platform, and if so, how do you design and implement it to ensure that it lasts into the future?

To answer this question, we should first look at what constitutes such a platform. A platform is the combination of different capabilities that are required to master traditional and cloud-native environments so that it supports the end user in the development, delivery, and operation of an application. Platforms can be an enabler to turn non-cloud native infrastructures into valuable resources. However, most computing platforms today provide some sort of API that can be used to automate the deployment and instrumentation of the available resources and build the foundation of a platform. Platforms provide consistency across any kind of resource for the end users and grant access to their capabilities via a self-service API, templates, CLI, or other solutions. The following example also highlights that a platform is composed of many components:

Figure 1.1: Example of a platform/IDP

The topic of platforms usually appears in the context of cloud-native, but why is that so? Cloud-native technologies enable organizations to build and run scalable applications in public, private, and hybrid clouds. This approach is best illustrated by functionalities such as containers, standardized service provisioning, immutable infrastructure, and declarative APIs. Such functionalities realize loosely coupled systems that are resilient, manageable, and observable. These enable developers to make frequent changes with minimal effort. In short, a platform is an enabler for cloud-native computing and uses its tooling to instrumentalize it.

Companies and developers benefit from platforms in an equal manner

The experience of a software engineer on a cloud-native platform differs from developing software natively toward a cloud provider. Building systems focusing on one Cloud Service Provider (CSP) will bind you to the logic of that closed ecosystem. You will surely have a similar effect when you build on cloud-native platforms due to the fact that those are often Kubernetes-centric, utilizing the heavy unification of integrations toward the Kubernetes API. However, the catch is that cloud-native platforms deliver the same experience without you recognizing the underlying infrastructure. As most companies have at least two to three cloud or cloud-like service providers and already have difficulties in adapting those, a cloud-native platform is a game changer [1]. Developing software on a cloud-native platform changes the mindset and architecture. However, without adopting that mindset, the chance of failure is high.

However, there are more aspects to consider for a platform than just unified infrastructure management. Platforms have to be made for a purpose. The common definition of whom platforms are built for, and who the stakeholders for platform engineers are, states that those are exclusively developers. These definitions fall short of mentioning that a whole organization, operational teams, and other specialist teams also benefit from a platform. A platform provides software engineers with a simple access point to build, test, deploy, release, and operate their software. It provides deep insights into the usage and allows the caretaker and administrators to maintain the infrastructure, platform, and integrations fearlessly. To translate this into business terminology, a platform can provide a faster time to market, with more flexibility to change and adjust its components, while keeping reliability and robustness high.

What does this mean for a company now? Due to the shortage of IT professionals on the market, the fast pace of changes in IT, and the overload of training teams for cloud technologies and providers, a platform introduces the right breakpoints for competencies. We need these breakpoints to declutter the trend of putting multiple disciplines into a single role such as DevOps. Also, platform engineers utilizing DevOps methodologies are not DevOps. We actively need to protect this role from repeating the mistakes made with the DevOps role and stay sharp in its definition. Platform engineers integrate the capabilities provided by experts, simplify the usage of those capabilities for developers through their platform, and enable self-service for the engineers. As a result, no developer needs to become an expert in multiple topics such as security, observability, infrastructure configuration and automation, and so on. This is in contrast to a common picture of DevOps engineers, who need to become experts in everything required within their silo to keep their application alive. We will need DevOps in the future for the advanced handling of applications, but we must make their lives easier, too.

The platform provides an integration layer for the bottom-up capabilities that require special knowledge such as security, databases, or even the deployment of VMs or bare-metal servers, as well as top-down usage by developers and DevOps. As visualized in the following figure, the platform engineering team is responsible for providing this layer.

Figure 1.2: Capabilities and responsibilities in a platform-driven organization

Of course, this also means that another team of experts must be trained and educated. However, a comparably small team of platform engineers can usually build and run huge environments. A platform ideally reduces the cognitive load for any other team within the company and lets them focus on their core value again by simplifying the machinery around the development process. This platform helps to reduce the stress and improve transparency. Companies from all over the world frequently share their experience with platforms and platform teams, and the typical tenor is on how they solved problems they couldn’t tackle before, or how much this has improved the quality of their products and services.

These platforms are often called Internal Developer Platforms (IDPs) because they are usually built for an enterprise’s internal development team. Throughout this book, we will use the terms platform, IDP, platform product, and cloud-native platform interchangeably. However, we’ll first highlight certain aspects just a little bit more:

Platform: General term for the cross-cutting layer of technology that allows unification of services for developers
IDP: Emphasizing the aspect of developer, Software Development Life Cycle (SDLC), and tools needed to develop software
Platform product or platform as a product: Highlighting the dedicated team taking care of the evolution ability and long-term commitment of a platform, as well as establishing a different mindset
Cloud-native platform: Focusing on the abstraction and enablement to use standardized APIs and integrations

That perspective might feel fine-grained, but the term platform itself often leads to more confusion. A cloud platform is also a platform, right? A Software as a Service (SaaS) could also be seen as a platform. Referencing a cloud-native platform or IDP gives the right direction and understanding. Depending on your organization’s maturity, it is therefore also essential to clarify these terms and establish a common understanding, language, and shared knowledge.

Platform case studies and success stories

To highlight the positive impact a platform can have, we can look at three totally different companies and their results from using IDPs. All of these cases primarily focus on Backstage as the developer portal and entry point for the IDP.

Spotify, as the inventor of Backstage and the mother of the IDP movement, claims that the following is true of their internal Backstage users:

2.3x more active on GitHub
2x more code changes
2x more deployments
Onboarding time for new developers dropped from 60 days to 20 days

The Expedia Group reports different numbers:

It takes four minutes, on average, to create a new component or app
Over 4,000 users are using the IDP for at least 20 minutes per day
The technical documentation is viewed over 50,000 times per month
Just over 15% of the internal developer tools are integrated with Backstage, already reducing context-switching

Now, the last company we should take a look at is Toyota:

Projects now ship artifacts weekly instead of monthly
8-12 weeks are saved on overhead efforts per team, resulting in over $5 million in reduced costs or time and budget used for value creation
Standardized deployment templates reduce failures and speed up deployment

All those numbers are interesting to understand in the context of a digital-native company, a travel technology corporation, and one of the biggest car manufacturers. Each of them can show a clear positive effect [2].

Projects versus products

Speaking of organization, introducing new solutions is commonly done as a project. So, at some point in time, someone decided to invest money in building their own platform. This approach faces one fundamental problem: the deadline. Projects are required to reach a target within a given time and budget frame. Once the project runs out of time or money, the focus shifts to the operation and maintenance of what was built. These two parts of a life cycle are treated as separate things, creating a sunrise phase followed by a sunset. To explain this a bit, you can see heavy investments, communications, and excitement during the implementation phase. However, after hitting the deadline, the project turns into a dead object that requires maintenance. DevOps didn’t change this behavior; it often simply introduced new names, new roles, and different processes. In the end, budget, people, and attention are turned away to the next project, while just a fraction of the former budget stays. This is frustrating for engineers who have worked hard on the implementation, and it will become frustrating for the organization over time when costs for pure maintenance keep increasing. Meanwhile, the people who built it might leave or join other projects. This short-term view on implementing systems has slowly killed many good projects and team spirits. More importantly, it shows that the business value of the solution isn’t clear. When an implementation, such as a platform, can provide explicit value, there shouldn’t be a reason to turn away attention and cause its sundown.

While running many implementations as projects is a valid approach, it is a death sentence for platform implementations. Regular implementations are feature complete; they can exist after they are done. But a platform will always be moving, always being upgraded, and always implementing new features. When working with open-source and cloud providers, you will learn early how fast tools and software move in their own development cycles. Features, fixes, and security patches are continuously published. This is a significant challenge for larger organizations as they are still used to far slower release cycles. The upside of keeping up with this rapid development is that you, as an organization, can profit from new features and capabilities frequently. It is an innovation driver and enabler, allowing you to implement systems in other ways and solve problems you might not be aware of. Is a problem that isn’t painful for you a real problem? Organizations tend not to consider such things as issues since they are used to only identifying painful processes and approaches.

Let us look at an example. In the current year (2024), the European Union released a law to improve companies’ reporting on their carbon emissions. On a high level, this also includes IT resources. Also, within the last few years, multiple open source foundations and projects have been started to bring transparency to the energy consumption of software. A year ago, we would only have been able to report very rough, highly estimated numbers when it comes to energy consumption for a data center, for a server, and with some manual processing, for a piece of software. Today, we can obtain fine-grained information for any application running on bare metal, hypervisor, or containerized within Kubernetes as tooling has evolved to provide this data. Public CSPs provide more and more insights into their own energy consumption. What can we expect for the coming year? We can expect even better numbers, including the regional carbon mix of the energy and end-to-end visibility and transparency of such numbers. With platforms and platform engineering teams, such transparency will naturally come over time. It doesn’t require a project turning IT upside down. Calling out the demand for it will result in platform engineering teams implementing those capabilities into the platform’s core to benefit everyone who uses that platform to build, deploy, release, and operate software.

This is called a product mindset and it feels natural for platforms to adapt to the demands of their environment.

Platform as a product

Platforms as a product are user-centric, listening to and actively researching the end user demand to keep improving their services. A product is also aware of its value. Similar to any app on your mobile phone, it uses its own value to refinance further development and new features. Here, there are no deadlines and no sunsets. The goal is just to strive to become better with every release. What this gives to an organization is an expert team that keeps actively working on central enablement for providing your business with a platform of value generation.

Designing and developing a platform as a product goes beyond the pure engineering aspect. It faces organizational challenges that should be considered when you actively decide to build a platform. In fact, you have to deliver valid numbers on your benefits and show that your platform carries its own costs. This must be in the mindset of the product owner and platform engineers. The idea here is not to become business people but to be able to clearly communicate the reason for existence and, more importantly, to be a product that doesn’t have a deadline.

Right now, you can find three different types of platforms as products:

IDPs:
Provide a best-in-class experience for software engineers
Enable the development and operations teams for the end-to-end support and visibility of their software
Bring governance, compliance, and security
Establish a self-service for the development teams and simplify the deployment process

Data science and machine learning platforms:
Similar to IDPs and often evolve out of those
Leverage their scalability to research, analyze, and process data cost-efficiently
Overcome complex implementation and make them generally available
Provide direct, secure access to relevant data sources

Low-code/business platforms:
A strong trend toward platforms that make it possible to implement new features with relatively little to almost no coding
We will see them more in the years to come

In our book, we will focus on the product-centric view of architecting IDPs.

Do you need a platform?

Like with any other complex environment, we first have to ask – do we really need a platform? Do we know what we will use it for?

Although platforms can provide a lot of benefits, they are not always the answer to your organizational questions. The signs that you are not ready for a platform yet are as follows:

You have only monolithic applications
You don’t have your own development team
Your DevOps, SysAdmin, or infrastructure team is heavily overworked or siloed
You have very simple applications that can run anywhere
You are having a hard time providing a budget for training to grow the skills of your teams
You usually run commercial, off-the-shelf solutions

On the other hand, when does an IDP make sense to you? The following criteria are indicators that you’re ready for an IDP:

You have requirements for multiple infrastructure environments or foster a multi-cloud strategy
You require advanced control over your environments (security, compliance, and deep insights into infrastructure and application behavior)
Your development team is continuously overloaded with non-valuable tasks
You have a curious and interested DevOps or infrastructure team that has taken its first steps toward a platform without knowing it
Your application requires some kind of orchestration due to microservice architecture because many components or different integrations need to play well together
You want to enable your organization to optimize your IT for costs, transparency, quality, or security

Before you define a platform product for your organization, you should answer all the points on the checklist. It makes sense to have multiple points illustrating why you need it. For example, having a team ask for an IDP, or having someone mention that they’ve heard of it at a conference, is not a strong foundation for making such a decision. The introduction of platforms is a journey, and from our experience, it can become the central focal point for one’s company in a relatively short amount of time. Under such pressure, you still need to have a purpose and direction.
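To make the checklist idea concrete, the two lists above can be captured as a small, fully answered questionnaire. The following is only a minimal sketch: the short labels are our own paraphrases of the list items, and the decision rule (at least two indicators and no warning signs) is an assumption used for illustration, not a rule stated in this chapter.

```python
# Short labels paraphrasing the two lists above (wording is ours, not the book's).
NOT_READY_SIGNS = (
    "only_monolithic_applications",
    "no_own_development_team",
    "ops_team_overworked_or_siloed",
    "very_simple_applications",
    "no_training_budget",
    "mostly_off_the_shelf_software",
)
READY_INDICATORS = (
    "multiple_environments_or_multi_cloud",
    "advanced_control_required",
    "developers_overloaded_with_non_valuable_tasks",
    "curious_ops_team_with_first_platform_steps",
    "orchestration_needed_for_microservices",
    "optimize_it_for_cost_transparency_quality_security",
)


def assess_readiness(answers: dict[str, bool]) -> dict[str, object]:
    """Summarize a fully answered checklist.

    Assumption: "multiple points" is read as at least two ready indicators,
    and any not-ready sign is reported as a blocker.
    """
    all_points = NOT_READY_SIGNS + READY_INDICATORS
    missing = [point for point in all_points if point not in answers]
    if missing:
        # Every point on the checklist should be answered before deciding.
        raise ValueError(f"unanswered checklist points: {missing}")

    reasons = [point for point in READY_INDICATORS if answers[point]]
    blockers = [point for point in NOT_READY_SIGNS if answers[point]]
    return {
        "reasons": reasons,
        "blockers": blockers,
        "worth_exploring": len(reasons) >= 2 and not blockers,
    }


# Example: answer everything with "no", then mark two indicators as true.
answers = {point: False for point in NOT_READY_SIGNS + READY_INDICATORS}
answers["multiple_environments_or_multi_cloud"] = True
answers["developers_overloaded_with_non_valuable_tasks"] = True
summary = assess_readiness(answers)
```

The value of such a sketch lies less in the score than in the list of reasons and blockers it forces you to write down, which can then feed the discussion about purpose and direction.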

Now that we have learned how to define the purpose of our platform, we will need to discuss the question of whether we really need this additional abstraction layer.

Do we need yet another abstraction layer?

Let’s briefly summarize what we have seen so far. A platform puts a bracket around, and a layer of abstraction on top of, your existing infrastructure and environments. The platform enhances it with further capabilities so that your development teams can utilize it in an automated, self-serving way.

From a technical perspective, this represents the next layer of abstraction. Therefore, it is only right to discuss this new fabric we put on top. Going from the bottom to the top, we see the bare metal, followed by the hypervisor for virtualization; this is topped by cloud providers. Some might include containers, Kubernetes, or serverless components – and now, we will add our platform. These are at least four layers, each promising to make the layer underneath simpler and glued together by a hard-to-define meta-level of scripts, Infrastructure as Code (IaC), cloud libraries, and automation. So, do we really need yet another layer, or are we using it to keep ourselves busy building things?

Declutter the abstraction layers

There is no simple answer to that question, but by looking into the purpose of each layer, you might be able to grow your own understanding of it.

Hypervisors were initially introduced to simplify the supply of hosts so that software could run and better utilize servers. Today, they still serve the same purpose but could be replaced by Kubernetes and container runtimes. A key argument against this replacement is that a virtual machine provides better isolation and higher security. Without getting too deep into this discussion, there are options to provide very solid isolation, such as with Kata containers. The only component that causes headaches is the container with the OS and its runtime. Looking some years into the future, WebAssembly (Wasm) could be one part of the answer to that problem. Without an operating system in the container and pure naked binary files, there are almost no gates open for attacks. However, let’s give it some more time.

Infrastructure as a Service (IaaS) providers and public cloud providers enhance this with software-defined storage and networks, reducing the complexity of building your own data center and managing all physical dependencies. In addition, they provide further capabilities of commonly used scenarios, such as databases, load balancing, user management, message queues, or ML playgrounds and pre-trained AI models. This is a very useful implementation, which leads to a rapid development of the industry and an extension of what is possible. However, this also moved the whole industry in a problematic direction. The technology is developing faster than people and organizations, in particular, are able to adapt to it. We see that there is a shortage of professional engineers across the globe, while businesses are looking into providing more digital services every year. Solutions and their dependencies are therefore built natively to the cloud. The return on this effort can be significant. You are able to manage any kind of infrastructure and services with a relatively small group of people globally. Yet the reality also states that the average company has between two and three IaaS and public CSPs, plus (often) its own computing capacities, as well as around 10 SaaS providers, which can go up to around 50 for major enterprises [1]. CSPs also offer between 40 and 200 different services. In other words, we are able to achieve a lot today, but the complexity of those environments has also become significant.

To tame this scale, IaC and Cloud Development Kits (CDKs) have become the tools of choice to manage your landing zones and software integrations. The fun part of the story is that practices such as DevOps, which are commonly misinterpreted, have made things even worse. These misinterpretations have now led to the sudden expectation for developers to also set up and maintain the infrastructure for their software needs.

Last but not least, we have systems based on containers, Kubernetes, or serverless. Dozens of options exist for each of them to provision those environments, deploy the code, and run the components. It’s understandable that there are too many layers you have to take care of. However, their development is reasonable, as you don’t want to do things in the old way of getting software up and running. Pushing code to images and from there to a runtime you choose simplifies the provisioning process.

Overall, to represent the level of complexity, we can think of a three-dimensional object such as a cube made of cubes. The following illustration shows the different service layers, representing different maturity of abstraction and how layer after layer comes together to form an IT environment.

Figure 1.3: The multi-dimensional complexity of computing abstraction and simplification

Now, that figure is oversimplified, considering the hundreds and thousands of options you have in each dimension. However, it still gives a first good hint: if you need a platform, build a wrapper around this construction and tame its immense complexity to harness its power.

The cognitive load for software engineers and other IT professionals

In order to manage all these layers, we need to know about and use many tools, as well as follow various processes. It becomes difficult to focus on the actual job and create value while spending a large amount of time on things that should simplify our work. This is called the cognitive load. Originally made famous by Daniel Bryant, that term puts a bracket around the job overload and mental stress of many developers, as well as other specialists within IT. Reducing the cognitive load brings more happiness and satisfaction, but also effectiveness and reliability, to the engineers. Looking at the following graphic simplifies the perspective on what needs to be handled as a professional across the different decades. However, going forward, we have to reduce this load. AI could be part of this, alongside new concepts for running computing processes, and platforms of course.

Figure 1.4: The extended cognitive load with a projection to an ideal future

Not only does technology change over time but it also piles up. This means that we have to run and maintain legacy systems while changing architecture styles and introducing new programming paradigms and new tools. This also changes the responsibilities and extends them far beyond the typical borders of one’s job description from some years back. Breaking down this problem can reveal an answer to the question of whether we need a platform to solve our problems or whether implementing one will increase the complexity again.

In the end, every organization is different. Some are stuck in the early 2000s, and others continuously try to adapt to what comes next. Even within the same organization, you can often find drastic differences. One department might run everything on some VMs in their own data center, while the next might deploy functions within a global CDN or edge provider. Therefore, it’s on you to draw up a vision, strategy, and goal for your platform, if you really need it.

The complexity we experience and the pressure it puts on engineers need to be countered because IT tends to become more complicated over time. In the upcoming section, we will focus on implementing the right solution for developers to overcome the troublesome direction we are heading in, which can even lead to burnout.

Implementing developer- and product-focused solutions

Throughout the next few years, we will see an evolution of cloud computing. In this context, platforms will play a crucial role. On the one hand, the cloud will be everywhere, becoming an abstraction for infrastructure. It doesn’t matter whether this is in the form of edge computing or very specialized services or offers. On the other hand, as we have learned in the previous section, we have to focus on delivering environments that enable the best experience possible for developers and other roles, so those people can focus on generating value. Bringing these elements together in practice is the key enabler for an IT organization to keep pace with the market while delivering continuous value to your company.

The pervasive cloud

The pervasive cloud is not a single solution. It clusters a variety of cloud-computing capabilities that are undergoing a transformative shift to drive business and innovation significantly. The key advancements focus on the integration of cloud technologies anywhere, from private data centers over distributed computational networks to the edge. However, the pervasive cloud goes beyond that. It follows concepts to bridge physical gaps through sensors, IoT components, mobile devices, and other smart connected solutions. Therefore, it is known under other terms such as ubiquitous computing, ambient intelligence, or everywhere.

Gartner, the research company, assumes that six further technologies will shape the pervasive cloud and define its nature [3]:

Augmented FinOps: Combines DevOps methods with cost optimization and budgeting
Cloud Development Environments (CDEs): Simplify and unify the development environment, reducing human errors and ensuring reproducibility
Cloud sustainability: Achieving environmental, social, and economic benefits, reducing the harmful impact of the strongly growing cloud computing tech, and leveraging its power for good
Cloud-native: Implementing cloud characteristics as defined before
Cloud-out to the edge: CSP capabilities extended to the edge
Wasm: The potential ubiquitous runtime and binary format for everywhere, but not necessarily everything

However, we need to ask why this is now relevant to us as platform engineers, architects, and developers.

First, you can find many of the technologies that we are already working on in these definitions and assumptions. Cloud-native, FinOps, edge, and CDEs are daily realities, while sustainable IT and Wasm have experienced heavy development in recent years. That’s all relevant in making it clear that we are not discussing sci-fi technologies that won’t be attainable within the next 100 years. It’s happening right now and it is ready to be used. We develop and innovate all of those foundations; it just might not be as visible and prominent as GenAI.

Second, to extract the maximum value from cloud investments, businesses must adopt automated operational scaling, leverage cloud-native platform tools, and implement effective governance. These platforms integrate essential services such as SaaS, Platform as a Service (PaaS), and IaaS to create comprehensive product offerings with modular capabilities. IT leaders are encouraged to utilize the modular nature of these platforms to maintain adaptability and agility in the face of rapid market disruptions. Imagine the complexity of such environments without a platform that tames this wide range of motion. Even so, with all that complexity, we need to keep the product mindset in focus, or else it will be hard to provide reliable IT services and solutions in the future.

Figure 1.5: Cloud concepts are found everywhere in a pervasive cloud

Looking at the preceding diagram, you can find elements of the pervasive cloud everywhere. We shouldn’t look at this figure as if those are separate items. Everything is connected. Apps on phones talk with services in the cloud or in local hubs, corporates have multiple networks connecting various computing environments with each other, and we have entirely skipped more progressive concepts such as Web3 here.

Important note

IT as we know it today is undergoing a heavy transformation, both in the visible and invisible spectrum. With every step we take, we increase its complexity while facing demographic pressure and a shortage of professionals. Sooner or later, most companies will be required to have their own platform. If they don’t, they will buy it as a service.

Focusing on developer experience

It is not sustainable to hope that every developer will be able to cover the extremely wide landscape of tools and technologies without burning out within a few years. Therefore, the quality of user experience is pivotal in determining the adoption and success of a platform. A well-designed platform is intuitive, easy to navigate, and aligned with the developers’ expectations and workflows. Enhancing the experience involves streamlining interactions, minimizing friction points, and providing a visually, technically, and functionally pleasing environment. This not only improves user satisfaction but also boosts productivity and engagement. The question is how to achieve this.

When designing the platform, we must consider that every developer might have a different preference. It starts with the question of how users interact with the platform. Developers might ask questions such as the following: Do we need to set up a portal? Is pushing code to a Git service enough? Can I interact with the platform via a CLI? It can be hard to tell, but successful platforms provide all of those interactions. Starting with an API-centric approach keeps every one of these paths open at the same time. A strong API is the core of a good platform. In reality, most platforms still provide multiple different interfaces. Tools that unify these interfaces are developing rapidly and will help overcome such challenges; on a greenfield platform, such a tool can be placed directly into the core.
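To make the API-centric idea more tangible, here is a minimal sketch in Python. It is not taken from any real platform: PlatformAPI, create_environment, and the three adapter functions are hypothetical names chosen for illustration. The point is only that the CLI, the Git-based flow, and the portal are thin translations onto one and the same API call.

```python
from dataclasses import dataclass, field


@dataclass
class PlatformAPI:
    """Hypothetical platform core: the only place where state changes."""

    environments: dict[str, dict] = field(default_factory=dict)

    def create_environment(self, team: str, name: str) -> dict:
        key = f"{team}/{name}"
        if key in self.environments:
            raise ValueError(f"environment {key} already exists")
        env = {"team": team, "name": name, "status": "provisioning"}
        self.environments[key] = env
        return env


def cli_adapter(api: PlatformAPI, argv: list[str]) -> dict:
    """Translate CLI arguments such as ['create-env', 'payments', 'staging']."""
    command, team, name = argv
    if command != "create-env":
        raise ValueError(f"unknown command: {command}")
    return api.create_environment(team, name)


def git_adapter(api: PlatformAPI, manifest: dict) -> dict:
    """Translate an already parsed manifest that was pushed to a Git repository."""
    return api.create_environment(manifest["team"], manifest["name"])


def portal_adapter(api: PlatformAPI, form: dict) -> dict:
    """Translate the fields of a form submitted in a developer portal."""
    return api.create_environment(form["team"], form["environment"])
```

Because validation and state live only in the API, a rule such as rejecting duplicate environments behaves identically no matter which interface a developer prefers, and a further interface can be added later without touching the core.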

An example of such a core is Kratix. The Apache 2.0-licensed open source platform describes itself as a “... platform framework for building composable IDPs.” In the following figure, you can see how Kratix positions itself between all the common tools we use today and provides one entry point.

Figure 1.6: Kratix overview as a central integration component

Kratix achieves this through the concept of Promises. A Promise is technically a YAML document that defines a contract between the platform and its users. Every team has to go through a complex onboarding process, not because of the platform itself but because of other dependencies such as CI/CD, Git repositories, and linking everything together. With Kratix Promises, you can encapsulate all those steps in a single Promise or combine multiple Promises into one.
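The idea of encapsulating onboarding steps and combining them can be sketched in a few lines of Python. This is a conceptual model only: it does not use the Kratix API or its YAML format, and every name in it (Promise, fulfil, combine, and the two step functions) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable

# A step takes the request so far and returns it enriched with what it provisioned.
Step = Callable[[dict], dict]


@dataclass(frozen=True)
class Promise:
    """Hypothetical stand-in for a contract: a name plus the steps it encapsulates."""

    name: str
    steps: tuple[Step, ...]

    def fulfil(self, request: dict) -> dict:
        result = dict(request)
        for step in self.steps:
            result = step(result)
        return result


def combine(name: str, *promises: Promise) -> Promise:
    """Build one Promise that runs the steps of several others in order."""
    return Promise(name, tuple(s for p in promises for s in p.steps))


def create_git_repository(request: dict) -> dict:
    # Illustrative only: a real step would call the Git service.
    return {**request, "repo": f"git/{request['team']}/{request['app']}"}


def configure_pipeline(request: dict) -> dict:
    # Illustrative only: a real step would call the CI/CD system.
    return {**request, "pipeline": f"ci/{request['app']}"}


git_promise = Promise("git-repository", (create_git_repository,))
pipeline_promise = Promise("ci-pipeline", (configure_pipeline,))
onboarding = combine("team-onboarding", git_promise, pipeline_promise)
```

A team would then make a single request, for example onboarding.fulfil({"team": "payments", "app": "checkout"}), instead of walking through each dependency by hand.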

Kratix simplifies the platform foundation for the developer experience, yet something is missing: the other side of the coin is a developer portal. Backstage is an example of an open source, Apache 2.0-licensed solution developed by Spotify. Kratix and Backstage work well together and integrate seamlessly. Backstage is a framework for creating GUIs declaratively, with the aim of unifying infrastructure tooling, services, and documentation to produce a fantastic developer experience. Backstage comes with three core features: the service definition, the Backstage service catalog, and its plugin system, through which you can enable further features such as docs.

Figure 1.7: Backstage’s three core features

At this point, we have seen the challenges that need to be solved, and we have taken a sneak peek into the solution space. That should give us a feeling of the current possibilities before we dive into details throughout the next chapters.

Attributes of platforms

A platform