Front Matter

Praise for this book

Managing Digital is a perfect fit for my Management Information Systems class to introduce students to the fast-paced world of IT Infrastructure that they will be dealing with shortly upon graduation. This book uses multiple perspectives (Founder, Team Leader, VP, C-level executive) to demonstrate to the student not only how a business grows, but how they need to continually grow their skill set. The use of hands-on exercises encouraged by the format of this book complements my teaching style that allows students to learn by doing, failing, and doing again. An additional benefit is that this book begins with a focus on the startup mentality which I will use in my Business Innovation class.

Prof. Pat Paulson, Winona State University

Charles has produced the ultimate roadmap for the 21st century organization. A concise and powerful tool, this book will help anyone assess where they are on their journey today, where they may have gone wrong in the past, and where they should be driving towards today!

Ben Rockwood, Director IT & Operations, Chef

Investments in “digital” are critical for organizations and the economy as a whole. Delivering digital effectively benefits both individuals and communities. This is the time to re-visit and integrate industry guidance and reach a consensus on how digital and IT professionals can best approach their responsibilities. We need deliverance from the dark ages of IT. "Managing Digital" is comprehensive, well written, and enlightening. Charles T. Betz has nailed it once again.

Mark Smalley, The IT Paradigmologist

Managing Digital is a large and significant contribution to computing and IT education. No longer will graduates require on-the-job training after graduation to be current in the workforce. This book merges the requirements of academia with the needs of the modern digital business. Discussions of Lean, DevOps, the domain of the digital business, and more, are all active topics in the marketplace and now they are part of an educational curriculum as well.

Chris Little, Consultant

The Open Group Press

About The Open Group Press

The Open Group Press is an imprint of The Open Group for advancing knowledge of information technology by publishing works from individual authors within The Open Group membership that are relevant to advancing The Open Group mission of Boundaryless Information Flow™. The key focus of The Open Group Press is to publish high-quality monographs, as well as introductory technology books intended for the general public, and act as a complement to The Open Group Standards, Guides, and White Papers. The views and opinions expressed in this book are those of the author, and do not necessarily reflect the consensus position of The Open Group members or staff.

About The Open Group

The Open Group is a global consortium that enables the achievement of business objectives through technology standards. Our diverse membership of more than 550 organizations includes customers, systems and solutions suppliers, tool vendors, integrators, academics, and consultants across multiple industries.

The Open Group Vision

Boundaryless Information Flow achieved through global interoperability in a secure, reliable, and timely manner

The Open Group Mission

The mission of The Open Group is to drive the creation of Boundaryless Information Flow achieved by:

  • Working with customers to capture, understand, and address current and emerging requirements, establish policies, and share best practices

  • Working with suppliers, consortia, and standards bodies to develop consensus and facilitate interoperability, to evolve and integrate specifications and open source technologies

  • Developing and operating the industry’s premier certification service and encouraging procurement of certified products

Further information on The Open Group is available at

Managing Digital: Concepts and Practices

Published by The Open Group Press

Copyright © 2018 by The Open Group


This book is made available under a Creative Commons Attribution-NonCommercial license with the addition that commercial and academic licenses are available on request.

Creative Commons License Managing Digital: Concepts and Practices by The Open Group is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Licensing terms for Commercial and Educational (for qualified academic institutions) are available upon request from The Open Group.

This document may contain other proprietary notices and copyright information.

Nothing contained herein shall be construed as conferring by implication, estoppel, or otherwise any license or right under any patent or trademark of The Open Group or any third party. Except as expressly provided above, nothing contained herein shall be construed as conferring any license or right under any copyright of The Open Group.

Note that any product, process, or technology in this document may be the subject of other intellectual property rights reserved by The Open Group, and may not be licensed hereunder.

This document is provided "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. Some jurisdictions do not allow the exclusion of implied warranties, so the above exclusion may not apply to you.

Any publication of The Open Group may include technical inaccuracies or typographical errors. Changes may be periodically made to these publications; these changes will be incorporated in new editions of these publications. The Open Group may make improvements and/or changes in the products and/or the programs described in these publications at any time without notice.

Should any viewer of this document respond with information including feedback data, such as questions, comments, suggestions, or the like regarding the content of this document, such information shall be deemed to be non-confidential and The Open Group shall have no obligation of any kind with respect to such information and shall be free to reproduce, use, disclose, and distribute the information to others without limitation. Further, The Open Group shall be free to use any ideas, concepts, know-how, or techniques contained in such information for any purpose whatsoever including but not limited to developing, manufacturing, and marketing products incorporating such information.

If you did not obtain this copy through The Open Group, it may not be the latest version. For your convenience, the latest version of this publication may be downloaded at TOG Library.

ArchiMate®, DirecNet®, Making Standards Work®, OpenPegasus®, Platform 3.0®, The Open Group®, TOGAF®, UNIX®, UNIXWARE®, X/Open®, and the Open Brand X® logo are registered trademarks and Boundaryless Information Flow™, Build with Integrity Buy with Confidence™, Dependability Through Assuredness™, EMMM™, FACE™, the FACE™ logo, IT4IT™, the IT4IT™ logo, O-DEF™, O-PAS™, Open FAIR™, Open Platform 3.0™, Open Process Automation™, Open Trusted Technology Provider™, SOSA™, the Open O™ logo, and The Open Group Certification logo (Open O and check™) are trademarks of The Open Group. All other brands, company, and product names are used for identification purposes only and may be trademarks that are the sole property of their respective owners.

DRAFT: Built with asciidoctor, version

Backend: html5

Version 1.0 Build date: 2018-01-29 17:22:43 PST

Epigraph citations: [92], [1], [280]


…​ human beings are inherently storytellers who have a natural capacity to recognize the coherence and fidelity of stories they tell and experience …​ we experience and comprehend life as a series of ongoing narratives, as conflicts, characters, beginnings, middles, and ends.

Walter R. Fisher

Always design a thing by considering it in its next larger context – a chair in a room, a room in a house, a house in an environment, an environment in a city plan.

Eliel Saarinen

Scaling simply refers, in its most elemental form, to how a system responds when its size changes. What happens to a city or a company if its size is doubled?

Geoffrey West


To all students, past, present, and future


The label "Digital" is applied to a lot of management and technology guidance these days, and risks becoming just another in a long line of technology buzzwords. Isn’t "Digital" just one more step in the long continuum of business exploiting emerging technology? Or, have we crossed some "tipping point" causing a fundamental shift in how we manage such transformation?

My answer would be that there is something different happening this time: digital delivery is becoming core, not context. The convergence of abundant computing resources and improving software development management, combined with a change in market focus from the supplier to the customer, is changing the way we view Enterprise Architecture and IT management, and is revealing the need to develop a digital workforce. The defining characteristics of a digital enterprise are becoming clear:

  • Products or services that are either delivered fully digitally (e.g., digital media or online banking), or where physical products and services are obtained by the customer by digital means (e.g., online car sharing services)

  • A "digital-first" culture, where the business models, plans, architectures, and implementation strategies start from an assumption of digital delivery

  • A workforce that is digitally savvy enough to execute a digital-first approach

About This Book

This book, "Managing Digital: Concepts and Practices", is intended to guide a practitioner through the journey of building a digital-first viewpoint and the skills needed to thrive in the digital-first world. As such, this book is a bit of an experiment for The Open Group; it isn’t structured as a traditional standard or guide. Instead, it is structured to show the key issues and skills needed at each stage of the digital journey, starting with the basics of a small digital project, eventually building to the concerns of a large enterprise. So, feel free to digest this book in stages — the section Introduction for the student is a good guide.

Finally, the electronic publication of this book will be an experiment for The Open Group in another way, in that it is our intent for this book to be a living document updated through crowdsourced content and refreshed periodically in order to keep pace with the rapid evolution of digital business. In parallel with the publication of this book, The Open Group Digital Practitioners Work Group will be developing standards and best practices for Digital Practitioners; once those normative standards are published, it is expected that this book will be revised to reflect those.

So, please let us know what you think, and please join our community as a contributor.

David Lounsbury
Chief Technical Officer
The Open Group

Preface by original author

I wrote my first book, Architecture and Patterns for IT Service Management, Resource Planning, and Governance: Making Shoes for the Cobbler’s Children in 2006, with a second edition in 2011. I presented the second edition at the national SEI Saturn conference in Minneapolis in 2013, where I was approached by Dr. Bhabani Misra, the head of the Graduate Programs in Software at the University of St. Thomas in St. Paul, Minnesota. Dr. Misra asked me to teach a class called "IT Infrastructure Management," later "IT Delivery," which was to cover not just technical topics but also process and governance.

The course (which has run every semester since January 2013) has been developed during an extraordinary period for IT and digital management. Even in 2013, the trend towards a new style of IT delivery, based on Agile and DevOps practices, was notable and accelerating. At this writing, these approaches seem to have “crossed the chasm” in the words of Geoffrey Moore, and are becoming the dominant models for delivering IT value. As this book describes, there are good reasons for this historical shift, and yet its speed and reach are still disorienting.

For three semesters I assigned my first book (Architecture and Patterns for IT Service Management, Resource Planning, and Governance) as a required text for the class. However, I did not write it as a textbook, and its limitations became clear. While I gave considerable attention to Lean and Agile in writing the book, it has a strongly architectural approach, coming at the IT management problem as a series of views on a model. I do not recommend this as a pedagogical approach for a survey class. It also had a thoroughly enterprise perspective, and I began to question whether this was ideal for new students. Further thought led to the idea of the emergence model (detailed in the Introduction).

I proposed the idea of a third edition to my publisher, one that would pivot the existing material towards something more useful in class. They agreed to this and I started the rewrite. However, by the time I was halfway done with the first draft, I had a completely new book. The previous work was analytic and technical, while the new book was more conversational, attempting to engage students from a wide range of backgrounds.

A number of factors converged:

  • My view that the “medium is the message,” which extends to the choice of authoring approach, toolchain, and publisher

  • A desire to freely share at least a rough version of the book, both for marketing purposes and in the interests of giving back to the global IT community

  • A desire to be able to rapidly update the book with as little friction as possible

  • A practical realization that the book might get more uptake globally if it were available, at least in some form, as open-sourced intellectual property

  • The fact that I had already started to publish my labs on GitHub and had, in fact, developed a workable continuous delivery (“DevOps”) toolchain based on virtualization for classroom use.

Ultimately, the idea of starting my own publishing company, and managing my own product, appeared both desirable and practical. So I decided to self-publish, via LeanPub. Further developments, including my ongoing participation in the IT4IT™ Reference Architecture, a standard of The Open Group, led me to sign the work over to The Open Group, where it will in part serve as a basis for the new Digital Practitioner Body of Knowledge.

Many assisted with and/or contributed to this work before its transition to The Open Group:

  • Stephen Fralippolippi

  • Roger K. Williams

  • Jason Baker

  • Mark Kennaley

  • Glen Alleman

  • Jeff Sussna

  • Nicole Forsgren

  • Richard Barton

  • Evan Leybourn

  • Chris Little

  • Jabe Bloom

  • Lorin Hochstein

  • Gene Kim

  • Murray Cantor

  • Rob England

  • Firasat Khan

  • Mary Mosman

  • David Bahn

  • Amos Olagunju

  • Svetlana Gluhova

  • Mary Lebens

  • Justin Opatrny

  • Grant Spencer

  • Halbana Tarmizi

  • Will Goddard

  • Terry Brown

  • Francisco Piniero

  • Pat Paulson and his students

  • Majid Iqbal

  • Mark Smalley

  • Ben Rockwood

  • Mark Mzyk

  • Michael Fulton

  • Sriram Narayam

  • John Spangler

  • Tom Limoncelli

  • Kate Campise

  • Dmitry and Alina Kirsanov

Special thanks to Dr. Bhabani Misra for asking me to teach at the University of St. Thomas and providing direction at key points, including challenging me to add a practical component to the course, and to Dave Lounsbury and The Open Group for seeing the merit in this work.

Charles Betz, January 2018

Introduction for instructors and trainers

Welcome to Managing Digital: Concepts and Practices. So, what exactly is this book?

  • It is the first general, survey-level text on IT management with a specific Agile, Lean IT, and DevOps orientation

  • It has a unique and innovative learning progression based on the concept of organizational evolution and scaling

  • Because it is written with continuous integration and print-on-demand techniques, it can be continually updated to reflect current industry trends

The book is intended for both academic and industry training purposes. There has been too much of a gap between academic theory and the day-to-day practices of managing digital products. Industry guidance has over the years become fragmented into many overlapping and sometimes conflicting bodies of knowledge, frameworks, and industry standards. The emergence of Agile and DevOps as dominant delivery forms has thrown this already fractured ecosystem of industry guidance into chaos. Organizations and individuals with longstanding investments in guidance such as ITIL (aka the IT Infrastructure Library) [1] and the Project Management Body of Knowledge (aka PMBOK) are re-assessing these commitments. This book seeks to provide guidance for both new entrants into the digital workforce and experienced practitioners seeking to update their understanding of how all the various themes and components of IT management fit together in the new world.

Digital investments are critical for modern organizations and the economy as a whole. Participating in their delivery (i.e., working to create and manage them for value) can provide prosperity for both individuals and communities. Now is the time to re-assess and synthesize the bodies of knowledge and developing industry consensus on how digital and IT professionals can and should approach their responsibilities.

The IT industry and the rise of digital

Now agile methodologies – which involve new values, principles, practices, and benefits and are a radical alternative to command-and-control-style management – are spreading across a broad range of industries and functions and even into the C-suite. [224]
— Darrell Rigby et al.
Harvard Business Review

Consider the following two industry reports.

In September 2015, Minneapolis-based Target Corporation laid off 275 workers with IT skillsets such as business analysis and project management, while simultaneously hiring workers with newer “Agile” skills. As quoted by a local news site, Target stated:

“As a part of our transition to an Agile technology development and support model, we conducted a comprehensive review of our current structure and capabilities … we are eliminating approximately 275 positions and closing an additional 35 open positions. The majority of the impact was across our technology teams and was primarily focused on areas such as business analysis and project management.” [149]

Jim Fowler, Chief Information Officer at General Electric, says:

“When I am in business meetings, I hear people talk about digital as a function or a role. It is not. Digital is a capability that needs to exist in every job. Twenty years ago, we broke e-commerce out into its own organization, and today e-commerce is just a part of the way we work. That’s where digital and IT are headed; IT will no longer be a distinct function, it will just be the way we work. … [W]e’ve moved to a flatter organizational model with “teams of teams” who are focused on outcomes. These are co-located groups of people who own a small, minimal viable product deliverable that they can produce in 90 days. The team focuses on one piece of work that they will own through its complete lifecycle … in [the “back-office”] model, the CIO controls infrastructure, the network, storage, and makes the PCs run. The CIOs who choose to play that role will not be relevant for long.” [122]

Modern Management Information Systems (MIS) courses and textbooks, especially at the undergraduate, survey level, seek to orient all students (whether IT/MIS specialists or not) to the role and function of information systems and their possibilities and value in the modern enterprise. This book, by contrast, intends to prepare the student for a career as a digital professional, in those industries that offer digital products per se as well as industries that rely on digital technology instrumentally for delivering all kinds of products. A central theme of the book is that IT, considered as a component, represents an increasing proportion of all industrial products (both consumer and business-facing). This trend towards IT’s increase is known as Digital Transformation.

Current MIS survey texts have some common characteristics:

  • They tend to focus on the largest organizations and their applications of computing. This can lead to puzzling topic choices; for example, in one current MIS text I reviewed, one of the first sections is dedicated to the problem of enterprise IT asset management, a narrow topic for the earlier sections of a survey course and increasingly irrelevant in the age of the cloud.

  • Because their coverage area is termed "information systems," many start with extensive and detailed coverage of database management at an enterprise scale, often focused on the classical relational database, which is now just one choice among many in a world of NoSQL and microservices.

  • They do not (and this is a primary failing) cover Agile and its associated digital ecosystems well, if at all. Brief mentions of Agile may appear in sections on project management, but in general there is a lack of awareness of the essential role of Agile and related methods in accelerating digital transformation.

  • Their coverage of cloud infrastructure can also be limited, even with new editions coming out every year. Topics like infrastructure as code go unaddressed.

  • Finally, current texts often uncritically accept and cite “best practice” IT frameworks such as CMMI, ITIL, PMBOK, and CObIT. New digital organizations do not, in general, use such guidance, and there is much controversy in the industry as to the value and future of these frameworks. This book strives to provide a clear, detailed, and well-supported overview of these issues.

IT, or the digital function, has had a history of being under-managed and poorly understood relative to peer functions in the enterprise. It struggles with a reputation for expensive inflexibility and Dilbert-esque dysfunction, and has generated a fair amount of mistrust among its business sponsors. The DevOps and Agile movements promise transformation but are encountering an entrenched legacy of:

  • Enterprise Architecture

  • Program and project management

  • Business process management

  • IT service management practices

  • IT governance concerns

Understanding and engaging with the challenges of this legacy are ongoing themes throughout this introductory text. Some of the more radical voices in the Agile movement sometimes give the impression that the legacy can be simply swept away. Mike Burrows notes that, in terms of the systems thinking at the core of Agile philosophy, such a move might be ill-advised:

“Some will tell you that when things are this bad, you throw it all away and start again. It’s ironic: The same people who would champion incremental and evolutionary approaches to product development seem only too eager to recommend disruptive and revolutionary changes in people-based systems – in which the outcomes are so much less certain” [45 p. 827].

IT management at scale within an organization is a complex system. The IT workforce, together with its collective experience and its ongoing development (through education and training), is another complex system, orders of magnitude larger. Complex systems do not respond well to dramatic perturbations. They are best changed incrementally, with careful monitoring of the consequences of each small change. (This is part of the systems theory foundation underlying the Agile movement.) This is why the book covers topics such as:

  • Investment, sourcing, and people

  • Project and process management

  • Governance, risk, security, and compliance

  • Enterprise information management

  • Enterprise Architecture and portfolio management

While these practices, and their associated approaches and policies, have caused friction with digital and Agile practitioners, they all have their reasons for existing. The goal of this book is to understand their interaction with the new digital approaches, but in order to do this we must first understand them on their own terms. It does no good to develop a critique based on misconceptions or exaggerations about what (for example) process management or governance is all about. Instead, we try to break these large and sometimes controversial topics into smaller, more specific topics:

  • Work and effort

  • Ordering of tasks

  • Task dependencies

  • Coordination

  • Investment

  • Cost of delay

  • Planned versus unplanned work

  • Estimation versus commitment

  • Value stream versus skill alignment

  • Repeatability

  • Defined versus empirical process control

  • Synchronization and cadence

  • Resource demand

  • Shared mental models

  • Technical debt

  • Risk, etc.

By examining IT management in these more clinical terms, we can develop a responsible critique of current industry best practices that will benefit students as they embark on their careers.

A key choice in the book’s evolution was to NOT include dedicated chapters on “Project Management” and “Process Management.” Instead, more general chapter titles of “Coordination” and “Investment and Planning” were chosen. Rationale for the decision is given in those chapters and in Part III generally. Similarly, there is no one section covering “IT service management” per se; its significant concerns are seen throughout Chapters 4 through 10.

A process of emergence

Joseph Campbell popularized the notion of an archetypal journey that recurs in the mythologies and religions of cultures around the world. From Moses and the burning bush to Luke Skywalker meeting Obi-Wan Kenobi, the journey always begins with a hero who hears a calling to a quest …

The hero’s journey is an apt way to think of startups. All new companies and new products begin with an almost mythological vision – a hope of what could be, with a goal few others can see …

Most entrepreneurs feel their journey is unique. Yet what Campbell perceived about the mythological hero’s journey is true of startups as well: However dissimilar the stories may be in detail, their outline is always the same. [31]
— Steve Blank
The Four Steps to the Epiphany

One of the most important and distinguishing features of this book is its emergence model. In keeping with the entrepreneurial spirit of works like Ries’ The Lean Startup, the book adopts a progressive, evolutionary approach. The student’s journey through it reflects a process of emergence. Such processes are often associated with founding and scaling a startup. There are many helpful books on this topic, such as the following:

  • Nail It Then Scale It by Furr and Ahlstrom [102]

  • Scaling Up by Harnish [117]

  • Startup CEO by Blumberg [33]

  • The Lean Startup by Ries [223]

  • Hello, Startup by Brikman [35]

The emergence model and overall book structure are discussed in depth in the main introduction. Here, for the instructor, are some notes on the thought process.

A central problem in conceiving the book was the question of overall learning progression, or narrative. As noted in the Preface, it was first developed to support a required semester-long survey class on IT management at the University of St. Thomas in St. Paul, Minnesota, in the largest software engineering program in the country.

Currently, two primary narratives or learning progressions are used in teaching computing:

  • The “stack”

  • The “lifecycle”

The stack is how the most rigorous topics are taught. Algebra is the foundation for trigonometry, trigonometry is required for calculus, for example. Logic is needed for discrete math, required for automata and compilers, and so forth. The stack is also how technology is described: physical, logical, and conceptual layers, for example, or layered abstractions in networking protocols.

The systems lifecycle, on the other hand, is how we tend to structure industry guidance. We plan and design, we build, we run. Guidance such as CObIT and ITIL show lifecycle influences, as do software engineering programs in colleges.

Figure 1. Systems evolve and scale iteratively (image: Kniberg vehicles)

However, both the stack and the lifecycle have limitations:

  • The stack can fall into reductionism, or what venture capitalist Anshu Sharma calls the “stack fallacy,” the “mistaken belief that it is trivial to build the layer above yours” [245]. A different form of the stack fallacy is seen when practitioners assume that systems can easily be decomposed through layers (business, data, application, technology).

  • The lifecycle narrative is far too prone to promoting waterfall thinking, anathema to the current Agile and Lean Product Development approaches redefining the digital profession.

Instead, this book’s emergence narrative draws on systems theory, in particular John Gall’s idea that “A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system” [103]. Henrik Kniberg created a compelling visual image showing how systems scale: through ongoing, iterative re-design and elaboration of fundamental characteristics (see Figure 1, Systems evolve and scale iteratively [2]).

What if we treated the student’s understanding as such a systems scaling problem? What would be the simplest possible thing that could work? How would we iteratively evolve their understanding, based on practical topics? Scaling seems to be orthogonal to the other narratives (see Figure 2, Three narrative dimensions).

As we’ll cover in the main introduction, reading books on organizational scaling inspired the idea that growth does not happen smoothly; instead organizations tend to cluster at certain scales and struggle to grow to the next scale. Hence the overall structure of the book:

  • Founder

  • Team

  • Team of Teams

  • Enterprise

Figure 2. Three narrative dimensions (image: narratives cube)

A key focus of the book is explaining what practices are formalized at which level of growth. The thought experiment is, “what would I turn my attention to next as my IT-based concerns scale up?” For example, the book currently proposes that work management (implying rudimentary workflow, e.g., Kanban) correctly comes before formalized project and/or process management, which in turn tend to emerge before enterprise governance practices (e.g., formalized risk management).

Note that this would be a testable and falsifiable hypothesis if empirical research were done to inventory and characterize organizational scaling patterns. If we found, for example, that a majority of organizations formalize governance, risk, security, and compliance practices before formalizing product management, that would indicate that those chapters should be re-ordered. In my experience, small and medium businesses may have formal product management while Governance, Risk, and Compliance (GRC) remains tacit, not formalized. This does not mean that GRC is not a concern for them, but they have not yet instituted formal policy management, internal audit, or controls.
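As a minimal sketch of how such a test might be framed, assume survey data recording the year in which each organization first formalized a given practice. The records, field names, and function names below are invented for illustration and are not drawn from any actual study.

```python
from typing import Dict, List

# Hypothetical survey records: the year each organization first formalized
# a practice. All values are invented for illustration only.
SURVEY: List[Dict[str, int]] = [
    {"product_management": 2012, "grc": 2016},
    {"product_management": 2014, "grc": 2013},
    {"product_management": 2010, "grc": 2015},
    {"product_management": 2015, "grc": 2018},
]


def share_formalizing_first(records: List[Dict[str, int]],
                            earlier: str, later: str) -> float:
    """Fraction of organizations that formalized `earlier` strictly before `later`."""
    if not records:
        raise ValueError("no survey records")
    hits = sum(1 for r in records if r[earlier] < r[later])
    return hits / len(records)


def ordering_is_contradicted(records: List[Dict[str, int]],
                             earlier: str, later: str) -> bool:
    """The proposed order is contradicted if a majority formalized `later` first."""
    return share_formalizing_first(records, later, earlier) > 0.5


share = share_formalizing_first(SURVEY, "product_management", "grc")
print(f"Product management formalized first in {share:.0%} of organizations")
print("Re-order chapters?",
      ordering_is_contradicted(SURVEY, "product_management", "grc"))
```

A real study would need a defensible definition of "formalized" and a sample large enough to support the conclusion; the simple majority threshold here just mirrors the wording of the hypothesis.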

The presence of product management at an early stage in the book (Chapter 4) is intended to provoke thought and debate. Product management is poorly addressed in most current college computing curricula as well as the de facto industry standards (e.g., the TOGAF® ADM, PMBOK, and ITIL). Yet formalizing it is one of the earliest concerns for a startup, and the imperatives of the product vision drive all that comes after. Evidence to this effect is seen (as of 2015) at the University of California at Berkeley I-School, which has replaced its Project Management course with Lean/Agile Product Management, currently taught by Jez Humble, author of Continuous Delivery and Lean Enterprise and co-author of The DevOps Handbook.

This book, however, is not a complete dismissal of older models of IT delivery. Wherever possible, new approaches are presented relative to what has gone before. The specifics of “what’s different” are identified, in the interest of de-mystifying what can be fraught and quasi-religious topics. (Why is a Scrum standup or a Kanban board effective, in terms of human factors?)

The emergence model can also be understood as an individual’s progression within a larger enterprise. Even starting from day one at a Fortune 100 corporation, I believe a person’s understanding still progresses from individual, to team, to team of teams, to enterprise. Of course, the person may cease evolving their understanding at any of these stages, with corresponding implications for their career.

This book does not cover specific technologies in any depth. Many examples are used, but they are carefully framed to not require previous expertise. This is about broader, longer lifecycle trends.

There is benefit to restricting the chapters to 12, as a typical semester runs 14 weeks, and the book then fits well, with one chapter per class and allowing for an introductory session and final exam. Of course, a two-semester series, with two weeks per chapter, would also work well. Each half of the book is also a logical unit. The chapters have been re-ordered and refactored many times, and based on further research may evolve further.


With three chapters in each section, the book can be covered in one intense semester at a chapter a week, although expanding it to a two-semester treatment would allow for more in-depth coverage and increased lab exposure. This required new thinking. How could students learn IT management at scale in a lab setting? A hands-on component is essential, as IT management discussions can be abstract and meaningless to many students. (“Incidents are different from problems!”)

Ten years ago, the best that would have been possible would be paper case studies, perhaps augmented with spreadsheets. But new options are now available. The power of modern computers (even lightweight laptops) coupled with the widespread availability of open-source software makes it possible to expose students to industrial computing in a meaningful, experiential way. There is great utility in the use of lightweight, pre-configured virtualization technologies such as Vagrant, VirtualBox, and Docker.

The initial course used a central server, but even that is not necessary. The class can be taught with zero computing budget, assuming that each team of students has access to a modern laptop (8 gigabytes of RAM and a 250-gigabyte drive are recommended) and a fast Internet connection. Initial lab versions used free and open-source versions of Chef, Jenkins, Nagios, JUnit, Ant, and other tools, evolving to Node.js, microservices, and Docker with orchestration. See the dm-academy resources on GitHub for the current status.

Some may question the inclusion of command-line experience, but without some common technical platform, it is hard to provide a meaningful, hands-on experience in the first half of the course. Digital professionals now require hands-on technology skills, and the barriers to developing them are much lower than in years past. The initial course assumed that the students were at least willing to learn computing techniques, with no prerequisites beyond that. Not even a programming language is required; the Java currently used as a sample is minimal.

Truly beginning students will have to work at the Linux tutorials, but all they need to master is basic command line navigation; this is possible with a diverse student body, some with no previous computing experience. The labs for the second half of the course mostly use games, experiential paper-based classroom exercises, GUI-based software, databases, and office productivity tools.

The emergence model is intended as the learning progression for either traditional semester-based education, or industry training. Labs tie into the emergence model and encourage the students in hands-on exploration of fundamentals, as if they were starting their own business or initiative. As they progress, they remain grounded in the basics of applied technology. Even the most highly scaled topics, such as Enterprise Architecture and portfolio management, become more real when the students are reminded that the systems in question are simply larger and more numerous versions of the examples they see in the lab.

Introduction for the student

This is a survey text, intended for the advanced undergraduate or graduate student interested in the general field of applied Information Technology (IT) management. It is also intended for the mid-career professional seeking to update their understanding of IT management’s evolution, especially in light of the impact of Agile, Lean, and DevOps.

The book is grounded in basic computing fundamentals but does not require any particular technical skills to understand. You do not need to have taken any courses in networking, security, or specific programming languages to understand this book. However, you occasionally will be presented with light material on such topics, including fragments of programming languages and pseudocode, and you will need to be willing to invest the time and effort to understand.

This book makes frequent reference to digital startups: early-stage companies bringing new products to market that are primarily delivered as some form of computer-based service. Whether or not you intend to pursue such endeavors, the startup journey is a powerful frame for your learning. Large IT organizations in enterprises sometimes gain a reputation for losing sight of business value. IT seems to be acquired and operated for its own sake. Statements like “we need to align IT with the business!” are too often heard.

A digital startup exposes with great clarity the linkage between IT and “the business.” The success or failure of the company itself depends on the adept and responsive creation and deployment of the software-based systems. Market revenues arrive, or do not, based on digital product strategy and the priorities chosen. Features the market doesn’t need? You won’t have the money to stay in business. Great features, but your product is unstable and unreliable? Your customers will go to the competition.

The lessons that digital entrepreneurs have learned through this trial by fire shed great light on IT’s value to the business. Thinking about a startup allows us to consider the most fundamental principles as a sort of microcosm, a small laboratory model of the same problems that the largest enterprises face.

Verne Harnish, in the book Scaling Up [117 pp. 25-26], describes how companies tend to cluster at certain levels of scale. (See Organizations cluster at certain sizes [3].) Of 28 million U.S. firms, the vast majority (96%) never grow beyond a founder; a small percentage emerge as a viable team of 8-12, and even smaller numbers make it to the stable plateaus of 40-70 and 350-500. The “scaling crisis” is the challenge of moving from one major level to the next. (Harnish uses the more poetic term “Valley of Death.”) This scaling model, and the needs that emerge as companies grow through these different stages, is the basis for this book’s learning progression.

Figure 3. Organizations cluster at certain sizes
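
As a rough worked example, the two figures quoted from Harnish can be turned into absolute counts. This is only a sketch that takes the 28 million and 96% figures at face value; both are approximations, and the variable names are illustrative.

```python
# Figures as quoted in the text from Harnish's Scaling Up (approximate).
TOTAL_US_FIRMS = 28_000_000
FOUNDER_ONLY_PERCENT = 96

# Integer arithmetic avoids floating-point rounding in the counts.
founder_only = TOTAL_US_FIRMS * FOUNDER_ONLY_PERCENT // 100
beyond_founder = TOTAL_US_FIRMS - founder_only

print(f"Never grow beyond a founder: {founder_only:,}")
print(f"Grow beyond a founder:       {beyond_founder:,}")
```

On these numbers, roughly 26.9 million firms stay at the founder stage, and all of the later plateaus (team, team of teams, enterprise) together account for about 1.1 million firms.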

However, this is not a textbook (or course) on entrepreneurship. It remains IT-centric. And, the book is also intended to be relevant to students entering directly into large, established enterprises. In fact, it prepares the student for working in all stages of growth because it progresses through these four contexts:

  • Individual (founder)

  • Team

  • Small company (team of teams)

  • Enterprise

Whether your journey begins in a startup or within a larger, established organization, you will (hopefully) become aware as you progress of a broadening context:

  • Other team members

  • Customers

  • Suppliers

  • Sponsors

  • Necessary non-IT capabilities (finance, legal, HR, sales, marketing, etc.).

  • Channel partners

  • Senior executives and funders

  • Auditors and regulators

Part of maturing in your career is understanding how all these relationships figure into your own overall system of value delivery. This will be a lifelong journey for the student; the author’s intent is to provide some useful tools.

This book’s structure

Figure 4. IT management evolutionary model (read bottom to top)

Figure 4 is a conceptual illustration of an IT management progression (read the figure bottom to top). Elaborating the outline into chapters, we have:

  1. Founder

    1. Digital value. Why do we need computers? What can they do for us? What are the essential aspects of computer systems: their sponsorship, use, construction, operation, and evolution?

    2. Digital infrastructure. We want to build something. We have to choose a platform first, with attention to fundamentals such as operating systems, storage, and networking. Understanding and choosing among cloud alternatives is immediately required, and managing the configuration of the new digital product is essential.

    3. IT applications delivery. Let’s start building something of use to someone. Software development has evolved from waterfall to Agile, and continuous delivery is now the de facto standard.

  2. Team

    1. Product management. What exactly is it that we are building? What is the process of discovering our customer’s needs and quickly testing how to meet them? How do we better define the product vision, and the way of working towards it, for a bigger team?

    2. Work management. How do we keep track of what we are doing and communicate our progress and needs at the simplest level?

    3. Operations management. How do we sustain this surprisingly fragile digital service in its ongoing delivery of value?

  3. Team of Teams

    1. Coordination. When we have more than one team, they need to coordinate, which we define as “the process of managing dependencies among activities.” There are many techniques to help us coordinate, including process management and Agile concepts. What is the future of process management as a delivery model?

    2. Investment and planning. We make investments in various products, programs, and/or projects, and we are now big enough that we have portfolios of them. How do we decide? How do we choose and work with our suppliers? How do we manage the finances of complex digital organizations? What is the future of project management as a delivery model?

    3. Organization and culture. We’re getting big. How do we deal with this? How are we structured, and why that way? How can we benefit from increasing maturity and specialization while still maintaining a responsive digital product? How do we hire great people and get the most out of them? What are the unwritten values and norms in our company and how can we improve them?

  4. Enterprise

    1. Governance, risk, security, and compliance. We need to cope with structural and external forces such as investors, directors, regulators, vendor partners, security adversaries, and auditors, to whom we are accountable or who are otherwise constraining our options. What are the motivations for governance? How do we understand and control risk? How are we assured that our strategy, tactics, and operations are reasonable, sound, and thorough? And how do we protect ourselves from malicious adversaries?

    2. Enterprise information management. We’ve been concerned with data, information, and knowledge since the earliest days of our journey. But at this scale, we have to formalize our approaches and understandings; without that, we will never capture the full value available with modern analytics and big data. Compliance issues are also compelling us to formalize here.

    3. Architecture and portfolio. We need to understand the big picture of interacting lifecycles, reduce technical debt and redundancy, accelerate development through establishing platforms, and obtain better economies of scale. We do so in part through applying techniques such as visualization, standardization, and portfolio management. However, some of the suggested practices can degrade our teams' performance.

  5. Appendices

    1. The major frameworks

    2. Project management

    3. Process management and modeling

    4. References

    5. Glossary

    6. Backlog

    7. Colophon

    8. Author biography

The boundary between the “Team” and the “Team of Teams” is a challenging area, and industry responses remain incomplete and evolving.

Emergence means formalization

The emergence model seeks to define a likely order in which concerns are formalized. Any concern may of course arise at any time; the startup founder certainly is concerned with security! Formalization means one or more of the following:

  • Dedicated resources

  • Dedicated organization

  • Defined policies and processes

  • Automated tooling
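
The "one or more of the following" definition can be sketched as a small data structure. This is a minimal illustration, not part of the book's model: the `Formalization` type, the `is_formalized` helper, and the two example organizations are hypothetical names introduced here.

```python
from enum import Flag, auto


class Formalization(Flag):
    """The four markers of formalization listed above (hypothetical encoding)."""
    NONE = 0
    DEDICATED_RESOURCES = auto()
    DEDICATED_ORGANIZATION = auto()
    DEFINED_POLICIES_AND_PROCESSES = auto()
    AUTOMATED_TOOLING = auto()


def is_formalized(markers: Formalization) -> bool:
    """A concern counts as formalized if at least one marker is present."""
    return bool(markers)


# Illustrative cases: security is a concern in both, but only one is formalized.
startup_security = Formalization.NONE  # tacit: a concern, but no markers yet
enterprise_security = (
    Formalization.DEDICATED_ORGANIZATION
    | Formalization.DEFINED_POLICIES_AND_PROCESSES
)

print(is_formalized(startup_security))     # False
print(is_formalized(enterprise_security))  # True
```

The point of the sketch is that a concern with no markers is still a concern; it is simply tacit rather than formalized.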

In general, startups avoid overly formalized process and project management. To the extent the concerns exist, they are tacit (understood or implied; suggested; implicit). Certainly, a small startup does not invest in an enterprise-class service desk tool supporting a full array of IT management processes or a full-blown project management office with its own vice president and associated portfolio automation. Simple work management, with a manual or automated Kanban board, is likely their choice for tracking and executing tasks.

But by the time they are a team of teams, specialization has emerged and more robust processes and tools are required. Finally, the more complex, enterprise-scale concerns at the end of the book are presented as part of a logical progression.

The danger, of course, is that the formalization effort may be driven by its own logic and start to lose track of the all-critical business context. By carefully examining these stages of maturation, and the industry responses to them, it is the author’s hope that the student will have effective tools to critically engage with the problem of scaling the digital organization.

Finally, the scaling model also emphasizes for the reader the critical importance of the high-performing, multi-skilled, collaborative team. Coordination and enterprise problems must be given their due, but too often the proposed solutions destroy the all-important team value. As stated elsewhere in this book, it is possible that there is no higher unit of value in the modern economy than the high-performing team. Maintaining the cohesion and value of this critical asset is presented as a clear priority throughout the subsequent chapters.

Some of you may be familiar with the idea of a Minimum Viable Product (MVP), minimum marketable release, or similar. In these terms, it is important to understand that each section of the book represents an MVP, but each chapter does not. IT value cannot be delivered without the components discussed in each of Chapters 1 through 3. In other words, the chapters of each section tend to be interdependent.

Assumptions about the reader

  • This book is written at the advanced undergraduate/graduate student level. It is also intended for mid-career and senior IT practitioners seeking to update their knowledge.

  • It is currently available only in English.

  • There is no assumption of deep IT experience, but it is assumed that the person interacts with computers in some capacity and has basic technical literacy. They should, for example, understand the concepts of operating system and application software. An introductory course on networking or programming, for example, should more than adequately prepare someone for this book.

  • A person completely unfamiliar with computing will need to supplement their reading as suggested throughout the text. There is a wealth of free and accurate information on IT fundamentals (e.g., computing, storage, networking, and programming), and this book seeks more to curate than replicate.

Part I — Founder

This is the introduction to Part I. In this section, we explore the fundamentals of IT delivery.


You are working in a startup, alone or with one or two partners. You are always in the same room, easily able to carry on a running conversation about your efforts and progress. You have no time or resources to spend on anything except keeping your new system alive and running.

Chapter 1: Digital Value

Chapter 1 introduces you to the fundamental concepts of IT value that serve as a basis for the rest of the course. Why do people want computing (IT) services? What are the general outlines of their structure? How do they come into being? How are they changed over time?

All of this is essential to understand for your scenario; you need to understand what computers can do and how they are generally used if you are going to create a product based on them.

This chapter also covers the basics of how you’ll approach building a product. It’s assumed you won’t develop an intricate, long-range plan but rather will be experimenting with various ideas and looking for fast feedback on their success or failure.

Chapter 2: Digital Infrastructure

In this chapter, you have a general idea for a product and are ready to start building it. But not so fast … you need to decide some fundamentals first. How will your new product run? What will you use to build it?

It’s not possible to begin construction until you decide on your tools. This chapter will provide you with an overview of computing infrastructure including cloud hosting and various approaches to system configuration.

This chapter also presents an overview of source control, as even your infrastructure depends on it in the new world of “infrastructure as code.”

Chapter 3: Application Delivery

Finally, you’re ready to start building something. While this is not a book on software development or programming languages, it’s important to understand some basics and at least see them in action.

This is also where we introduce the concept of “DevOps”; it’s not just about writing code but about the entire end-to-end system that gets the code you are writing from your workstation, into collaborative environments, and finally to a state where it can be accessed by end users. From source repository to build manager to package repository to production, we’ll cover a basic toolchain that will help you understand modern industrial practices.
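
The toolchain just described (source repository, build manager, package repository, production) can be pictured as a sequence of stages that each change passes through. The sketch below is purely illustrative: the stage names follow the text, but the `Change` class and the stage functions are hypothetical stand-ins, not the API of any real build or deployment tool.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A code change moving through the delivery toolchain (hypothetical)."""
    description: str
    history: list = field(default_factory=list)


# Each stage only records that the change passed through it; a real
# toolchain would compile, test, package, and deploy at these points.
def commit_to_source_repository(change: Change) -> Change:
    change.history.append("source repository")
    return change


def build_and_test(change: Change) -> Change:
    change.history.append("build manager")
    return change


def publish_package(change: Change) -> Change:
    change.history.append("package repository")
    return change


def deploy_to_production(change: Change) -> Change:
    change.history.append("production")
    return change


PIPELINE = [
    commit_to_source_repository,
    build_and_test,
    publish_package,
    deploy_to_production,
]


def deliver(change: Change) -> Change:
    """Run a change through every stage, in order."""
    for stage in PIPELINE:
        change = stage(change)
    return change


result = deliver(Change("fix typo on login page"))
print(" -> ".join(result.history))
```

Treating the pipeline as an ordered list makes the end-to-end view explicit: a change is not delivered until it has passed every stage, not merely when the code is written.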

This section’s lab approach

While this is not a book about any particular computing language or platform, we need to describe some technical fundamentals. We’ll do so in as neutral a manner as possible. However, this book’s accompanying labs are based on Ubuntu Linux and Git, the distributed version control system created by Linus Torvalds to facilitate Linux development.

Part I, like the other parts, needs to be understood as a unified whole. In reality, digital entrepreneurs struggle with the issues in all three chapters simultaneously.

1. Digital Value

1.1. Introduction

As noted at the outset, you are a small core of a startup. Your motivations are entrepreneurial; you want to create a successful business. You might be housed within a larger enterprise, but the thought experiment here is that you have substantial autonomy to order your efforts. You want to do something that has a unique digital component. Regardless of your business, you will need accounting and legal services at a minimum and, very quickly, payroll, HR, and so forth. Those things can (and should) be purchased as commodity services if you are a small entrepreneur (unless you are absolutely on the smallest of shoestring budgets and can work 100-hour weeks). Your unique value proposition will be expressed to some degree in unique IT software. While this software may be based on well-understood products, the configuration and logic you construct will be all your own. Because of this, you are now a producer (or soon to be) of IT services.

Before we can talk about building and managing IT-based (aka digital) products, we need to understand what IT is and why people want it. We’ll start this chapter by looking at an IT value experience that may seem very familiar. Then we’ll dig further into concepts like the IT stack and the IT service and how they change over time.

1.1.1. Chapter outline

  • An IT value experience

  • What is IT?

  • The IT service and the IT stack

  • The IT service

  • IT changing over time

  • The digital context

  • Conclusion

1.1.2. Learning objectives for this chapter

  • Explain “IT value” in everyday terms

  • Distinguish between IT service and IT system

  • Discuss how IT services change over time

  • Describe various ways of understanding the context in which digital systems are developed and digital value is delivered

1.2. What is digital value?

1.2.1. A digital value scenario

Figure 5. Dinner out tonight?

Consider the following scenario:

A woman (Dinner out tonight? [4]) is wondering if she can afford to dine out that evening. She uses her mobile device to access her banking information and determines that in fact she does have enough money to do so. She also uses her mobile device to make a reservation and contact some friends to join her. Finally, she uses social navigation software to avoid heavy traffic, arriving at the restaurant in time for an enjoyable evening with her friends.

IT pervaded this experience. The origins, layers, and complex connections of the distributed systems involved are awe-inspiring to consider.

Don’t worry about the technological terms for now.

This is an introductory text. You may see terms below that are unfamiliar (model-view-controller, IP, packet switching). If you are reading this online, you can follow the links, but it’s not required. As you progress in your career, you will always be encountering new terminology. Part of what you need to learn is when it’s important to dig into it and when you can let it pass for a time. You should be able to understand the gist presented below that these are complex systems based on a wide variety of technologies, some of them old, some new.

The screen on her cell phone represents information accessed and presented by a complex "architecture" of distributed systems, including her cell phone’s hardware and operating system, which run software written and deployed by the bank, which communicates with computers ("servers") in the bank’s data center. The communication with her bank’s central systems is supported by 4G LTE data which in turn relies on the high-volume IP backbone networks operated by the telecommunications carriers, based on research into packet switching now approaching 50 years old. The application operating on the cell phone interacts with core banking systems via sophisticated and highly secure middleware, crossing multiple network control points. This middleware talks in turn to the customer demand deposit system that still runs on the mainframe.

The mainframe is now running the latest version of IBM’s z/OS operating system (a direct descendant of OS/360, one of the most significant operating systems in the history of computing). The customer demand deposit banking application running on the mainframe is still based on code written in the lowest level assembler. Some of the comments in this code date back to the 1970s. It has been tuned and optimized over the decades into a system of remarkable speed and efficiency. Although re-platforming it is periodically discussed, the cost/benefit ratio for such a project has to date not been favorable.

Figure 6. Digital made this gathering easier

The reservation system looks similar on the mobile device, but the network routes it to a large cloud data center hosting the reservation system. The back-end application here is very different from the banking system; the programming languages are newer, the database is structured very differently, and the operating system is Linux.

Finally, the navigation software looks much like the reservation system, as it too is based on the cloud. However, the system is much more active, as it is continually processing inputs from millions of drivers in thousands of cities and updating traffic maps for those drivers in real time so that they can choose the best route to their destinations (e.g., dinner). The capabilities of this system are comparable to an air traffic control system, and yet it is available as a free download for our IT user.

The resulting value (as in Digital made this gathering easier [5]) is clear:

  • In an earlier era, our user might have stayed in for fear of bouncing a check, or she might have gone out and dined beyond her means.

  • The phone line at the restaurant might have been busy, so she might have risked showing up with no reservation.

  • Before texting and social media, she might not have been able to reach her friends as easily.

  • Without the traffic application, she might have run into a huge midtown traffic jam and been half an hour late.

Clearly, IT added value to her life and helped maximize her experience of social enjoyment.

1.2.2. Various forms of digital value

As we have seen, there are many ways in which digital systems deliver value. Some systems serve as the modern equivalent of file cabinets: massive and secure storage for financial transactions, insurance records, medical records, and the like. Other systems enable the transmission of information around the globe, whether as emails, web pages, voice calls, video on-demand, or data to be displayed in a smartphone application (app). Some of these systems support engaged online communities and social interactions with conversations, media sharing, and even massive online gaming ecosystems. Yet other systems enable penetrating analysis and insight by examining the volumes of data contained in the first two kinds of systems for patterns and trends. Sophisticated statistical techniques and cutting-edge approaches like neural network-based machine learning increase the insights our digital systems are capable of, at a seemingly exponential rate.

Digital technology generates value in both direct and indirect ways. People have long consumed (and paid for) communication services, such as telephone services. Broadcast entertainment was a different proposition, however. The consumer (the person with the radio or television) was not the customer (the person paying for the programming to go out over the airwaves). New business models sprang up to support the new media through the sale of advertising air time. In other words, the value proposition was indirect, or at least took multiple parties to achieve: the listener, the broadcaster, and the advertiser. Finally, some of the best known uses of digital technology were and are very indirect: for example, banks and insurance agencies using the earliest computers to automate the work of thousands of typists and file clerks.

From these early business models have evolved and blossomed myriads of creative applications of digital technology for the benefit of human beings in their ongoing pursuit of happiness and security. We see the applications mentioned at the outset: online banking, messaging, restaurant reservations, and traffic systems. Beyond that we see the use of digital technology in nearly every aspect of life.

Digital and IT pervade all of the major industry verticals (e.g., manufacturing, agriculture, finance, retail, healthcare, transportation, services) and common industry functions (e.g., supply chain, human resources, corporate finance, and even IT itself). Digital systems and technologies also are critical components of larger-scale industrial, military, and aerospace systems. For better or worse, general-purpose computers are increasingly found controlling safety-critical infrastructure and serving as an intermediating layer between human actions and machine response. Robotic systems are based on software, and the Internet of Things ultimately will span billions of sensors and controllers in interconnected webs monitoring and adjusting all forms of complex operations across the planet.

1.3. Defining digital

In order to define digital, let’s start with discussing Information Technology, commonly abbreviated "IT."

1.3.1. What is IT, anyways?

We’ve started this book in the previous section by providing an example of digital or IT value, without much discussion of how it is delivered. This is deliberate. But what is IT, anyways?

  • The computers? The networks?

  • The people who run them?

  • That organization under a Chief Information Officer that loves to say “no” and is always slow and expensive?

None of these are how this book defines “IT.” Although this is not a technical book on computer science or software engineering, the intent is that it reflects and is compatible with foundational principles.

“Information Technology” is ultimately based on the work of Claude Shannon, Alan Turing, Alonzo Church, John von Neumann, and the other pioneers who defined the central problems of information theory, digital logic, computability, and computer architecture.

Additionally, as an organizational function, IT also draws on organizational theory, systems theory, human factors and psychology, and more recent concepts such as design thinking, among many other areas. Discussions of “IT” become contentious because some think of the traditional organization, while others think of the general problem area. IT has a long history as a corporate function, a single hierarchy under a powerful Chief Information Officer. This model has had its dysfunctions, including a reputation for being slow and expensive. Often, when one encounters the term “IT,” the author using the term is referring to this organizational tradition.

We are less interested in the future of IT as a distinct organizational structure. There are many different models, from fully centralized to fully embedded. Organizational structure will be discussed in Part III.

For this book, we define “Information Technology” in terms of its historic origins. We look to IT’s common origins in automating the laborious and error-prone processes of computation, through the application of digital logic technologies based on information transmission.

Regardless of organizational form or delivery methods, IT is defined by these origins. It does not matter if the application developers and systems engineers ultimately report up through the CIO, the CMO, the CFO, or the COO. There are common themes throughout IT and digital as a professional domain: the fragility and complexity of these systems, the need for layered abstractions in their management, and more.

1.3.2. IT and digital transformation

IT doesn’t matter. [54]
— Nicholas Carr
Software is eating the world. [15]
— Marc Andreessen
The digital realm is infusing the physical realm, like tea in hot water. [262]
— Jeff Sussna
Designing Delivery

IT increasingly permeates business operations and social interactions. The breadth and depth of IT support for virtually all domains of society continues to expand. Lately, this is known as Digital Transformation [281].

The role of IT seems critical to society and the economy, but there are various points of view. Nicholas Carr, in his controversial Harvard Business Review article “IT Doesn’t Matter,” recognized that IT was becoming commoditized in an important sense [54]. As cloud providers started to offer utility-style computing, the choice of particular vendors of computers was no longer strategic. Looking to history, Carr argued that just as businesses no longer have “Vice Presidents for Electricity,” so businesses no longer need Chief Information Officers or dedicated IT departments.

A “commodity” product is one that is offered from a variety of suppliers, with little or no difference between their offerings. Commodity products tend to compete on price, not on differences in features. Wheat is a commodity. Sports cars are not. “Commoditization” is the process by which products that used to compete by being different, increasingly compete on price.

Carr has insight (there is no question IT is becoming pervasive), but he ultimately reflects a narrow view of what “IT” is. If “IT” were merely computation at the lowest level (just shuffling bits of information around, doing a little math), then perhaps it could be embedded throughout a business like electricity.

But IT has emergent aspects that are not comparable to electrical power. As it pervades all dimensions of business operations, it brings its concerns with it: complexity, fragility, and the skills required to cope with them.

One watt of electrical power is like any other watt of electrical power, and is therefore a commodity. We can use it to run toasters, hair dryers, or industrial paint mixers, and there is little concern (beyond supply and demand management) that the consumption of power by the paint mixer will affect the toaster. It’s also true that one cycle of computing, in a certain sense, is like any other cycle. But IT systems interact with each other in surprising and unpredictable ways, orders of magnitude more complex than electrical power grids. (This is not to imply the modern electrical grid is a simple system!)

IT also radically transforms industries: from retail to transportation to manufacturing to genetics. Applied software-centric IT is unleashing remarkable economic disruption.

A lawyer may depend on a cell phone, but (in keeping with Carr) beyond its provision as a commodity service, it’s not a differentiator in delivering the legal strategies a firm needs. A graphic designer may use computerized graphic tools, but these have become relatively standardized and commoditized in the past 20 years, and probably are not a source of competitive advantage in the quest for new marketing clients.

On the other hand, consider a text analytic algorithm that replaces thousands of paralegals, resulting in order-of-magnitude more accurate legal research at a fraction of the cost and time. This is strategic and disruptive to the legal community. A superior supply chain algorithm, and the ability to improve it on an ongoing basis, may indeed elevate a logistics firm’s performance above competitors. In cases like these, and they seem to be increasing, IT matters very much. The annual State of DevOps research finds that “Firms with high-performing IT organizations were twice as likely to exceed their profitability, market share, and productivity goals.” [95]

In the digitally transforming economy, traditional “back office” IT organizations find themselves called on to envision, develop, and support market-facing applications of IT. And what starts with one market-facing use case can quickly expand into entire portfolios. It is such cases that are of particular concern in this book. Ultimately, it is possible that IT is the most strategic capability an organization can invest in. As Diomidis Spinellis, editor in chief of IEEE Software, notes [257]:

“…other industries are also producing what’s in effect software (executable knowledge) but not treating it as such … Although many industries have developed their own highly effective processes over the years, software engineering maintains an essential advantage. It has developed methods and tools that let even small teams manage extremely high complexity … This advantage is important because the complexity in non-software activities is also increasing inexorably … the time has come to transform our world … by giving back to science and technology the knowledge software engineering has produced.”

This ability to manage complexity, to turn tacit into explicit and formalize the previously unstructured, is an essential aspect of Digital Transformation.

1.3.3. Defining “IT”

So, how do we define an IT problem, as opposed to other kinds of business problems? An IT problem is any problem where you are primarily constrained by your capability for and understanding of IT.

  • If you need computer scientists or engineers who understand the fundamentals of information theory and computer science, you are doing IT.

  • If you need people who understand when your information-centric problems might need to be referred to such theorists and engineers, you are likely doing IT.

  • If you need people who are skilled in building upon those fundamentals, and operating technical platforms derived from them (such as programming languages, general-purpose computers, and network routers), you are doing IT.

Regardless of whether IT is housed under a traditional CIO, an operations capability, a Chief Marketing Officer (CMO), or a “line of business”, when it is critical to operations certain concerns inevitably follow:

  • Requirements (i.e., your intent for IT)

  • Sourcing and provisioning

  • IT-centric product design and construction

  • Configuration and change management

  • Support and operations

  • Security

  • Improvement

Newcomers who propose changes to these practices in hopes of making IT more “Agile” are often surprised to find that these concerns were not mere bureaucracy, but instead had well-grounded origins in past failures. Ignoring these lessons is perilous. And yet, the traditional, process-heavy IT organization does seem dysfunctional from a business point of view: a central theme of this book.

1.3.4. IT versus "Digital"

So, is there any real difference between "IT" and "Digital"? Perception is important, and IT as a term is becoming outdated. It is associated with an earlier generation of centralized computers running back-office systems. "Digital," on the other hand, is associated with the everyday experience of computers and software, pervasive throughout our lives. "Digital" of course is based on IT, if we are to be precise in our use of words. But the social and marketing connotations are clearly distinct.

This book is not completely precise in distinguishing the two terms, and this is to some extent deliberate. But in general, if you think of "IT" as more structured and internal, and "digital" as more market-facing, you will be on the right track.

1.4. Digital services, systems, and applications

1.4.1. Inside a digital service

Figure 7. The basis of digital value

Let’s examine our diner’s value experience (The basis of digital value [6]) in more detail, without getting unnecessarily technical, and clarify some definitions along the way. The first idea we need to cover is the “moment of truth.” In terms of IT, this English-language cliche (elaborated into an important service concept by SAS group president Jan Carlzon [53]) represents the user’s experience of value.

In the example, our friend seeking a relaxing night out had several moments of truth:

  • Consulting her bank balance, and subsequent financial transactions also reflecting what was stated to her

  • Making a reservation and having it honored on arrival at the restaurant

  • Arriving on time to the restaurant, courtesy of the traffic application

  • And most importantly, having a relaxed and refreshing time with her friends

Each of these individual value experiences was co-created by our friend’s desire for value, and the response of a set of IT resources.

The “moment of truth” represents the user’s experience of value from a product.

In order to view her balance, our user is probably using an application downloaded from a “store” of applications made available to her device. On her device, this “app” is part of an intricate set of components performing functions such as:

  • Accepting “input” (user intent) through a screen or voice input

  • Processing that input through software and acting on her desire to see her bank balance

  • Connecting to the phone network

  • Securely connecting over the mobile carrier network to the Internet and then to the bank

  • Identifying the user to the bank’s systems

  • Requesting the necessary information (in this case, an account balance)

  • Receiving that information and converting it to a form that can be represented on a screen

  • Finally, displaying the information on the screen
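The steps above can be sketched in a few lines of code. This is only a toy illustration, not how any real banking app works: the FakeBank class, the token value, and the account data are all hypothetical, and the network, operating system, and security layers are collapsed into a single method call.

```python
from dataclasses import dataclass


@dataclass
class BalanceRequest:
    """What the app sends: who is asking, and for which account."""
    user_token: str
    account_id: str


class FakeBank:
    """Hypothetical stand-in for the bank's systems on the other end."""

    def __init__(self) -> None:
        self._tokens = {"token-123": "alice"}
        self._balances = {("alice", "checking"): 1042.17}

    def handle(self, request: BalanceRequest) -> dict:
        # Identify the user to the bank's systems.
        user = self._tokens.get(request.user_token)
        if user is None:
            raise PermissionError("unknown user")
        # Return the requested information (an account balance).
        balance = self._balances[(user, request.account_id)]
        return {"account": request.account_id, "balance": balance}


def render(response: dict) -> str:
    """Convert the received data to a form that can be shown on a screen."""
    return f"{response['account']}: ${response['balance']:,.2f}"


def check_balance(bank: FakeBank, token: str, account_id: str) -> str:
    request = BalanceRequest(token, account_id)  # accept and process input
    response = bank.handle(request)              # connect, identify, request
    return render(response)                      # receive, convert, display


print(check_balance(FakeBank(), "token-123", "checking"))
```

Even in this reduced form, the user-facing function is a single call, while most of the code concerns things the user never sees.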

The application, or “app,” downloaded to the phone plays a primary role, but is enabled by:

  • The phone’s operating system and associated services

  • The phone’s hardware

  • The telecommunications infrastructure (cell phone towers, long distance fiber optic cables, switching offices, and much more)

Of course, without the banking systems on the other end, there is no bank balance to transmit. These systems are similar, but on a much larger scale than our friend’s device:

  • Internet and middleware services to receive the request from the international network

  • Application services to validate the user’s identity and route the request to the appropriate handling service

  • Data services to store the user’s banking information (account identity and transactions) along with millions of other customers

  • Many additional services to detect fraud and security attacks, report on utilization, identify any errors in the systems, and much more

  • Physical data centers full of computers and associated hardware including massive power and cooling infrastructure, and protected by security systems and personnel

Consider: what does all this mean to our user? Does she care about cell phone towers, or middleware, or triply-redundant industrial-strength Power Distribution Units? Usually, not in the least. Therefore, as we study this world, we need to maintain awareness of her perspective. Our friend is seeking some value that IT uniquely can enable, but does not want to consider all the complexity that goes into it. She just wants to go out with friends. The moment of truth (The IT stack supports the moment of truth) depends on the service; the service may contain great complexity, but part of its success lies in shielding the user from that complexity.

Figure 8. The IT stack supports the moment of truth
Always remember the user’s experience. Information technology has a well-deserved reputation for being too complicated for end users: for example, trying to do something that should be simple, and finding oneself in a technical conversation about network settings.

1.4.2. What versus how

This fundamental tension between what a system is supposed to do, versus how it does it, pervades IT management and will likely define your career. “Don’t trouble me with the details, just give me the results” is the overall theme, and we encounter this reaction to complexity in many aspects of life.

Terminology is important. We need to have a more precise way of describing the IT, beyond just saying there is “lots” of it. A variety of terms are used in this text:

  • Digital product

  • IT service

  • Application

  • IT system

  • IT infrastructure

We also see discussion of components, resources, subsystems, assets, and many more terms.

There are many debates around these definitions. Sometimes these debates are helpful in clarifying the terminology you want to use on your team. But sometimes the debates don’t add any value. Beware of anyone who claims there is a “best practice” here.

In general, in this book, we will use the following definitions:

  • A digital or IT service is defined primarily in terms of WHAT not HOW

  • Defining an IT system may include a discussion of both WHAT it does and HOW it does it

  • An “application” usually means some IT service or system for end users who are not primarily concerned with IT other than wanting to get something done with it (e.g., go out to dinner)

  • “Infrastructure” usually means some IT service or system that primarily supports OTHER IT services or systems (e.g., a network “service” is not usually useful to end users without additional application services)
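One way to make the WHAT versus HOW distinction concrete is the programming idea of an interface and its implementation. The sketch below is illustrative only; the names BalanceService, InMemoryBalanceSystem, and show_balance are hypothetical and do not come from any real product.

```python
from typing import Protocol


class BalanceService(Protocol):
    """WHAT: the service contract, with no mention of technology."""

    def get_balance(self, account_id: str) -> float:
        ...


class InMemoryBalanceSystem:
    """HOW: one hypothetical system that fulfils the contract."""

    def __init__(self, balances: dict[str, float]) -> None:
        # Stand-in for the databases, middleware, and data centers
        # a real system would depend on.
        self._balances = balances

    def get_balance(self, account_id: str) -> float:
        return self._balances[account_id]


def show_balance(service: BalanceService, account_id: str) -> str:
    """An 'application' in the sense above: it depends only on WHAT."""
    return f"Balance: {service.get_balance(account_id):.2f}"


demo_system = InMemoryBalanceSystem({"acct-1": 125.5})
print(show_balance(demo_system, "acct-1"))
```

The application function would work unchanged if the in-memory system were replaced by any other system offering the same service, which is the sense in which a service is defined by WHAT and a system by both WHAT and HOW.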

Finally, the concept of the “IT stack” is important. Notice how the different technology layers appear “stacked” in The IT stack supports the moment of truth. Layered approaches to understanding IT are common, such as the OSI model and the Zachman framework (see Further Reading for useful references).

Author’s note: Service versus product

For the purposes of this book, “IT services” are equivalent to “products.” Products are goods, services, or some combination (your experience at the local coffee shop includes both). They have some feature or features that provides value for some customer. You may, in other contexts, hear phrases like “products versus services,” which imply that they are distinct. Usually, when products are contrasted with services, people are equating products with goods. For example, they will say that a jar of peanut butter is a “product,” while a haircut is a service.

However, when one of the authors worked at AT&T, the internal term for offerings like broadband networking access was not a “service,” but a “product.” Services, in this sense, are products.

In this book, we see products and services as roughly equivalent, but the two terms have some different connotations. Products usually imply an external market, where services can be either internal or external facing. While we certainly talk about “product marketing,” the term “service marketing” is rarely seen. Furthermore, some companies have recently re-conceptualized internal services organizations as “product teams.”

1.5. The digital service lifecycle

Figure 9. The essential states of the digital service (or product)

We’ve established that the digital or IT service is based on a complex stack of technology, from local devices to global networks to massive data centers. Software and hardware are layered together in endlessly inventive ways to solve problems people did not even know they had ten years ago. However, these IT service systems must come from somewhere. They must be designed, built, and operated, and continually improved over time. A simple representation of the IT service lifecycle is shown in The essential states of the digital service (or product).

  1. It starts with an idea. Someone has an insight into an IT-enabled value proposition that can make a profit, or better fulfill a mission.

  2. The idea must garner support and resources so that it can be built.

  3. The idea is then constructed, at least as an initial proof of concept or Minimum Viable Product (in the language of Ries' Lean Startup). Construction is assumed to include an element of design; in this textbook, construction and design are not represented as two large-scale separate phases. The activities may be distinct, but are conducted within a context of faster design-build iterations.

  4. There is, however, a critical state transition that will always exist. Initially, it is the change from OFF to ON when the system is first constructed, activated, and made available for access. After the system is ON, there are still distinct changes in state when new features are deployed, or incorrect behaviors (“bugs” or “defects”) are rectified.

  5. The system may be ON, but it is not delivering value until the user can access it. Sometimes, that may be as simple as providing someone with a network address, but usually there is some initial “provisioning” of system access to the user, who needs to identify themselves.

  6. The system can then deliver services (moments of truth) to the end users. It may deliver millions or billions of such experiences, depending on its scale and how one might choose to count the subjective concept of value experience.

  7. The user may have access, but may still not receive value, if they do not understand the system well enough to use it. Whether via a formal service desk, or informal social media channels, users of IT services will require and seek support on how to maximize the value they are receiving from the system.

  8. Sometimes, support requests indicate that something is wrong with the system itself. If the system is no longer delivering value experiences (bank balances, restaurant reservations, traffic directions) then some action must be taken promptly to restore service.

  9. All of the previous states in the lifecycle generate data and insight that lead to further evolution of the system. There is a wide variety of ways systems may evolve: new user functionality, more stable technology, increased system capacity, and more. Such motivations result in new construction and changes to the existing system, and so the cycle begins again.

  10. …​ Unless …​ the system’s time is at an end. If there is no reason for the system to exist any longer, it should be retired.

System retirement is often more complex and expensive than expected, and there are many examples of systems surviving well beyond the point they deliver value.
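The lifecycle described above can be read as a simple state machine. The following sketch is one possible encoding, not a definitive model: the state names and the table of allowed transitions are assumptions made for illustration, loosely following the ten steps in the list.

```python
from enum import Enum, auto


class ServiceState(Enum):
    IDEA = auto()
    RESOURCED = auto()    # the idea has support and funding
    BUILT = auto()        # at least an initial proof of concept
    DEPLOYED = auto()     # the OFF -> ON transition
    PROVISIONED = auto()  # users have been given access
    DELIVERING = auto()   # moments of truth are occurring
    SUPPORTING = auto()   # users are asking for help
    RESTORING = auto()    # something is wrong; service is being restored
    EVOLVING = auto()     # insight is feeding new construction
    RETIRED = auto()


S = ServiceState

# Assumed transition table: which states may follow which.
ALLOWED = {
    S.IDEA: {S.RESOURCED},
    S.RESOURCED: {S.BUILT},
    S.BUILT: {S.DEPLOYED},
    S.DEPLOYED: {S.PROVISIONED},
    S.PROVISIONED: {S.DELIVERING},
    S.DELIVERING: {S.SUPPORTING, S.EVOLVING, S.RETIRED},
    S.SUPPORTING: {S.DELIVERING, S.RESTORING},
    S.RESTORING: {S.DELIVERING},
    S.EVOLVING: {S.BUILT, S.RETIRED},  # the cycle begins again, or ends
    S.RETIRED: set(),
}


def advance(current: ServiceState, target: ServiceState) -> ServiceState:
    """Move to the target state, rejecting transitions not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target


def run_lifecycle(path: list[ServiceState]) -> ServiceState:
    """Walk a sequence of states, validating each step."""
    current = path[0]
    for target in path[1:]:
        current = advance(current, target)
    return current
```

For example, run_lifecycle([S.IDEA, S.RESOURCED, S.BUILT, S.DEPLOYED, S.PROVISIONED, S.DELIVERING]) walks a new service up to the point where it delivers value, while advance(S.RETIRED, S.IDEA) is rejected because a retired system has no further transitions in this sketch.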

So we see that the digital service/product evolves over time, through many repetitions (“iterations”) of the improvement cycle. This entire process, from idea to decommissioning (“inspire to retire”) can be understood as the service lifecycle (The digital service lifecycle).

Figure 10. The digital service lifecycle

We can combine the service experience (moment of truth) with the service/product lifecycle into the “dual-axis value chain” (originally presented in [25]):

Figure 11. Dual-axis value chain

The dual-axis value chain can be seen in many representations of IT and digital delivery systems. Product evolution flows from right to left, while day-to-day value flows up, through the technical stack. It provides a basis for (among other things) thinking about the IT user, customer, and sponsor, which we will cover in the next section.

1.6. Defining consumer, customer, and sponsor

In understanding IT value, it is essential to clarify the definitions of user, customer, and sponsor, and understand their perspectives and motivations. Sometimes, the user is the customer. But more often, the user and the customer are different, and one may additionally need to distinguish the role of system or service sponsor.

The following definitions may help:

  • The consumer (sometimes called the user) is the person actually interacting with the IT or digital service.

  • The customer is a source of revenue for the service. If the service is part of a profit center, the customer is the person actually purchasing the product (e.g., demand deposit banking). If the service is part of a cost center (e.g., an HR system), the customer is best seen as an internal executive, as the actual revenue-producing customers are too far removed.

  • The sponsor is the person who authorizes and controls the funding used to construct and operate the service.

Depending on the service type, these roles can be filled by the same or different people. Here are some examples:

Table 1. Defining consumer, customer, and sponsor
Example | Consumer | Customer | Sponsor | Notes
Online banking | Bank account holder | Bank account holder | Vice president, consumer banking | Customer-facing profit center with critical digital component
Online restaurant reservation application | Restaurant customers | Restaurant owners | Product owner | Profit-making digital product
Enterprise human resources application | HR analyst | Vice president, HR | Vice president, HR | Cost center funded through corporate profits
Online video streaming service | End video consumer (e.g., any family member) | Streaming account holder (e.g., parent) | Streaming video product owner | Profit-making digital product
Social traffic application | | Advertiser, data consumer | Product owner | Profit-making digital product

So, who paid for our user’s enjoyment? The bank and restaurant both had clear motivation for supporting a better online experience, and people now expect that service organizations provide this. The bank experiences less customer turnover and increased likelihood that customers add additional services. The restaurant sees increased traffic and smoother flow from more efficient reservations. Both see increased competitiveness.

The traffic application is a somewhat different story. While it is an engineering marvel, there is still some question as to how to fund it long term. It requires a large user base to operate, and yet end consumers of the service are unlikely to pay for it. At this writing, the service draws on advertising dollars from businesses wishing to advertise to passersby, and also sells its real-time data on traffic patterns to a variety of customers, such as developers considering investments along given routes.

This last example well illustrates the maxim (attributed to media theorist and writer Douglas Rushkoff [256]) that “if you don’t know how the product is making money, you are the product.”

1.7. Understanding digital context

most startups can only guess who their customers are and what markets they are in [31 p. 37].
— Steve Blank
The Four Steps to the Epiphany

As you consider embarking on a journey of IT or digital value, you need to orient to your surroundings and create an initial proposal or plan for how you will proceed. If you are actually a startup, you need a business plan. If you are working as an intrapreneur in a larger organization, you will still need some kind of formal proposal. This section describes some tools and thinking approaches that may be useful at this very earliest stage. There are more focused, product-specific approaches in the Chapter 4 section on product discovery techniques.

1.7.1. Market-facing, supporting, back office

In the previous section we discussed the question of “who pays/who benefits” for the service, proposing that the service consumer, the service customer, and the service sponsor might be three distinct roles (sometimes collapsing into two individuals, or even one).

We see this again in how we can categorize the “customers” of IT services and systems. Roughly, such services can be:

  • Directly market- and consumer-facing (e.g., Facebook), to be used by external consumers and paid for by either them or closely associated customers (e.g., Netflix, or an online banking system).

  • Customer “supporting” systems, such as the online system that a bank teller uses when interacting with a customer. Customers do not interact directly with such systems, but customer-facing representatives do, and problems with such systems may be readily apparent to the end customer.

  • Completely “back-office” systems (HR, payroll, marketing, etc.).

Note, however, that (especially in the current digitally transforming market) a service previously thought of as “back-office” (when run internally) becomes “market-facing” when developed as a profit-seeking offering. For example, an HR system built internally is “back-office,” but Workday is a directly market-facing product, even though the two services may be similar in functionality. In other words, it’s all relative. Especially when products are not market-facing, we start to run into the problem of distinguishing discovery versus design, as we discuss below.

1.7.2. Diffusion theory and other approaches

As you start to think about digital value, you must think about the context for your startup or product idea. What is the likelihood of its being adopted? Is it part of a broader “movement” of technological innovation? Where is the customer base in terms of its willingness to adopt the innovation? A well known approach is the idea of “diffusion theory,” first researched by Everett Rogers and proposed in his Diffusion of Innovations [226].

Rogers' research proposed the idea of “Adopter Categorization on the Basis of Innovativeness,” with a well-known graphic (Technology adoption categories [7], Figure 7-3, p. 281).

Figure 12. Technology adoption categories

Rogers went on to characterize the various stages:

  • Innovators: Venturesome risk-takers

  • Early adopters: Opinion leaders

  • Early majority: Deliberative, numerous

  • Late majority: Skeptical, also numerous

  • Laggards: Traditional, isolated, conservative

Rogers' figure was popularized in the following variation by Geoffrey Moore [189] in his bestseller Crossing the Chasm (Purported “chasm” between adopter categories [8], p. 21).

Figure 13. Purported “chasm” between adopter categories

You’ll see Moore’s variation often, but you should be aware that Rogers himself (the person who has researched the data) says that (p. 282):

Past research shows no support for this claim of a “chasm” between certain adopter categories. On the contrary, innovativeness, if measured properly, is a continuous variable and there are no sharp breaks or discontinuities between adjacent adopter categories (although there are important differences between them).
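Rogers' point that innovativeness is a continuous variable can be illustrated numerically. The sketch below assumes, as Rogers' categorization is commonly described (the cut points are not stated in the text above), that time of adoption is roughly normally distributed and that the five categories are slices of that one curve at one and two standard deviations from the mean. The function name category_shares is hypothetical.

```python
from statistics import NormalDist

# Standardized time of adoption: mean 0, standard deviation 1 (an assumption).
adoption_time = NormalDist(mu=0.0, sigma=1.0)


def category_shares() -> dict[str, float]:
    """Share of adopters in each category, as slices of one continuous curve."""
    cdf = adoption_time.cdf
    return {
        "Innovators": cdf(-2.0),                   # earlier than mean - 2 sd
        "Early adopters": cdf(-1.0) - cdf(-2.0),   # mean - 2 sd to mean - 1 sd
        "Early majority": cdf(0.0) - cdf(-1.0),    # mean - 1 sd to the mean
        "Late majority": cdf(1.0) - cdf(0.0),      # the mean to mean + 1 sd
        "Laggards": 1.0 - cdf(1.0),                # later than mean + 1 sd
    }


for name, share in category_shares().items():
    print(f"{name}: {share:.1%}")
```

Under these assumptions the category boundaries are simply conventions for dividing a smooth curve, which is consistent with Rogers' statement that there are no sharp breaks between adjacent categories.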

The idea of technology diffusion frames the problem for us, but we need more. Steve Blank, in his influential book The Four Steps to the Epiphany [31 p. 31], argues there are four categories for startups:

  • Startups that are entering an existing market

  • Startups that are creating an entirely new market

  • Startups that want to re-segment an existing market as a low-cost entrant

  • Startups that want to re-segment an existing market as a niche player

Understanding which category you are attempting is critical, because “the four types of startups have very different rates of customer adoption and acceptance.”

Another related and well known categorization of competitive strategies comes from Michael Treacy and Fred Wiersema [269]:

  • Customer intimacy

  • Product leadership

  • Operational excellence

It is not difficult to categorize well known brands in this way:

Table 2. Companies and their competitive strategies
Customer intimacy Product leadership Operational excellence



Dell Technologies

Home Depot



However, deciding which strategy to pursue as a startup may require some experimentation.

1.7.3. Business discovery approaches

Startups that survive the first few tough years do not follow the traditional product-centric launch model espoused by product managers or the venture capital community … In particular, the winners invent and live by a process of customer learning and discovery. I call this process “Customer Development,” a sibling to “Product Development,” and each and every startup that succeeds recapitulates it, knowingly or not.
— Steve Blank
The Four Steps to the Epiphany

Let’s start with two well known approaches that can help you bridge from an understanding of your product context, to an effective vision for building and sustaining a product:

  • Alexander Osterwalder’s Business Model Canvas

  • Eric Ries' Lean Startup

Business model canvas

One recent book that’s been influential among entrepreneurs is Alex Osterwalder’s Business Model Generation [206]. This book is perhaps best known for introducing the concept of the Business Model Canvas, which it defines as “A shared language for describing, visualizing, assessing, and changing business models.” The Business Model Canvas uses nine major categories to describe the business model:

  • Key Partners

  • Key Activities

  • Value Proposition

  • Customer Relationships

  • Customer Segments

  • Key Resources

  • Channels

  • Cost Structure

  • Revenue Streams

and suggests they be visualized as in Business Model Canvas [9].

Figure 14. Business Model Canvas

The canvas is then used in collaborative planning, e.g., as a large format wall poster where the business team can brainstorm, discuss, and fill in the boxes (e.g., what is the main “Value Proposition”? Mobile bank account access?).

Figure 15. Rough approximation of author’s Business Model Canvas

Osterwalder and his colleagues, in Business Model Generation and the follow-up Value Proposition Design [207], suggest a wide variety of imaginative and creative approaches to developing business models and value propositions, in terms of patterns, processes, design approaches, and overall strategy.

Business case analysis

There are a wide variety of analysis techniques for making a business case at a more detailed level. Donald Reifer, in Making the Software Business Case [219], lists:

  • Breakeven analysis

  • Cause-and-effect analysis

  • Cost/benefit analysis

  • Value chain analysis

  • Investment opportunity analysis

  • Pareto analysis

  • Payback analysis

  • Sensitivity analysis

  • Trend analysis

A primary theme of this book is that empirical, experimental approaches are essential to digital management. Any analysis, carried to an extreme without a sound basis in real data, risks becoming a “castle in the air.” But when you are putting real money on the line (even the opportunity costs of the time you are spending on your startup), it is advisable to look at the decision from various perspectives. These techniques can be useful for that purpose. However, once you have some indication there might be business value in a given idea, applying Lean Startup techniques may be more valuable than continuing to analyze.
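Two of the techniques listed above, breakeven analysis and payback analysis, reduce to simple arithmetic, and a crude sensitivity analysis is just the same arithmetic repeated with different inputs. The sketch below uses standard textbook formulas; the function names and all of the numbers are hypothetical and chosen only for illustration.

```python
def breakeven_units(fixed_costs: float, price: float, variable_cost: float) -> float:
    """Units that must be sold before revenue covers all costs."""
    margin = price - variable_cost  # contribution of each unit sold
    if margin <= 0:
        raise ValueError("price must exceed variable cost per unit")
    return fixed_costs / margin


def payback_periods(investment: float, net_inflow_per_period: float) -> float:
    """Number of periods needed to recover the initial investment."""
    if net_inflow_per_period <= 0:
        raise ValueError("net inflow must be positive")
    return investment / net_inflow_per_period


# Hypothetical startup: 50,000 in fixed costs, a 25 price, 5 variable cost.
units = breakeven_units(fixed_costs=50_000, price=25, variable_cost=5)
print(f"Breakeven at {units:.0f} units")

# Hypothetical investment of 120,000 recovered at 10,000 per month.
months = payback_periods(investment=120_000, net_inflow_per_period=10_000)
print(f"Payback in {months:.0f} months")

# Crude sensitivity analysis: how does breakeven move if the price changes?
for price in (20, 25, 30):
    needed = breakeven_units(50_000, price, 5)
    print(f"price {price}: breakeven at {needed:.0f} units")
```

The sensitivity loop shows why such analyses should be grounded in real data: a modest change in an assumed price moves the breakeven point substantially.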

Lean Startup
The goal of a startup is to figure out the right thing to build — the thing customers want and will pay for — as quickly as possible. In other words, the Lean Startup is a new way of looking at the development of innovative new products that emphasizes fast iteration and customer insight, a huge vision, and great ambition, all at the same time.
— Eric Ries
The Lean Startup
Figure 16. Lean Startup flowchart

Lean Startup is a philosophy of entrepreneurship developed by Eric Ries [223]. It is not specific to IT; rather, it is broadly applicable to all attempts to understand a product and its market. (According to our definition of product management, a workable market position is essential to any product.)

The idea of the Lean Startup has had profound influence on product design, including market-facing and even internal IT systems. It is grounded in Agile concepts such as:

“Do the simplest thing that could possibly work.”

Lean Startup calls for an iterative “Build-Measure-Learn” cycle (Lean Startup flowchart [10]). Repeating this cycle frequently is the essential process of building a successful startup (whatever the digital proportion).

  • Develop an idea for a Minimum Viable Product (MVP)

  • Measure its effectiveness in the market (internal/external)

  • Learn from the experiment

  • Decide to persevere or pivot (change direction while leveraging momentum)

  • New idea development, evolution of MVP

Flowcharts such as the one shown are often seen to describe the Lean Startup process. We will go into much more depth on product management in Chapter 4 and Part III.

IT exists in a social context

Like the proverbial fish that doesn’t understand water (because water is all it knows), we may lose sight of the laws and social institutions that enable us to use computers in the ways covered in this chapter. For example, the ability for banks to hold money as electronic bits on a computer is rooted in the earliest history of banking and the emergence of centralized settlement and clearing mechanisms. Cell phone companies rely on international treaties, and national laws and regulations allocating radio spectrum. Patent and copyright law support the market for commercial software.

The existence of physical voice and data connectivity relies on laws supporting utility easements and rights of way, and even treaties such as the Law of the Sea. (How is it that undersea cables remain unmolested?) More broadly, the entire technological infrastructure relies on education, easily disrupted supply chains, market demand, and a functioning economy. Advances in digital technology require trained scientists and engineers in many fields: not only computer science, but electrical engineering, materials science, chemical engineering, and many others. The institutions that produce these highly educated practitioners are not easily or quickly scaled.

In short, without a social infrastructure to support it, advanced technology cannot exist.

1.8. Conclusion

In this chapter, we discussed the basic questions of IT value and how it is experienced and developed. Through the mechanism of a hypothetical modern IT user, we covered (at a very high level) the necessary ingredients of the IT experience. We considered the user’s moment of truth, and the massive IT complexity that sustains it. We also discussed a high-level lifecycle model for IT applications and services, and explored some initial definitions for user, customer, and sponsor — critical distinctions to make in an age of digital transformation.

As you proceed into the course, the key takeaway from this chapter is “why IT?” Why do people need it, and how is it valuable? That should always remain at the top of your mind as you proceed in your IT education.

1.8.1. Discussion questions

  1. Discuss: How does IT contribute to your enjoyment of life and experiences of value?

  2. Read Apps Are Wrecking Mom-and-Pop Pizza Shops and discuss whether “IT matters” to the local pizzerias.

  3. Read the Wikipedia articles on mainframe computing and Amazon Web Services and discuss with your team. What has changed in computing? What remains the same?

1.8.2. Research & practice

  1. Go to any popular online service (Facebook, Netflix, Flickr, etc.). How would you describe the “moments of truth” or value experiences these sites offer users? There may be several.

  2. On your own or with a team, develop an idea for an IT-based product you could take to market. Present to the class.

  3. (Continued) Who are the user, customer, and sponsor of your product?

  4. Research and apply one of the business case analysis techniques to your idea.

1.8.3. Further reading



2. Digital infrastructure

2.1. Introduction

Figure 17. Racks in a data center

As mentioned in the Part introduction, you cannot start developing a product until you decide what you will build it with. (You may have a difficult time writing an app for a mobile phone if you choose the COBOL programming language!) You also need to understand something of how computers are operated, enough so that you can make decisions on how your system will run. Most startups choose to run IT services on infrastructure owned by a cloud computing provider, but there are other options. Certainly, as you scale up, you’ll need to be more and more sophisticated in your understanding of your underlying IT services.

Configuring your base platform is one of the most important capabilities you will need to develop. You’ll never stop doing it. The basis of modern configuration management is version control, which we cover here.

This is one of the more technical chapters. Supplementary reading may be required for those completely unfamiliar with computing. See Assumptions of the Reader for notes on the book’s approach. [11]

2.1.1. Chapter summary

  • Introduction

    • Chapter summary

    • Learning objectives

  • Infrastructure overview

    • What is infrastructure?

    • Basic IT infrastructure concepts

  • Choosing infrastructure

    • From “physical” compute to cloud

    • Virtualization

    • Why is virtualization important?

    • Virtualization versus cloud

    • Containers and looking ahead

  • Infrastructure as code

    • A simple infrastructure as code example

  • Configuration management: the basics

    • What is version control?

    • Package management

    • Deployment management

  • Topics in IT infrastructure

    • Configuration management, version control, and metadata

  • Conclusion

    • Discussion questions

    • Research & practice

    • Further reading

2.1.2. Learning objectives

  • Understand fundamental principles of operating computers as infrastructure for a service

  • Understand cloud as a computing option

  • Understand basic principles of “infrastructure as code”

  • Understand the importance and basic practices of version control and why it applies to infrastructure management

2.2. Infrastructure overview

If you are familiar with computers and networks, you may wish to skip ahead to Choosing infrastructure.

In the previous chapter, you were introduced to the concept of a "moment of truth," and in the final exercises, asked to think of a product idea. Some part of that product requires writing software, or at least configuring some IT system (IT being defined as in Chapter 1). You presumably have some resources (time and money). It’s Monday morning, you have cleared all distractions, shut down your Twitter and Facebook feeds, and are ready to start building.

Not so fast.

Before you can start writing code, you need to decide how and where it will run. This means you need some kind of a platform: some computing resources, most likely networked, where you can build your product and eventually expose it to the world. It’s hard to build before you decide on your materials and tools. You need to decide what programming language you are going to write in, what framework you are going to use, and how those resources will result in an operational system capable of rendering IT services. You are probably swimming in a sea of advice and options regarding your technical choices. In previous decades, books such as this might have gone into the specifics of particular platforms: mainframe versus minicomputers, COBOL versus FORTRAN, Windows versus UNIX systems, etc.

At this writing, JavaScript is a leading choice of programming language, in conjunction with various frameworks and NoSQL options (e.g., the MEAN stack, for MongoDB, Express, Angular, and Node.js), but millions of developers are still writing Java and .Net, and Ruby and Python have significant followings. Linux is arguably the leading platform, but commercial UNIX and Microsoft platforms are still strong. And periodically it’s reported that the majority of the world’s transactions still run on COBOL-based systems.

However, in the past few years, some powerful infrastructure concepts have solidified that are independent of particular platforms:

  • “Cloud”-based technology services

  • Automation and “infrastructure as code”

  • The centrality of source control

  • The importance of package management

  • Policy-based infrastructure management

(We’ll get to test-driven development, pipeline automation & DevOps in the next chapter).

This might seem like a detour — you are in a hurry to start writing code! But industry practice is clear. You check your code into source control from Day One. You define your server configurations as recipes, manifests, or at least shell scripts, and check those definitions into source control as well. You keep track of what you have downloaded from the Internet and what version of stuff you are using, through package management (which uses different tools than source control). Always downloading the “latest” package from its upstream creator might seem like the way to stay current, but it will kill you when stuff works on one server but not on another.
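The "works on one server but not on another" problem can be sketched in a few lines of Python. This is only an illustration, not a real package manager: the server names, package names, and version numbers below are hypothetical, and in practice the inventories would come from your package manager's own manifest or lock file.

```python
# Hypothetical inventories: package name -> version.
# "pinned" is the version list you have checked into source control.
pinned = {"webframework": "2.1.0", "dbdriver": "5.4.2"}

servers = {
    "server-a": {"webframework": "2.1.0", "dbdriver": "5.4.2"},
    "server-b": {"webframework": "2.3.1", "dbdriver": "5.4.2"},  # took "latest"
}


def find_drift(pinned, installed):
    """Return {package: (expected, actual)} for every package that differs."""
    drift = {}
    for name, expected in pinned.items():
        actual = installed.get(name)  # None if the package is missing
        if actual != expected:
            drift[name] = (expected, actual)
    return drift


for server, installed in servers.items():
    print(server, find_drift(pinned, installed) or "matches the pinned versions")
```

Here server-b silently picked up a newer version of one package, which is exactly the kind of difference that is hard to spot by hand and easy to spot when versions are recorded explicitly.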

So, you need to understand a few things and make a few decisions that you will be living with for a while, and will not be easily changed.

2.2.1. What is infrastructure?

Infrastructure is a tricky word. Google defines it thus:

The basic physical and organizational structures and facilities (e.g., buildings, roads, and power supplies) needed for the operation of a society or enterprise.

In general, it connotes the stuff behind the scenes, the things you need but don’t want to spend a lot of time thinking about. We will spend a lot of time examining what we mean by “infrastructure” in this book, as it is fundamental to understanding the “business of IT.” This book defines “IT infrastructure” recursively as “the set of IT concerns that are of particular interest to IT.”

  • An application or business service is consumed by people who are NOT primarily concerned with IT. For example, a customer-facing online banking service is consumed by end users.

  • An IT infrastructure service is a service consumed by other IT-centric teams and capabilities. For example, a database or a load balancing service is consumed by other IT teams.

IT infrastructure is one form of infrastructure. Other kinds of infrastructure might include mechanical, electrical, and plant (ME & P) investments. IT infrastructure, like IT itself, is defined by its fundamental dependence on information and computing theory.

Today’s application becomes tomorrow’s infrastructure

Forty years ago, building an “application” would have included building its database, perhaps even its file management. This was rightly determined to be a general-case problem that could be the basis for commodity software, and so companies like Oracle were born.

Operating systems took on more and more functionality, technology became more and more reliable and easily configured, and once unique (and competitively differentiating) functionality became commoditized and more and more deeply buried in the “stack”: the IT infrastructure.

2.2.2. Basic IT infrastructure concepts

There are many books (some in the Further Reading section for this chapter) on all aspects of IT infrastructure, which is a broad and deep topic. Our discussion of it here has to be high level, as appropriate for a survey course. We’ve established what we mean by IT generally. Pragmatically, there are three major physical aspects to “IT infrastructure” relevant to the practitioner:

  • Computing cycles (sometimes called just “compute”)

  • Memory & storage (or “storage”)

  • Networking & communications (or “network”)

We will discuss a variety of subsidiary concerns and concepts, but those are the big three.

Compute is the resource that performs the rapid, clock-driven digital logic that transforms data inputs to outputs.

For a beginner-level introduction to the world of digital logic at its most fundamental, see Code, by Charles Petzold.

If we have a picture of our friend, and we use a digital filter to adjust the brightness, that is an application of compute power. The picture is made up of “pixels” (see Picture enlarged to show pixels [12]), which are nothing but numbers representing the color of some tiny square of the picture. Each of those numbers needs to be evaluated and the brightness adjusted. There may be millions in a single image.

Figure 18. Picture enlarged to show pixels

For example, let’s say that the pixel values range from 1 to 10, with 1 being the darkest and 10 being the lightest (in reality, the range is much larger). To brighten a picture, we might tell the computer:

  1. Look at a pixel

  2. If it is between 1 and 3, add 2

  3. If it is between 4 and 6, add 1

  4. Move to a new pixel

  5. Repeat the above 4 lines until all pixels are done
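The steps above can be sketched directly in Python. This is a toy illustration using the simplified 1-to-10 scale from the text; the function name and the sample picture are made up for the example, and real image software works on much larger value ranges.

```python
def brighten(pixels):
    """Apply the brightening rule from the steps above to a list of pixel values."""
    result = []
    for value in pixels:          # step 1: look at a pixel
        if value <= 3:            # step 2: the darkest values get the biggest lift
            value += 2
        elif value <= 6:          # step 3: mid-range values get a smaller lift
            value += 1
        result.append(value)      # step 4: move to the next pixel
    return result                 # step 5: the loop repeats until all are done


picture = [1, 3, 4, 6, 7, 10]     # a tiny six-pixel "picture"
print(brighten(picture))          # [3, 5, 5, 7, 7, 10]
```

Note the use of `elif`: without it, a pixel that started at 3 would be raised to 5 by the first rule and then raised again by the second. Even a five-step recipe needs this kind of precision before a computer can follow it.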

At a more realistic level, the process might be executed as a batch on a workstation, for hundreds or thousands of photos at a time.

Computers process instructions at the level of “true” and “false,” represented as binary “1s” and “0s.” Because humans cannot easily understand binary data and processing, higher-level abstractions of machine code and programming languages are used.
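As a small illustration of that gap between binary and human-readable forms, Python can show the same pixel value both ways. The variable names are just for this example.

```python
pixel_value = 10

as_binary = format(pixel_value, "08b")   # eight binary digits, zero-padded
print(as_binary)                          # 00001010

back_to_decimal = int(as_binary, 2)       # interpret the digits as base 2
print(back_to_decimal)                    # 10
```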

It’s critical to understand that computers, traditionally understood, can only operate in precise, "either-or" ways. Computers are often used to automate business processes, but in order to do so, the process needs to be carefully defined, with no ambiguity. Either the customer has bought the item, or they have not. Either they live in Minnesota, or Wisconsin. Complications and nuances, intuitive understandings, judgment calls: in general, computers can’t do any of this unless and until you program them to, at which point the logic is no longer intuitive or a judgment call. (Deep learning systems based on neural networks, and similar artificial intelligence, are beyond our discussion here, and in any event are still based on these binary fundamentals.)

The irony is that, as we will discuss repeatedly in this book, the process by which these deterministic digital systems are developed is itself a process requiring all those "soft" components that computers can’t automate.

Computer processing is not free. Moving data from one point to another — the fundamental transmission of information — requires matter and energy, and is bound up in physical reality and the laws of thermodynamics. The same applies for changing the state of data, which usually involves moving it somewhere, operating on it, and returning it to its original location. In the real world, even running the simplest calculation has physical and therefore economic cost, and so we must pay for computing.

If these concepts are strange to you, spend some time with the suggested Wikipedia articles, or otherwise researching the topics. Wikipedia in the area of fundamental computer concepts is generally accurate.
Figure 19. Disks in a storage array

Storage. But where did the picture come from? The data comprising the pixels needs to be stored somewhere (see Disks in a storage array [13]). Sometimes you will hear the technical term “persisted.” The combined set of pixels and their precise values can be termed the “state” of the photograph; the digital logic of the filter alters the state, and also needs to save this new state somewhere (otherwise it will be lost).

Many technologies have been used for digital storage. Increasingly, the IT professional need not be concerned with the physical infrastructure used for storing data. As we will cover in the next section, storage increasingly is experienced as a virtual resource, accessed through executing programmed logic on cloud platforms. “Underneath the covers” the cloud provider might be using various forms of storage, from RAM to solid state drives to tapes, but the end user is, ideally, shielded from the implementation details (part of the definition of a service).

However, it is important to understand that in general, storage follows a hierarchy. Just as we might “store” a document by holding it in our hands, setting it on a desktop, filing it in a cabinet, or archiving it in a banker’s box in an offsite warehouse, so computer storage also has different levels of speed and accessibility:

  • On-chip registers and cache

  • Random-access memory (RAM), aka “main memory”

  • Online mass storage, often “disk”

  • Offline mass storage, e.g., “tape”

If this is unfamiliar, see Wikipedia or research on your own; you should have a basic grasp of this issue.
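To get a feel for how far apart these levels are, here is a rough sketch. The access times are illustrative orders of magnitude chosen for this example, not measurements of any particular system; actual figures vary widely by hardware and generation.

```python
# Rough, illustrative access times in nanoseconds (assumptions for this
# sketch only; orders of magnitude, not measurements).
STORAGE_TIERS = [
    ("On-chip registers and cache", 1),
    ("Random-access memory (RAM)", 100),
    ("Online mass storage (disk)", 10_000_000),
    ("Offline mass storage (tape)", 60_000_000_000),
]

for name, nanoseconds in STORAGE_TIERS:
    print(f"{name:<30} roughly {nanoseconds:,} ns")
```

The point is the shape of the hierarchy rather than the exact numbers: each step away from the processor typically buys more capacity at the cost of much slower access.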

Network. We can change the state of some data, or store it. We also need to move it. This is the basic concern of networking, to transmit data (or information) from one location to another. If you use your cell phone to look up your bank balance, there is a network involved (see Network cabling in a rack [14]); otherwise, how did the data get from the bank’s computer in New Jersey to your cell phone in Minnesota? We see evidence of networking every day; you may be familiar with coaxial cables for cable TV, or telephone lines strung from pole to pole in many areas. However, like storage, there is also a hierarchy of networking:

Figure 20. Network cabling in a rack

And like storage and compute, networking as a service increasingly is independent of implementation. The developer uses programmatic tools to define expected information transmission, and (again ideally) need not be concerned with the specific networking technologies or architectures serving their needs.

2.3. Choosing infrastructure

The following is written as if you are a decision-maker in the early stages of conceiving a product. If you are an individual contributor in a large enterprise, or even a newcomer to an established product team, these decisions will likely have been made for you.

But at some point, someone had to go through this decision process, before anything could be developed.
Figure 21. Tower-style servers on a rack

There is ferocious turbulence in the IT infrastructure market: cloud computing, containers, serverless computing, providers coming and going, various arguments over "which platform is better," and so forth. As an entrepreneur, you need to understand what technical trends are important to you. Furthermore, you will need to make some level of commitment to your technical architecture. And at some point you WILL be asked, “You’re still using that old technology?” As a startup, you have a couple of initial decisions to make regarding infrastructure and tools:

  • What is my vision for bringing my product to the world?

  • What toolset should I use to forward this vision?

As a startup, it would seem likely that you would use a commodity cloud provider. This text is based on this assumption (physical IT asset management will be discussed in Sections 3 and 4). Is there any reason why the public cloud would not work for you? For example, if you want to develop on a LAMP stack, you need a cloud provider that will support you in this. While most are very flexible, you will need to consider the specific support levels they are offering; a provider that supports the platform (and not just the operating system) might be more valuable, but there may be higher costs and other trade-offs.

There is a near-infinite amount of material, debate, discussion, books, blogs, lists, and so forth concerning choice of language and platform. Exploring this area is not the purpose of this book. However, this book has certain assumptions:

  • Your system will be built, at least in part, with some form of programming language which is human-readable and compiled or interpreted into binary instructions.

  • Your system will run on general-purpose digital computers using well known technologies.

  • Your computing environment is networked in a standard way.

  • You use the concept of a software pipeline, in which new functionality is developed in a scope distinct from what is currently offered as your product/service.

  • New functionality moves through the pipeline at significant volumes and velocity and you are concerned with optimizing this overall flow [15].

Dynamic, automated infrastructure (as provided by cloud suppliers) enables rapid iterations and scaling. Iterative development and rapid scaling, while possible, was often more difficult with earlier, less automated technical platforms.

There is a long tradition in IT management of saying “How can you be thinking about infrastructure before you have gone deeply into requirements?”

Let’s be clear: in defining a product (Chapter 1) you have already started to think about "requirements," although we have not yet started to use that term. (We’ll define it in Chapter 3.) The idea that all requirements need to be understood in detail before considering technical platform is, in general, an outmoded concept that made more sense when hardware was more specialized and only available as expensive, organization-owned assets. With the emergence of cloud providers able to sell computing services, companies no longer need to commit to large capital outlays. And digital product professionals realize that requirements are never fully “understood” up front (more on this in the next chapter). Your MVP is an initial statement of requirements from which you should be able to infer at least initial toolset and platform requirements. Here, to get you started, are the major players as of this writing:

Table 3. Major technical stacks

  • Stack 1 (Enterprise Java): Oracle DB; Commercial UNIX systems

  • Stack 2 (Microsoft): MS SQL Server; Microsoft Windows

  • Stack 3 (LAMP): Apache Web Server; Red Hat Linux

  • Stack 4 (MEAN): JavaScript, Express & Angular; Ubuntu Linux

Ruby on Rails is another frequently encountered platform. If you are building a data or analytics-centric product, R and Python are popular. There is a good reason, however, why you should not spend too much time “analyzing” before you make platform decisions. The reality is that you cannot know all of the factors necessary to make a perfect decision, and in fact the way you will learn them is by moving forward and testing out various approaches. So, choose Ruby on Rails, or LAMP, or MEAN, and a hosting provider who supports them, and start. You can easily stand up environments for comparison using cloud services, or even with lightweight virtualization (Vagrant or Docker) on your own personal laptop. Do not fall into analysis paralysis. But be critical of everything, especially in your first few weeks of development. Ask yourself:

  • Can I see myself doing things this way for the next year?

  • Will I be able to train people in this platform?

  • Will this scale to a bigger code base? Higher performance? Faster throughput of new features?

If you become uncomfortable with the answers, you should consider alternatives.

The technical “spike”

Scrum-based development uses the concept of a “spike” to represent effort whose outcome is not a shippable product, but rather research and information. Consider thinking of your infrastructure selection process in this way.

2.4. From “physical” compute to cloud

The National Institute of Standards and Technology (NIST) defines cloud computing as:

a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [197]

Before cloud, people generally bought computers of varying sizes and power ratings to deliver the IT value they sought. With cloud services, the same compute capacity can be rented or leased by the minute or hour, and accessed over the Internet.

There is much to learn about cloud computing. In this section, we will discuss the following aspects:

  • Virtualization basics

  • Virtualization, managed services, and cloud

  • The various kinds of cloud services

  • Future trends

2.4.1. Virtualization

Virtualization, for the purposes of this section, starts with the idea of a computer within a computer. (It has applicability to storage and networking as well but we will skip that for now). In order to understand this, we need to understand a little bit about operating systems and how they relate to the physical computer.

Figure 22. Laptop computer

Assume a simple, physical computer such as a laptop (see Laptop computer, [16]). When the laptop is first turned on, the OS loads; the OS is itself software, but is able to directly control the computer’s physical resources: its CPU, memory, screen, and any interfaces such as WiFi, USB, and Bluetooth. The operating system (in a traditional approach) then is used to run “applications” such as Web browsers, media players, word processors, spreadsheets, and the like. Many such programs can also be run as applications within the browser, but the browser still needs to be run as an application.

In the simplest form of virtualization, a specialized application known as a hypervisor is loaded like any other application. The purpose of this hypervisor is to emulate the hardware computer in software. Once the hypervisor is running, it can emulate any number of “virtual” computers, each of which can have its own operating system (see Virtualization is computers within a computer). The hypervisor mediates the virtual machine (VM) access to the actual, physical hardware of the laptop; the VM can take input from the USB port, and output to the Bluetooth interface, just like the master OS that launched when the laptop was turned on.

There are two different kinds of hypervisors. The example we just discussed was an example of a Type 2 hypervisor, which runs on top of a host OS. In a Type 1 hypervisor, a master host OS is not used; the hypervisor runs on the “bare metal” of the computer and in turn “hosts” multiple VMs.

Figure 23. Virtualization is computers within a computer

You can experiment with a hypervisor by downloading VirtualBox (on Windows, macOS, or Linux) and using Vagrant to download and initialize a Linux virtual machine. You’ll probably want at least 4 gigabytes of RAM on your laptop and a gigabyte of free disk space, at the bare minimum.

Paravirtualization, e.g., containers, is another form of virtualization found in the marketplace. In a paravirtualized environment, a core OS is able to abstract hardware resources for multiple virtual guest environments without having to virtualize hardware for each guest. The benefit of this type of virtualization is increased Input/Output (I/O) efficiency and performance for each of the guest environments.

However, while hypervisors can support a diverse array of virtual machines with different OSs on a single computing node, guest environments in a paravirtualized system generally share a single OS. See Virtualization types for an overview of all the types.

Figure 24. Virtualization types

Virtualization was predicted in the earliest theories that led to the development of computers. Turing and Church realized that any general-purpose computer could emulate any other. Virtual systems have existed in some form since 1967 at the latest, only 20 years after the first fully functional computers.

And yes, you can run computers within computers within computers with virtualization. Not all products support this, and they get slower and slower the more levels you create, but the logic still works.

2.4.2. Why is virtualization important?

Virtualization attracted business attention as a means to consolidate computing workloads. For years, companies would purchase servers to run applications of various sizes, and in many cases the computers were badly underutilized. Because of configuration issues and (arguably) an overabundance of caution, utilization in a pre-virtualization data center might average 10-20%. That’s up to 90% of the computer’s capacity being wasted (see Inefficient utilization).

Figure 25. Inefficient utilization

The above figure is a simplification. Computing and storage infrastructure supporting each application stack in the business were sized to support each workload. For example, a payroll server might run on a different infrastructure configuration than a data warehouse server. Large enterprises needed to support hundreds of different infrastructure configurations, increasing maintenance and support costs.

The adoption of virtualization allowed businesses to compress multiple application workloads onto a smaller number of physical servers (see Efficiency through virtualization).

Figure 26. Efficiency through virtualization
For illustration only. A utilization of 62.5% might actually be a bit too high for comfort, depending on the variability and criticality of the workloads.
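The arithmetic of consolidation can be sketched as a simple packing exercise. The example below is a hedged illustration, not a capacity-planning tool: it assumes all physical hosts are identical, expresses each workload as a percentage of one host, and uses hypothetical figures (five applications at 12.5% each). The function name and the "ceiling" parameter are inventions of this sketch.

```python
def hosts_needed(workloads, ceiling):
    """First-fit packing: place each workload on the first host with room.

    workloads: utilization of each application, as a percentage of one host.
    ceiling:   the highest total utilization we will tolerate on any host.
    Returns a list with the total utilization of each host used.
    """
    hosts = []
    for load in workloads:
        for i, used in enumerate(hosts):
            if used + load <= ceiling:
                hosts[i] = used + load
                break
        else:                      # no existing host had room
            hosts.append(load)
    return hosts


# Hypothetical example: five applications, each using 12.5% of its own server.
before = [12.5] * 5
print(hosts_needed(before, ceiling=70))   # [62.5]
print(hosts_needed(before, ceiling=50))   # [50.0, 12.5]
```

With a 70% ceiling, five lightly used servers collapse onto one host. Lowering the ceiling to 50% to leave more headroom for variable or critical workloads means a second host is needed, which is the trade-off the note above alludes to.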

In most virtualized architectures, the physical servers supporting workloads share a consistent configuration, which makes it easy to add and remove resources from the environment. The VMs may still vary greatly in configuration, but the fact of virtualization makes managing that easier: the VMs can be easily copied and moved, and increasingly can be defined as a form of code (see next section).

Virtualization thus introduced a new design pattern into the enterprise where computing and storage infrastructure became commoditized building blocks supporting an ever-increasing array of services. But what about where the application is large and virtualization is mostly overhead? Virtualization still may make sense in terms of management consistency and ease of system recovery.

2.4.3. Virtualization, managed services, and cloud

Companies have always sought alternatives to owning their own computers. There is a long tradition of managed services, where applications are built out by a customer and then their management is outsourced to a third party. Using fractions of mainframe “time-sharing” systems is a practice that dates back decades. However, such relationships took effort to set up and manage, and might even require bringing physical tapes to the third party (sometimes called a “service bureau”). Fixed price commitments were usually high (the customer had to guarantee to spend X dollars). Such relationships left much to be desired in terms of responsiveness to change.

As computers became cheaper, companies increasingly acquired their own data centers, investing large amounts of capital in high-technology spaces with extensive power and cooling infrastructure. This was the trend through the late 1980s to about 2010, when cloud computing started to provide a realistic alternative with true “pay as you go” pricing, analogous to electric metering.

Figure 27. Initial statement of cloud computing

The idea of running IT completely as a utility service goes back at least to 1965 and the publication of The Challenge of the Computer Utility, by Douglas Parkhill (see Initial statement of cloud computing). While the conceptual idea of cloud and utility computing was foreseeable 50 years ago, it took many years of hard-won IT evolution to support the vision. Reliable hardware of exponentially increasing performance, robust open-source software, Internet backbones of massive speed and capacity, and many other factors converged towards this end.

However, people store data — often private — on computers. In order to deliver compute as a utility, it is essential to segregate each customer’s workload from all others. This is called multi-tenancy. In multi-tenancy, multiple customers share physical resources that provide the illusion of being dedicated.

The phone system has been multi-tenant ever since they got rid of party lines. A party line was a shared line where anyone on it could hear every other person.

In order to run compute as a utility, multi-tenancy was essential. This is different from electricity (but similar to the phone system). As noted elsewhere, one watt of electric power is like any other and there is less concern for leakage or unexpected interactions. People’s bank balances are not encoded somehow into the power generation and distribution infrastructure.
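The core idea of multi-tenancy, shared physical resources that appear dedicated, can be illustrated with a toy data store. This is a minimal sketch with hypothetical tenant names and a made-up class; real cloud isolation also involves authentication, hypervisor and network boundaries, and much more.

```python
class SharedStore:
    """One physical store shared by many tenants (a hypothetical illustration)."""

    def __init__(self):
        self._data = {}   # single shared structure: {(tenant_id, key): value}

    def put(self, tenant_id, key, value):
        self._data[(tenant_id, key)] = value

    def get(self, tenant_id, key):
        # Every lookup is scoped by tenant, so one tenant cannot name
        # another tenant's records even though the storage is shared.
        return self._data.get((tenant_id, key))

    def keys(self, tenant_id):
        return sorted(k for (t, k) in self._data if t == tenant_id)


store = SharedStore()
store.put("bank-a", "balance", 100)
store.put("bank-b", "balance", 999)

print(store.get("bank-a", "balance"))   # 100
print(store.keys("bank-b"))             # ['balance']
print(store.get("bank-a", "secret"))    # None
```

Both tenants use the same key, "balance," in the same underlying dictionary, yet each sees only its own value.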

Virtualization is necessary, but not sufficient for cloud. True cloud services are highly automated, and most cloud analysts will insist that if VMs cannot be created and configured in a completely automated fashion, the service is not true cloud. This is currently where many in-house “private” cloud efforts struggle; they may have virtualization, but struggle to make it fully self-service.

Cloud services have refined into at least three major models:

  • Infrastructure as a service

  • Platform as a service

  • Software as a service

Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls) [197].
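One way to keep the three models straight is to tabulate who controls what. The sketch below paraphrases the definitions quoted above into a small lookup; the layer names are this example's own shorthand, and the "possibly" and "limited" qualifiers in the definitions are flattened for simplicity.

```python
# What the consumer controls under each model, paraphrased from the
# definitions quoted above. The layer names are this sketch's shorthand.
CONSUMER_CONTROLS = {
    "SaaS": {"user-specific application settings"},
    "PaaS": {"deployed applications", "application-hosting settings"},
    "IaaS": {"deployed applications", "operating systems", "storage",
             "select networking components"},
}


def who_manages(model, layer):
    """Return 'consumer' or 'provider' for a layer under a given service model."""
    return "consumer" if layer in CONSUMER_CONTROLS[model] else "provider"


for model in ("SaaS", "PaaS", "IaaS"):
    print(model, "-> operating systems managed by",
          who_manages(model, "operating systems"))
```

Moving from SaaS to PaaS to IaaS, the set of things the consumer controls (and is therefore responsible for) grows.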

There are cloud services beyond those listed above (e.g., Storage as a Service). Various platform services have become extensive on providers such as Amazon, which offers load balancing, development pipelines, various kinds of storage, and much more.

Traditional managed services are sometimes called “your mess for less.” With cloud, you have to “clean it up first.”

2.4.4. Containers and looking ahead

At this writing, two major developments in cloud computing are prominent:

  • The combination of cloud computing with operating-system-level virtualization (containers), including technologies such as Docker

    • Containers are lighter weight than VMs

      • Virtualized Guest OS: Seconds to instantiate

      • Container: Milliseconds (!)

    • Containers share the host’s OS kernel, so a container must target the same OS as the host

  • AWS Lambda, “a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.” More broadly, the name “serverless” has been applied to this style of computing.

It’s recommended you at least scan the links provided.

Eventually, scale matters. As your IT service’s usage increases, you will inevitably find that you need to start caring about technical details such as storage and network architecture. The implementation decisions made by you and your service providers may become inefficient for the particular “workload” your product represents, and you will need to start asking questions. The brief technical writeup, Latency Numbers Every Programmer Should Know, can help you start thinking about these issues.

2.5. Infrastructure as code

2.5.1. Why infrastructure as code?

Infrastructure as code is an approach to infrastructure automation based on practices from software development. It emphasizes consistent, repeatable routines for provisioning and changing systems and their configuration. Changes are made to definitions and then rolled out to systems through unattended processes that include thorough validation. The premise is that modern tooling can treat infrastructure as if it were software and data. This allows people to apply software development tools such as Version Control Systems (VCS), automated testing libraries, and deployment orchestration to manage infrastructure. It also opens the door to exploit development practices such as Test-Driven Development (TDD), Continuous Integration (CI), and Continuous Delivery (CD) [191].
— Kief Morris
Infrastructure as Code: Managing Servers in the Cloud

So, what is infrastructure as code?

As cloud infrastructures have scaled, there has been an increasing need to configure many servers identically. Auto-scaling (adding more servers in response to increasing load) has become a widely used strategy as well. Both call for increased automation in the provisioning of IT infrastructure. It is simply not possible for a human being to be hands on at all times in configuring and enabling such infrastructures, so automation is called for.

In years past, infrastructure administrators relied on the ad hoc issuance of commands either at an operations console or via a GUI-based application. Shell scripts might be used for various repetitive processes, but administrators by tradition and culture were empowered to issue arbitrary commands to alter the state of the running system directly.

The following passage from The Phoenix Project (by Gene Kim, Kevin Behr, and George Spafford) captures some of the issues. The speaker is Wes, the infrastructure manager, who is discussing a troubleshooting scenario:

“Several months ago, we were three hours into a Sev 1 outage, and we bent over backward not to escalate to Brent. But eventually, we got to a point where we were just out of ideas, and we were starting to make things worse. So, we put Brent on the problem.” He shakes his head, recalling the memory, “He sat down at the keyboard, and it’s like he went into this trance. Ten minutes later, the problem is fixed. Everyone is happy and relieved that the system came back up. But then someone asked, ‘How did you do it?’ And I swear to God, Brent just looked back at him blankly and said, ‘I have no idea. I just did it.’” [153 p. 116].

Obviously, “close your eyes and go into a trance” is not a repeatable process. It is not a procedure or operation that can be archived and distributed across multiple servers. So, shell scripts or more advanced forms of automation are written, and increasingly all actual server configuration is based on such pre-developed specifications. It is becoming more and more rare for a systems administrator to actually “log in” to a server and execute configuration-changing commands in an ad hoc manner (as Brent did).

In fact, because virtualization is becoming so powerful, servers increasingly are destroyed and rebuilt at the first sign of any trouble. In this way, it is certain that the server’s configuration is as intended. This again is a relatively new practice.

Previously, because of the expense and complexity of bare-metal servers, and the cost of having them offline, great pains were taken to fix troubled servers. Systems administrators would spend hours or days troubleshooting obscure configuration problems, such as residual settings left by removed software. Certain servers might start to develop “personalities.” Industry practice has changed dramatically here since around 2010. See the "Cattle not pets?" sidebar.

2.5.2. A simple infrastructure as code example

Figure 28. Simple directory/file structure script

Note: the below part is illustrative only, and is not intended as a lab. The associated lab for this book goes into depth on these topics.

In presenting infrastructure as code at its simplest, we will start with the concept of a shell script. While this is not a deep Linux book (there are many others out there, starting with the excellent O’Reilly lineup), some basic technical literacy is assumed in this book. Consider the following set of commands:

$ mkdir foo bar
$ cd foo
$ touch x y z
$ cd ../bar
$ touch a b c

What does this do? It tells the computer:

  1. Create (mkdir) two directories, one named foo and one named bar

  2. Move (cd) to the one named foo

  3. Create (touch) three files, named x, y, and z

  4. Move to the directory named bar

  5. Create three blank files, named a, b, and c

If you find yourself (with the appropriate permissions) at a UNIX or Linux command prompt, and run those commands, you will wind up with a configuration that could be visualized as in Simple directory/file structure script. (If you don’t understand this, you should probably spend a couple hours with a Linux tutorial).

Configuration, you ask? Something this trivial? Yes, directory and file layouts count as configuration and in some cases are critical. Now, what if we take that same set of commands, and put them in a text file thus:

mkdir foo bar
cd foo
touch x y z
cd ../bar
touch a b c

We might name that file, set its permissions correctly, and run it (so that the computer executes all the commands for us, rather than us running them one at a time at the console). If we did so in an empty directory, we’d again wind up with that same configuration. (If we did it in a directory already containing foo and bar directories, we’d get errors. More on this to come).
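To make the rerun problem concrete, the same steps can be sketched in Python. This is an illustration under assumptions, not part of the original example: the function names `build_once` and `build_rerunnable` and the `root` parameter are hypothetical. Note one difference from the shell script: the shell reports the mkdir errors and carries on, while the first function below stops at the first error.

```python
import os
from pathlib import Path


def build_once(root: Path) -> None:
    """Literal translation of the script: fails if foo or bar already exist."""
    for directory in ("foo", "bar"):
        os.mkdir(root / directory)  # raises FileExistsError on a second run
    for name in ("x", "y", "z"):
        (root / "foo" / name).touch()
    for name in ("a", "b", "c"):
        (root / "bar" / name).touch()


def build_rerunnable(root: Path) -> None:
    """Same end result, but safe to run again on an existing layout."""
    for directory in ("foo", "bar"):
        os.makedirs(root / directory, exist_ok=True)  # comparable to `mkdir -p`
    for name in ("x", "y", "z"):
        (root / "foo" / name).touch()  # does not fail if the file exists
    for name in ("a", "b", "c"):
        (root / "bar" / name).touch()
```

Calling `build_rerunnable` twice on the same directory leaves the same two directories and six files each time, which is the property that makes a script safe to run on a machine whose state is not known in advance.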

The state of the art in infrastructure configuration is not to use shell scripts at all but rather policy-based infrastructure management approaches, which we discuss subsequently.

This may be familiar material to some of you, including the fact that beyond creating directories and files we can use shell scripts to create and destroy virtual servers, install and remove software, set up and delete users, check on the status of running processes, and much more.

Sophisticated infrastructure as code techniques are an essential part of modern site reliability engineering practices such as those used by Google. Auto-scaling, self-healing systems, and fast deployments of new features all require that infrastructure be represented as code for maximum speed and reliability of creation. For further information and practical examples, see Infrastructure as Code by Kief Morris [191].

Let’s return to our file. It’s valuable. It documents our intentions for how this configuration should look. We can reliably run it on thousands of machines, and it will always give us two directories and six files. In terms of the previous section, we might choose to run it on every new server we create. We want to establish it as a known resource in our technical ecosystem. This is where version control and the broader concept of configuration management come in.

Cattle not pets?

In earlier times, servers (that is, computers managed on a distributed network) were usually configured without virtualization. They arrived (carefully packed on pallets) from the manufacturer unconfigured, and would be painstakingly “built” by the systems administrator: the OS would be compiled and installed, key software packages (such as Java) installed, and then the organization’s custom software installed.

At best, the systems administrators, or server engineers, might have written guidelines, or perhaps some shell scripts, that would be run on the server to configure it in a semi-consistent way. But that documentation would often be out of date, the scripts would be unique to a given administrator, and there would be great reluctance to “rebuild the box” — that is, to delete everything on it and do a “clean re-install.” Instead, if there were problems, the administrator would try to fix the server by going in and adjusting particular settings (typically by changing configuration files and restarting services), or deleting software packages and re-installing them.

The problem with this is that modern computing systems are so complex that deleting software can be difficult. For example, if the un-install process fails in some way, the server can be left in a compromised state. Similarly, one-time configuration adjustments made to one server mean that it may be inconsistent with similar devices, and this can cause problems. For example, if the first systems administrator is on vacation, their substitute may expect the server to be configured in a certain way and make adjustments that have unexpected effects. Or the first systems administrator may forget exactly what it was they did. Through such practices, servers would start to develop personalities, because their configurations were inconsistent.

As people started to work more and more with virtualization, they realized it was easier to rebuild virtual servers from scratch, rather than trying to fix them. Automated configuration management tools helped by promoting a consistent process for rebuilding. Randy Bias, noting this, put forth the provocative idea that “servers are cattle, not pets” [29]. That is, when a pet is sick, one takes it to the vet, but a sick cow might simply be put to death.

A more compassionate image might be to say that “servers are fleet vehicles, not collectible cars” (see Collectible car versus fleet vehicles [17]). The cattle metaphor also overlooks the fact that large animal veterinarians are routinely employed in the cattle industry.

Figure 29. Collectible car versus fleet vehicles

2.6. Configuration management: the basics

Configuration management is, and has always been, a critically important practice in digital systems. How it is performed has evolved over time. At this stage in our journey, we are a one- or two-person startup, working with digital artifacts such as our example discussed in the previous section.

One or two people can achieve an impressive amount with modern digital platforms. But, the work is complex. Tracking and controlling your work products as they evolve through change after change is important from day one of your efforts. It’s not something you want to put off to “later when I have time.” This applies to computer code, configurations, and increasingly, documentation, which is often written in a lightweight markup language like Markdown or Asciidoc. In terms of infrastructure, configuration management requires three capabilities:

  • The ability to back up or archive a system’s operational state (in general, not including the data it is processing; that is a different concern). Taking the backup should not require taking the system down.

  • The ability to compare two versions of the system’s state and identify differences.

  • The ability to restore the system to a previously archived operational state.
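The three capabilities above (archive, compare, restore) can be sketched for the simplest possible “system,” a directory of files. This is a minimal illustration under assumptions: the function names `snapshot`, `compare`, and `restore` are hypothetical, file contents are held in memory, and empty directories are not tracked. Real tools handle far more.

```python
from pathlib import Path


def snapshot(root: Path) -> dict:
    """Archive the state: map each file's relative path to its content."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(old: dict, new: dict) -> dict:
    """Identify differences between two archived states."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def restore(root: Path, archived: dict) -> None:
    """Make the files under root match a previously archived state."""
    for rel in snapshot(root).keys() - archived.keys():
        (root / rel).unlink()  # remove files that were not in the archive
    for rel, content in archived.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
```

Taking a snapshot only reads files, so in this sketch the “system” does not need to be taken down to be archived.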

In this section, we will discuss the following topics:

  • Version control

  • Source control

  • Package management

  • Deployment management

  • Configuration management

and their relationships.

2.6.1. What is version control?

In software development, version control is the foundation of every other Agile technical practice. Without version control, there is no build, no TDD, no continuous integration [11 p. 6013].
— Andrew Clay Shafer
Web Operations: Keeping the Data On Time

The Agile Alliance indicates “version control” as one of the four foundational areas of Agile [8], along with the team, iterative development, and incremental development. Why is this? Version control is critical for any kind of system with complex, changing content, especially when many people are working on that content. Version control provides the capability of seeing the exact sequence of a complex system’s evolution and isolating any particular moment in its history or providing detailed analysis on how two versions differ. With version control, we can understand what changed and when – which is essential to coping with complexity.

While version control was always deemed important for software artifacts, it has only recently become the preferred paradigm for managing infrastructure state as well. Because of this, version control is possibly the first IT management system you should acquire and implement (perhaps as a cloud service, such as GitHub).

In recent years, version control has increasingly distinguished between source control, the management of human-understandable symbolic (text) files, and package management, the management of binary files (see Types of version control and Configuration management and its components below). It is also important to understand what versions are installed on what computers; this can be termed “deployment management.” (With the advent of containers, this is a particularly fast-changing area).

Figure 30. Types of version control

Version control works like an advanced file system with a memory. (Actual file systems that do this are called versioning file systems). It can remember all the changes you make to its contents, tell you the differences between any two versions, and also bring back the version you had at any point in time. Version control is important — but how important? Survey research presented in the annual State of DevOps report indicates that version control is one of the most critical practices associated with high performing IT organizations [38]. Nicole Forsgren [94] summarizes the practice of version control as:

  • Our application code is in a version control system

  • Our system configurations are in a version control system

  • Our application configurations are in a version control system

  • Our scripts for automating build and configuration are in a version control system

Source control

Figure 31. Source control

Digital systems start with text files, e.g., those encoded in ASCII or Unicode. Text editors create source code, scripts, and configuration files. These will be transformed in defined ways (e.g., by compilers and build tools) but the human-understandable end of the process is mostly based on text files. In the previous section, we described a simple script that altered the state of a computer system. We care very much about when such a text file changes. One wrong character can completely alter the behavior of a large, complex system. Therefore, our configuration management approach must track to that level of detail.

Source control is at its most powerful when dealing with textual data. It is less useful in dealing with binary data, such as image files. Text files can be analyzed for their differences in an easy-to-understand way (see Source control). If I change “abc” to “abd,” then it is clear that the third character has been changed from “c” to “d.” On the other hand, if I take a picture (e.g., as a *.png file) and alter one pixel, and compare the resulting before and after binary files in terms of their data, it would be more difficult to understand what had changed. I might be able to tell that they are two different files easily, but they would look very similar, and the difference in the binary data might be difficult to understand.
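The “abc” to “abd” comparison can be sketched with Python’s standard difflib module. The helper names `changed_lines` and `changed_chars` are hypothetical; source control tools use their own diff implementations, so this only illustrates the idea.

```python
import difflib


def changed_lines(before: str, after: str) -> list:
    """Return the removed (-) and added (+) lines of a unified diff."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""))
    body = diff[2:]  # skip the two file-header lines ("---" and "+++")
    return [line for line in body if line.startswith(("+", "-"))]


def changed_chars(before: str, after: str) -> list:
    """Return (zero-based position, old text, new text) for each replaced run."""
    matcher = difflib.SequenceMatcher(a=before, b=after)
    return [
        (i1, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


print(changed_lines("abc\n", "abd\n"))  # ['-abc', '+abd']
print(changed_chars("abc", "abd"))      # [(2, 'c', 'd')]
```

Position 2 is zero-based, so it corresponds to the third character.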

Here is a detailed demonstration, using the command line in Ubuntu Linux. (Don’t worry, we will explain what is going on).

In the sequence below, what you see after the “$” sign is what is being typed. If there is no “$” sign, it is what the system is saying in response.

First, we create a directory (similar to the example script):

~$ mkdir tmpgit

Then, we navigate to it:

~$ cd tmpgit/

And activate git source control:

~/tmpgit$ git init
Initialized empty Git repository in /home/char/tmpgit/.git/

We create a simple program:

~/tmpgit$ echo 'print "hello world!";' >

And run it:

~/tmpgit$ python
hello world!

We stage it for source control:

~/tmpgit$ git add .

And commit it:

~/tmpgit$ git commit -m "first commit"
[master (root-commit) cabdbe3] first commit
1 file changed, 1 insertion(+)
create mode 100644

The file is now under version control. We can change our working copy and run it:

~/tmpgit$ echo 'print "hello universe!";' >
~/tmpgit$ python
hello universe!

When the output of the “echo” command is redirected with just one “>”, it replaces the data in the target file completely. So we have completely replaced “hello world!” with “hello universe!”

And — most critically — we can see what we have changed!

~/tmpgit$ git diff
diff --git a/ b/
index 0ecbd83..a203522 100644
--- a/
+++ b/
@@ -1 +1 @@
-print "hello world!";
+print "hello universe!";

Notice the “-” (minus) sign before the statement 'print "hello world!";', which means that line has been deleted. The “+” (plus) sign before 'print "hello universe!";' means that line has been added.

We can restore the original file (note that this eradicates the working change we made!)

~/tmpgit$ git checkout .
~/tmpgit$ python
hello world!

If you have access to a computer, try it! (You will need to install git, and if you are on Windows you should use WSL, the Windows Subsystem for Linux).

In comparison, the following are two 10x10 gray-scale bitmap images being edited in the Gimp image editor. They are about as simple as you can get. Notice (in Two bitmaps) that they are slightly different.

Figure 32. Two bitmaps

If we save these in the Portable Network Graphics (*.png) format, we can see they are different sizes (242k versus 239k). But if we open them in a binary editor it is very difficult to understand how they differ (compare First file binary data with Second file binary data).

Figure 33. First file binary data
Figure 34. Second file binary data

Even if we analyzed the differences, we would need to know much about the .png format in order to understand how the two images differ. We can still track both versions of these files, of course, with the proper version control. But again, binary data is not ideal for source control tools like git.
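A small sketch can show why a byte-level comparison of such files is hard to read. PNG stores its pixel data DEFLATE-compressed, and here Python’s zlib module stands in for that step. The 100-byte “image,” the gray levels, and the names `make_image` and `differing_offsets` are assumptions for illustration; this is not a real PNG encoder.

```python
import zlib


def differing_offsets(a: bytes, b: bytes) -> list:
    """Offsets at which two byte strings differ (over their common length)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def make_image(changed_pixel=None) -> bytes:
    """A 10x10 gray-scale image as 100 raw bytes, one per pixel."""
    pixels = bytearray([128] * 100)
    if changed_pixel is not None:
        pixels[changed_pixel] = 129  # nudge one pixel's gray level
    return bytes(pixels)


raw_before, raw_after = make_image(), make_image(changed_pixel=55)
packed_before = zlib.compress(raw_before)
packed_after = zlib.compress(raw_after)

# In the raw data, the change is exactly one byte at the pixel's position.
print(differing_offsets(raw_before, raw_after))  # [55]
# In the compressed data, the differing offsets no longer map to a pixel.
print(differing_offsets(packed_before, packed_after))
```

The second list tells us where the compressed streams differ, but interpreting it requires knowing how the compression works, which is the same obstacle the binary editor presents.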

The “commit” concept

Although implementation details may differ, all version control systems have some concept of “commit.” As stated in Version Control with Git [173]:

In Git, a commit is used to record changes to a repository … Every Git commit represents a single, atomic changeset with respect to the previous state. Regardless of the number of directories, files, lines, or bytes that change with a commit … either all changes apply, or none do. [emphasis added]

Why “atomic”? The word atomic derives from the ancient Greek atomos, which means “indivisible.” An atomic set of changes is either entirely applied, or entirely rejected. Atomicity is an important concept in computing, and in transaction processing in particular. If our user tries to move money from her savings to her checking account, two operations are required: (1) reduce savings and (2) increase checking. Either both need to succeed, or both need to fail. That is the classic definition of an “atomic” transaction. Version control commits should be atomic.
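The savings-to-checking example can be sketched with Python’s built-in sqlite3 module. A sqlite3 connection used as a context manager commits the transaction if the block succeeds and rolls it back if an exception is raised. The table layout, the starting balances, and the function names are assumptions made for illustration.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    """Create an in-memory database with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # no overdrafts
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("savings", 100), ("checking", 20)],
    )
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, amount: int) -> bool:
    """Move money from savings to checking; both updates apply, or neither."""
    try:
        with conn:  # commit on success, roll back on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'checking'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'savings'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:  # the CHECK constraint rejected an overdraft
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the second update fails because savings would go negative, the increase to checking that already ran inside the same transaction is rolled back, so the balances are left as they were.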

The concept of a version or source control “commit” is a rich foundation for IT management and governance. It both represents the state of the computing system and provides evidence of the human activity affecting it. As we will see in Chapter 3, the “commit” identifier is directly referenced by build activity, which in turn is referenced by the release activity, which is typically visible across the IT value chain.

Also, the concept of an atomic “commit” is essential to the concept of a “branch”: the creation of an experimental version, completely separate from the main version, so that various alterations can be tried without compromising the overall system stability. Starting at the point of a “commit,” the branched version also becomes evidence of human activity around a potential future for the system. In some environments, the branch is automatically created with the assignment of a requirement or story (again, more on this to come in Chapter 3). In other environments, the very concept of branching is avoided.

2.6.2. Package management

Implement version control for all production artifacts. [216]
— Puppet Labs 2015 State of DevOps report
Figure 35. Building software

Much if not most software, once created as some kind of text-based artifact suitable for source control, must be compiled and further organized into deployable assets, often called “packages” (see Building software).

In some organizations, it was once common for compiled binaries to be stored in the same repositories as source code (see Common version control). However, this is no longer considered a best practice. Source and package management are now viewed as two separate things (see Source versus package repos). Source repositories should be reserved for text-based artifacts whose differences can be made visible in a human-understandable way. Package repositories in contrast are for binary artifacts that can be deployed.

Figure 36. Common version control

Package repositories also can serve as a proxy to the external world of downloadable software. That is, they are a cache, an intermediate store of the software provided by various external or “upstream” sources. For example, developers may be told to download the approved Ruby on Rails version from the local package repository, rather than going to get the latest version, which may not be suitable for the environment.

Package repositories furthermore are used to enable collaboration between teams working on large systems. Teams can check in their built components into the package repository for other teams to download. This is more efficient than everyone always building all parts of the application from the source repository.

Figure 37. Source versus package repos

The boundary between source and package is not hard and fast, however. One sometimes sees binary files in source repositories, such as images used in an application. Also, when interpreted languages (such as JavaScript) are “packaged,” they still appear in the package as text files, perhaps compressed or otherwise incorporated into some larger containing structure.

2.6.3. Deployment management

Version control is an important part of the overall concept of configuration management. But configuration management also covers the matter of how artifacts under version control are combined with other IT resources (such as VMs) to deliver services. Configuration management and its components elaborates on Types of version control to depict the relationships.

Deployment basics

Resources in version control in general are not yet active in any value-adding sense. In order for them to deliver experiences, they must be combined with computing resources: servers (physical or virtual), storage, networking, and the rest, whether owned by the organization or leased as cloud services. The process of doing so is called deployment. Version control manages the state of the artifacts; meanwhile, deployment management (as another configuration management practice) manages the combination of those artifacts with the needed resources for value delivery.

Figure 38. Configuration management and its components

Imperative and declarative

Before we turned to source control, we looked at a simple script that changed the configuration of a computer. It did so in an imperative fashion. Imperative and declarative are two important terms from computer science.

A simple example of “declarative” versus “imperative”

Declarative: “Our refrigerator should always have a gallon of milk in it.”

Imperative: “Go out the door, take a right, take a left, go into the building with a big ‘SA’ on it, go in to the last aisle, take a left, go to the third case and take the first container on the fourth shelf from the bottom. Give money to the cashier and bring the container back home.”

In an imperative approach, we tell the computer specifically how we want to accomplish a task, e.g.:

  1. Create a directory

  2. Create some files

  3. Create another directory

  4. Create more files

Many traditional programming languages take an imperative approach. A script such as our example is executed line by line, i.e., it is imperative. In configuring infrastructure, scripting is in general considered “imperative,” but state of the art infrastructure automation frameworks are built using a “declarative,” policy-based approach, in which the object is to define the desired end state of the resource, not the steps needed to get there. With such an approach, instead of defining a set of steps, we simply define the proper configuration as a target, saying (in essence) that “this computer should always have a directory structure thus; do what you need to do to make it so and keep it this way.”

More practically, declarative approaches are used to ensure that the proper versions of software are always present on a system and that configurations such as Internet ports and security settings do not vary from the intended specification.

This is a complex topic, and there are advantages and disadvantages to each approach. (See “When and Where Order Matters” by Mark Burgess for an advanced discussion [43]). But policy-based approaches seem to have the upper hand for now.
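The declarative idea can be sketched in a few lines of Python: the desired directory structure is written down as data, and a single function does whatever is needed to make the machine match it. The names `DESIRED` and `converge` are hypothetical, and real policy-based tools are far more capable; this sketch only creates what is missing and does not remove anything extra.

```python
from pathlib import Path

# The declaration: what should exist, not how to create it.
DESIRED = {"foo": ["x", "y", "z"], "bar": ["a", "b", "c"]}


def converge(root: Path, desired: dict) -> list:
    """Bring root to the desired state; return the actions that were needed."""
    actions = []
    for directory, files in desired.items():
        dir_path = root / directory
        if not dir_path.is_dir():
            dir_path.mkdir(parents=True)
            actions.append(f"mkdir {directory}")
        for name in files:
            file_path = dir_path / name
            if not file_path.exists():
                file_path.touch()
                actions.append(f"touch {directory}/{name}")
    return actions
```

On an empty directory the function performs every step; on a directory that already matches the declaration it does nothing; and if one file has gone missing, it recreates only that file. Running it on a schedule is one simple way to “keep it this way.”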

2.7. Topics in IT infrastructure

This and the following chapters in this book will end with a “topics” section, in which current and specialized developments will be discussed.

2.7.1. Configuration management, version control, and metadata

Version control, in particular source control, is where we start to see the emergence of an architecture of IT management. It is in the source control system that we first start to see metadata emerge as an independent concern. Metadata is a tricky term that tends to generate confusion. The term “meta” implies a concept that is somehow self-referential, and/or operating at a higher level of abstraction. So,

  • A meta-discussion is a discussion about the discussion

  • Meta-cognition is cognition about cognition

  • Metadata (aka meta-data) is data about data

Some examples:

  • In traditional data management, metadata is the description of the data structures, especially from a business point of view. A database column might be named “CUST_L_NM,” but the business description or metadata would be “the last, family, or surname of the customer.”

  • In document management, the document metadata is the record of who created the document and when, when it was last updated, and so forth. Failure to properly sanitize document metadata has led to various privacy and data security-related issues.

  • In telephony, “data” is the actual call signal: the audio of the phone conversation, nowadays usually digitally encoded. Metadata, on the other hand, is all the information about the call: from whom to whom, when, how long, and so forth.

In computer systems, metadata can be difficult to isolate. Sometimes, computing professionals will speak of a “metadata” layer that may define physical database structures, data extracts, business process behavior, even file locations. The trouble is, from a computer’s point of view, a processing instruction is an instruction, and the prefix “meta” has no real meaning.

Because of this, this book favors a principle that metadata is by definition non-runtime. It is documentation, usually represented as structured or semi-structured data, but not usually a primary processing input or output. It might be “digital exhaust” — log files are a form of metadata. It is not executable. If it’s executable (directly or indirectly), it’s digital logic or configuration, plain and simple.

So what about our infrastructure as code example? The artifact (the configuration file, the script) is NOT metadata, because it is executable. But the source repository commit IS metadata. It has no meaning for the script. The dependency is one-way: without the artifact, the commit ID is meaningless, but the artifact is completely ignorant of the commit. The commit may become an essential data point for human beings trying to make sense of the state of a resource defined by that artifact. However, as Loeliger notes in Version Control with Git, the version control system:

…doesn’t care why files are changing. That is, the content of the changes doesn’t matter. As the developer, you might move a function from here to there and expect this to be handled as one unitary move. But you could, alternatively, commit the removal and then later commit the addition. Git doesn’t care. It has nothing to do with the semantics of what is in the files [173].

In this microcosm, we see the origins of IT management. It is not always easy to apply this approach in practice. There can be edge cases. Documentation stored in version control needs to be seen as “executable” in the context of the business process. But, it also does not require or “know about” the commit. Ultimately, the concept of metadata provides a basis for distinguishing the management of IT from the actual application of IT.
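The one-way dependency between commit and artifact can be sketched as a toy commit record. The names `CommitRecord`, `record_commit`, and `describes` are hypothetical, and this is not Git’s actual object format; it only illustrates that the metadata points at the artifact while the artifact stays unaware of the metadata.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitRecord:
    """Metadata about an artifact: who, why, and a digest of its content."""
    artifact_digest: str
    author: str
    message: str


def record_commit(artifact: bytes, author: str, message: str) -> CommitRecord:
    # The record refers to the artifact by digest. The artifact itself is not
    # modified and carries no reference back to the record.
    return CommitRecord(hashlib.sha256(artifact).hexdigest(), author, message)


def describes(record: CommitRecord, artifact: bytes) -> bool:
    """True if this record was made for exactly these bytes."""
    return record.artifact_digest == hashlib.sha256(artifact).hexdigest()
```

The record treats the script as opaque bytes: it can say that the content changed, but nothing about what the change means, which echoes the point quoted above.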

2.8. Conclusion

Books and articles are written every week about some aspect of IT and digital infrastructure. We have only scratched the surface in our discussions of computing, network, and storage, and how they have become utility services in the guise of cloud. Software as a Service, Platform as a Service, Infrastructure as a Service: each represents a different approach. For the most part, we will focus on infrastructure as a service in the remainder of this book, on the assumption that your digital product is unique enough to need the broad freedom this provides.

Digital infrastructure is a rich and complex topic, and a person can spend their career specializing in it. For this class, we always want to keep the key themes of Chapter 1 in mind: why do we want it? How does it provide for our needs, contribute to our enjoyment?

There are numerous sources available to you to learn Linux, Windows, scripting, policy-based configuration management of infrastructure, and source control. Competency with source control is essential to your career, and you should devote some serious time to it. You can find many references to source control on the Internet and in books such as Pro Git by Scott Chacon and Ben Straub [56]. Since source control is the most important foundational technology for professional IT, whether in a garage start-up or in the largest organizations, you need to have a deep familiarity with it.

We will discuss further infrastructure issues in Chapter 6, including designing systems for failure and availability.

2.8.1. Discussion questions

  1. Consider your product idea from the previous chapter. Does it have any particular infrastructure needs you can identify, having read this chapter?

  2. Your personal laptop or smartphone is infrastructure. What issues have you had with it? Have you had to change its configuration?

  3. Would you prefer to build your product on an IaaS or PaaS platform (see the cloud models)? Is there an SaaS product that might be of service? (If so, what is your value-adding idea?)

  4. Compare the costs of cloud to owning your own server. Assume you buy a server inexpensively on eBay and put it in your garage. What other factors might you consider before doing this?

2.8.2. Research & practice

  1. Research cloud providers and recommend the one on which you would prefer to build your new product.

  2. Interview someone who has worked in a data center as to what a “day in the life” is, and how it’s changed for them.

  3. Install Vagrant and VirtualBox on your personal device and bring up a Linux VM. Run a Linux tutorial.

  4. Configure a declarative infrastructure manager (Chef, Puppet, Ansible, or SaltStack) to control your Vagrant VMs. Use Git to control your configurations.

  5. Install Docker and run a tutorial on it.

3. Application Delivery

Instructor’s note

This book defers the “theory” of Agile to Parts II and III. So, this chapter presents Agile and related concepts like iterative development without examining the underlying principles. Many students increasingly come in with some exposure to cloud and Agile methods at least, and Chapters 2 and 3 will seem comfortable and familiar. In Chapter 4 and on we challenge them with why Agile works.

3.1. Introduction

Now that we have some idea of IT value (and how we might turn it into a product), and have decided on some infrastructure, we can start building.

IT systems that directly create value for non-technical users are usually called “applications,” or sometimes “services.” As discussed in Chapter 1, they enable value experiences in areas as diverse as consumer banking, entertainment and hospitality, and personal transportation. In fact, it is difficult to think of any aspect of modern life untouched by applications. (This overall trend is sometimes called digital transformation [281]).

Applications are built from software, the development of which is a core concern for any IT-centric product strategy. Software development is a well-established career, and a fast-moving field with new technologies, frameworks, and schools of thought emerging weekly, it seems. This chapter will cover applications and the software lifecycle, from requirements through construction, testing, building, and deployment to modern production environments. It also discusses earlier approaches to software development, the rise of the Agile movement, and its current manifestation in the practice of DevOps.

3.1.1. Chapter outline

  • Introduction

  • Learning objectives

  • Basics of applications and their development

    • Defining “application”

    • History of applications and application software

    • Applications and infrastructure: the old way

    • Applications and infrastructure today

  • From waterfall to Agile

  • The DevOps challenge

  • Describing system intent

  • Test-driven development and refactoring

    • Test-driven development

    • Refactoring

  • Continuous integration

  • Continuous deployment

  • Application development topics

  • Conclusion

    • Discussion questions

    • Research & practice

    • Further reading

3.1.2. Learning objectives

  • Understand history and importance of “application” concept

  • Define “Agile” in terms of software development

  • Identify key Agile practices

  • Identify the major components of an end-to-end DevOps delivery pipeline

3.2. Basics of applications and their development

3.2.1. Defining “application”

In keeping with our commitment to theory and first principles, we use an engineering definition of “application.” To an electrical engineer, a toaster or a light bulb is an “application” of electricity (hence the term “appliance”). Similarly, a customer relationship management system, or a Web video on-demand service, are “applications” of the computing infrastructure we studied in the last chapter.

3.2.2. History of applications and application software

old computer room
Figure 39. The ENIAC, “programmed” by cable reconfiguration.

Without applications, computers would be merely a curiosity. Electronic computers were first “applied” to military needs for codebreaking and artillery calculations. After World War II, ex-military officers like Edmund Berkeley at Prudential realized computers' potential if “applied” to problems like insurance record keeping [9]. At first, such systems required actual manual configuration (see Figure 39 [18]), or painstaking programming in complex, tedious, and unforgiving low-level programming languages. As the value of computers became obvious, investment was made in making programming easier through more powerful languages.

The history of software is well documented. Low-level languages (binary and assembler) were increasingly replaced by higher-level languages such as FORTRAN, COBOL, and C. Proprietary machine/language combinations were replaced by open standards and compilers that could take one kind of source code and build it for different hardware platforms. Many languages followed, such as Java, Visual Basic, and JavaScript. Extensive middleware was developed to ease programming, enable communication across networks, and standardize common functions.

Today, we have extensive frameworks like Apache Struts, Spring, and Ruby on Rails, along with interpreted languages that take much of the friction out of building and testing code. But even today, the objective remains to create a binary executable file or files that computer hardware can “execute,” that is, turn into a computing-based value experience, mediated through devices such as workstations, laptops, smartphones, and their constituent components.

3.2.3. Applications and infrastructure: the old way

In the first decades of computing, any significant application of computing power to a new problem typically required its own infrastructure, often designed specifically for the problem. While awareness existed that computers, in theory, could be “general-purpose,” in practice, this was not so easy. Military/aerospace needs differed from corporate information systems, which differed from scientific and technical uses. And major new applications required new compute capacity.

Take for example when a large organization in 1998 decided to replace its mainframe Human Resources system due to Y2K concerns. Such a system might need to support several thousand users around the world. At that time, PeopleSoft was a frequent choice of software. Implementing such a system was often led by consulting firms such as Deloitte or Andersen Consulting (where one of the authors worked). A typical PeopleSoft package implementation would include:

  • PeopleSoft software, including the PeopleTools framework and various modules written in the framework (e.g., the well-regarded PeopleSoft HR system)

  • Oracle database software

  • AT&T “Tuxedo” transaction manager

  • Autosys job scheduler

  • HP-UX operating system

  • HP-UX servers, perhaps 20 or so, comprising various “environments” including a production “cluster” consisting of application and database servers

  • EMC storage array

  • Various ancillary software and hardware: management utilities and scripts, backup, networking, etc.

  • Customization of the PeopleSoft HR module and reports by hired consultants, to meet the requirements of the acquiring organization

The software and hardware needed to be specified in keeping with requirements, and acquiring them required lengthy negotiations, logistics, and installation processes. Such a project from inception to production might take 9 months (on the short side) to 18 or more months.

Hardware was dedicated and rarely re-used. The HP servers compatible with PeopleSoft might have few other applications if they became surplus. In fact, PeopleSoft would “certify” the infrastructure for compatibility. Upgrading the software might require also upgrading the hardware. In essence, this sort of effort had a strong component of systems engineering, as designing and optimizing the hardware component was a significant portion of the work.

3.2.4. Applications and infrastructure today

Today, matters are quite different, and yet echoes of the older model persist. As mentioned, ANY compute workload is going to incur economic cost. However, capacity is being used more efficiently and can be provisioned on-demand. Currently, it is a significant application indeed that merits its own systems engineering.

To “provision” in an IT sense means to make the needed resources or services available for a particular purpose or consumer.

Instead, a variety of mechanisms (as covered in the previous chapter’s discussion of cloud systems) enable the sharing of compute capacity, the raw material of application development. The fungibility and agility of these mechanisms increase the velocity of creation and evolution of application software. For small and medium sized applications, the overwhelming trend is to virtualize and run on commodity hardware and operating systems. Even 15 years ago, non-trivial websites with database integration would be hosted by internal PaaS clusters at major enterprises (for example, Microsoft ASP, COM+, and SQL Server clusters could be managed as multi-tenant; the author wrote systems on such a platform).

The general-purpose capabilities of virtualized public and private cloud today are robust. Assuming the organization has the financial capability to purchase computing capacity in anticipation of use, it can be instantly available when the need surfaces. Systems engineering at the hardware level is more and more independent of the application lifecycle; the trend is towards providing compute as a service, carefully specified in terms of performance, but NOT particular hardware. Hardware physically dedicated to a single application is rarer, and even the largest engineered systems are more standardized so that they may one day benefit from cloud approaches. Application architectures have also become much more powerful. Interfaces (interaction points for applications to exchange information with each other, generally in an automated way) are increasingly standardized. Applications are designed to scale dynamically with the workload and are more stable and reliable than in years past.

In the next section, we will discuss how the practices of application development have evolved to their current state.

3.3. From waterfall to Agile

This is not a book on software development per se, nor on Agile development. There are hundreds of books available on those topics. But, no assumption is made that the reader has any familiarity with these topics, so some basic history is called for. (If you have taken an introductory course in software engineering, this will likely be a review).

For example, when a new analyst would join the systems integrator Andersen Consulting (now Accenture) in 1998, they would be schooled in something called the Business Integration Method (BIM). The BIM was a classic expression of what is called “waterfall development.”

What is waterfall development? It is a controversial question. Winston Royce, the original theorist who coined the term, named it in order to critique it [231]. Military contracting and management consultancy practices, however, embraced it, as it provided an illusion of certainty. The fact that computer systems until recently included a substantial component of hardware systems engineering may also have contributed.

Waterfall development as a term has become associated with a number of practices. The original illustration was similar to Figure 40 [19]:

waterfall progression
Figure 40. Waterfall lifecycle

First, requirements need to be extensively captured and analyzed before the work of development can commence. So, the project team would develop enormous spreadsheets of requirements, spending weeks on making sure that they represented what “the customer” wanted. The objective was to get the customer’s signature. Any further alterations could be profitably billed as “change requests.”

The analysis phase was used to develop a more structured understanding of the requirements, e.g., conceptual and logical data models, process models, business rules, and so forth.

In the design phase, the actual technical platforms would be chosen; major subsystems determined with their connection points, initial capacity analysis (volumetrics) translated into system sizing, and so forth. (Perhaps hardware would not be ordered until this point, leading to issues with developers now being “ready,” but hardware not being available for weeks or months yet).

Only AFTER extensive requirements, analysis, and design would coding take place (implementation). Furthermore, there was a separation of duties between developers and testers. Developers would write code and testers would try to break it, filing bug reports that the developers would then need to respond to.

Another model sometimes encountered at this time was the V-model (see V-model [20]). This was intended to better represent the various levels of abstraction operating in the systems delivery activity. Requirements operate at various levels, from high-level business intent through detailed specifications. It is all too possible that a system is “successfully” implemented at lower levels of specification, but fails to satisfy the original higher-level intent.

Figure 41. V-model

The failures of these approaches at scale are by now well known. Large distributed teams would wrestle with thousands of requirements. The customer would “sign off” on multiple large binders, with widely varying degrees of understanding of what they were agreeing to. Documentation became an end in itself and did not meet its objectives of ensuring continuity if staff turned over. The development team would design and build extensive product implementations without checking the results with customers. They would also defer testing that various component parts would effectively interoperate until the very end of the project, when the time came to assemble the whole system.

Failure after failure of this approach is apparent in the historical record [107]. Recognition of such failures, dating from the 1960s, led to the perception of a “software crisis.” (However, many large systems were effectively constructed and operated during the “waterfall years,” and there are reasonable criticisms of the concept of a “software crisis” [34].)

Successful development efforts existed back to the earliest days of computing (otherwise, we probably wouldn’t have computers, or at least not so many). Many of these successful efforts used prototypes and other means of building understanding and proving out approaches. But highly publicized failures continued, and a substantial movement against “waterfall” development started to take shape.

By the 1990s, a number of thought leaders in software development had noticed some common themes with what seemed to work and what didn’t. Kent Beck developed a methodology known as “eXtreme Programming” (XP) [19]. XP pioneered the concepts of iterative, fast-cycle development with ongoing stakeholder feedback, coupled with test-driven development, ongoing refactoring, pair programming, and other practices. (More on the specifics of these in the next section).

Various authors assembled in 2001 and developed the Agile Manifesto [6], which further emphasized an emergent set of values and practices:

The Agile Manifesto

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

  • Individuals and interactions over processes and tools

  • Working software over comprehensive documentation

  • Customer collaboration over contract negotiation

  • Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.

The Manifesto authors further stated:

We follow these principles:

  • Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

  • Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.

  • Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.

  • Business people and developers must work together daily throughout the project.

  • Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.

  • The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

  • Working software is the primary measure of progress.

  • Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.

  • Continuous attention to technical excellence and good design enhances agility.

  • Simplicity--the art of maximizing the amount of work not done--is essential.

  • The best architectures, requirements, and designs emerge from self-organizing teams.

  • At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.

Agile methodologists emphasize that software development is a learning process. In general, learning (and the value derived from it) is not complete until the system is functioning to some degree of capability. As such, methods that postpone the actual, integrated verification of the system increase risk. Alistair Cockburn visualizes risk as the gap between the ongoing expenditure of funds and the lag in demonstrating valuable learning (see Figure 42, Waterfall risk [21]).

waterfall risk curve
Figure 42. Waterfall risk

Because Agile approaches emphasize delivering smaller batches of complete functionality, this risk gap is minimized (see Figure 43, Agile risk [22]).

Agile risk curve
Figure 43. Agile risk

The Agile models for developing software aligned with the rise of cloud and Web-scale IT. As new customer-facing sites like Flickr, Amazon, Netflix, Etsy, and Facebook scaled to massive proportions, it became increasingly clear that waterfall approaches were incompatible with their needs. Because these systems were directly user-facing, delivering monetized value in fast-moving competitive marketplaces, they required a degree of responsiveness previously not seen in “back-office” IT or military-aerospace domains (the major forms that large scale system development had taken to date). We will talk more of product-centricity and the overall DevOps movement in the next section.

This new world did not think in terms of large requirements specifications. Capturing a requirement, analyzing and designing to it, implementing it, testing that implementation, and deploying the result to the end user for feedback became something that needed to happen at speed, with high repeatability. Requirements “backlogs” were (and are) never “done,” and increasingly were the subject of ongoing re-prioritization, without high-overhead project “change” barriers.

These user-facing, web-based systems integrate the software development lifecycle tightly with operational concerns. The sheer size and complexity of these systems required much more incremental and iterative approaches to delivery, as the system can never be taken offline for the “next major release” to be installed. New functionality is moved rapidly in small chunks into a user-facing, operational status, as opposed to previous models where vendors would develop software on an annual or longer version cycle, to be packaged onto media for resale to distant customers.

Contract software development never gained favor in the Silicon Valley web-scale community; developers and operators are typically part of the same economic organization. So, it was possible to start breaking down the walls between “development” and “operations,” and that is just what happened.

Large scale systems are complex and unpredictable. New features are never fully understood until they are deployed at scale to the real end user base. Therefore, large scale web properties also started to “test in production” (more on this in Chapter 6) in the sense that they would deploy new functionality to only some of their users. Rather than trying to increase testing in order to better understand things before deployment, these new firms accepted a seemingly higher level of risk in exposing new functionality sooner. (Part of their belief is that it actually is lower risk because the impacts are never fully understood in any event.)
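The idea of exposing new functionality to only some users can be sketched in a few lines. The following is a minimal illustration only, not how any of the firms named above implement it; the class name `RolloutSketch`, the method `isEnabled`, the feature name, and the hashing scheme are all assumptions made for this example.

```java
// Hypothetical sketch: expose a new feature to a fixed percentage of users.
final class RolloutSketch {

    // Returns true if this user falls inside the rollout percentage (0-100).
    // Hashing the feature name together with the user ID gives each user a
    // stable bucket, so the same user sees the same behavior on every request.
    static boolean isEnabled(String feature, String userId, int percent) {
        int bucket = Math.floorMod((feature + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }

    public static void main(String[] args) {
        int exposed = 0;
        for (int i = 0; i < 10_000; i++) {
            if (isEnabled("new-checkout", "user-" + i, 10)) {
                exposed++;
            }
        }
        // Only a fraction of the simulated users should see the new feature.
        System.out.println("Users exposed: " + exposed + " of 10000");
    }
}
```

Raising the percentage step by step, while watching how the system behaves, is one way such a rollout might be widened until every user has the new functionality.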

We’ll return to Agile and its various dimensions throughout the rest of this book. See [167] for a much more thorough history.

3.4. The DevOps challenge

Consider this inquiry by Mary and Tom Poppendieck:

How long would it take your organization to deploy a change that involved one single line of code? Do you deploy changes at this pace on a repeatable, reliable basis? [212 p. 92]

The implicit goal is that the organization should be able to change and deploy one line of code, from idea to production, and in fact, might want to do so on an ongoing basis. There is deep Lean/Agile theory behind this objective; a theory developed in reaction to the pattern of massive software failures that characterized IT in the first 50 years of its existence. (We’ll discuss some of the systems theory, including the concept of feedback, in the introduction to Part II and other aspects of Agile theory, including the ideas of Lean Product Development, in Parts II and III).

Achieving this goal is feasible but requires new approaches. Various practitioners have explored this problem, with great success. Key initial milestones included:

  • The establishment of “test-driven development” as a key best practice in creating software [19]

  • Duvall’s book “Continuous Integration” [86]

  • Allspaw & Hammond’s seminal “10 Deploys a Day” presentation describing Flickr [10]

  • Humble & Farley’s “Continuous Delivery” [128]

DevOps concerns
Figure 44. DevOps definition

“DevOps” is a broad term, encompassing product management, continuous delivery, team behaviors, and culture (see DevOps definition). Some of these topics will not be covered until Parts II and III in this book. At an execution level, the fundamental goal of moving smaller changes more quickly through the pipeline is a common theme. Other guiding principles include, “If it hurts, do it more frequently.” (This is in part a response to the poor practice, or antipattern, of deferring integration testing and deployment until those tasks are so big as to be unmanageable.) There is a great deal written on the topic of DevOps currently; the Humble/Farley book is recommended as an introduction. Let’s go into a little detail on some essential Agile/DevOps practices.

  • Test-driven development

  • Ongoing refactoring

  • Continuous integration

  • Continuous deployment

In our scenario approach, at the end of the last chapter, you had determined a set of tools for creating your new IT-based product:

  • Development stack (language, framework, and associated enablers such as database and application server)

  • Cloud provider that supports your stack

  • Version control

  • Deployment capability

You’ll be creating text files of some sort, and almost certainly importing various additional libraries, packages, modules, etc., rather than solving problems others have already figured out.

Development tools such as text editors and Integrated Development Environments (IDEs) are out of scope for this book, as they are often matters of personal choice and limited to developers’ desktops.

The assumption in this chapter is that you are going to start IMMEDIATELY with a continuous delivery pipeline. You want to set this up before developing a single line of code. This is not something to “get around to later.” It’s not that difficult (see the online resources for further discussion and pointers to relevant open-source projects). What is meant by a continuous delivery pipeline? Figure 45 presents a simplified starting overview, based on the Calavera project developed for the IT Delivery course at the University of St. Thomas in St. Paul, Minnesota [27].

Figure 45. A simple continuous delivery toolchain

  1. First, some potential for value is identified. It is refined through product management techniques into a feature: some specific set of functionality that, when complete, will enable the value proposition (i.e., as a moment of truth).

  2. The feature is expressed as some set of IT work, today usually in small increments lasting between one and four weeks (this of course varies). Software development commences, e.g., the creation of Java components by developers who first write tests, and then write code that satisfies the test.

  3. The developer is continually testing the software as the build progresses, and keeping a local source control repository up to date with their changes at all times. When a level of satisfaction is reached with the software, it is submitted to a centralized source repository.

  4. When the repository detects the new “check-in,” it contacts the build choreography manager, which launches a dedicated environment to build and test the new code. The environment is likely configured using "infrastructure as code" techniques; in this way, it can be created automatically and quickly.

  5. If the code passes all tests, the compiled and built binary executables may then be “checked in” to a package management repository.

  6. From the package repository, the code may then be deployed to various environments, for further testing and ultimately to “production,” where it can enable the consumer’s value experiences.

  7. Finally, the production system is monitored for availability and performance.
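
The flow in steps 3 through 6 can be pictured as a chain of stages in which a failure at any stage stops the change from moving further. The sketch below is purely illustrative: the `PipelineSketch` class, its stage names, and its toy checks are invented here, and real build managers and package repositories are far richer than an ordered list of checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a delivery pipeline as an ordered chain of stages.
final class PipelineSketch {

    // Each stage inspects the change and reports success (true) or failure.
    private final Map<String, Predicate<String>> stages = new LinkedHashMap<>();

    PipelineSketch stage(String name, Predicate<String> check) {
        stages.put(name, check);
        return this;
    }

    // Runs the stages in order; returns the name of the first failing stage,
    // or "deployed" if the change passes every stage.
    String run(String change) {
        for (Map.Entry<String, Predicate<String>> stage : stages.entrySet()) {
            if (!stage.getValue().test(change)) {
                return stage.getKey();
            }
        }
        return "deployed";
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch()
            .stage("build", change -> !change.isEmpty())
            .stage("test", change -> !change.contains("BUG"))
            .stage("package", change -> true);

        System.out.println(pipeline.run("add shipping options")); // deployed
        System.out.println(pipeline.run("add BUG to checkout"));  // test
    }
}
```

The point of the design is that nothing reaches “deployed” unless every earlier stage has passed, which is what lets small changes flow through quickly and safely.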

We will discuss DevOps in terms of team behaviors and culture later in the book. For now, we stay closer to the tactical and technical concerns of continuous delivery. Let’s go into more detail on the major phases.

3.5. Describing system intent

So, you’ve got an idea for a product value experience, and you have tools for creating it and infrastructure for running it. It’s time to start building a shippable product. As we will cover in more detail in the next chapter, the product development process starts with a concept of requirements (whether we call it a story, use case, or scenario is not important). Requirements are numerous and evolving, and we’re going to take some time looking at the process of converting them into IT-based functionality. There is history here back to the earliest days of computing.

In order to design and build a digital product, you need to express what you need the product to do. The conceptual tool used to do this is called a requirement. The literal word “requirement” has fallen out of favor with the rise of Agile [209], and has a number of synonyms and variations:

  • Use case

  • User story

  • Non-functional requirement

  • Epic

  • Architectural epic

  • Architectural requirement

While these may differ in terms of focus and scope, the basic concept is the same — the requirement, however named, expresses some intent or constraint the system must fulfill. This intent calls for work to be performed.

Sidebar: The troubled term “requirements”

In earlier times, the concept of “requirements” was often used as a sort of defense mechanism. Statements would often be heard such as:

“We can’t start building anything; we don’t fully understand the requirements.”

“We can’t change the requirements now, we’ve started building! Make up your mind!”

“The product is a failure because the business kept changing their mind about the requirements.”

While the term “requirements” is still used throughout much education and training, the student should be aware of this history, and the fact that many Agile practitioners discourage use of the term.

As Jeff Patton says, “… I learned the word requirements actually means shut up.” [209 p. 452].

User Story Mapping is a well known approach [209] with origins in the Scrum community. Here is an example from [67]:

“As a shopper, I can select how I want items shipped based on the actual costs of shipping to my address so that I can make the best decision.”

The basic format is,

As a < type of user >, I want < goal >, so that < some value >.
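
The basic format has three slots, which makes it easy to capture as data. Here is a minimal sketch, assuming a hypothetical `UserStory` class; the class, its field names, and the sample wording are illustrative and not drawn from any particular tool.

```java
// Hypothetical sketch: a user story captured as three fields.
final class UserStory {
    private final String userType;
    private final String goal;
    private final String value;

    UserStory(String userType, String goal, String value) {
        this.userType = userType;
        this.goal = goal;
        this.value = value;
    }

    // Renders the story in the basic "As a ..., I want ..., so that ..." format.
    String asSentence() {
        return "As a " + userType + ", I want " + goal + ", so that " + value + ".";
    }

    public static void main(String[] args) {
        UserStory story = new UserStory(
            "shopper",
            "to select how my items are shipped",
            "I can make the best decision");
        System.out.println(story.asSentence());
    }
}
```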

The story concept is flexible and can be aggregated and decomposed in various ways, as we will discuss in Chapter 4. Our interest here is in the basic stimulus for application development work that it represents.

You don’t need an extensively automated system at this stage to capture requirements, but you need something. It could be a spreadsheet, or a shared word processing document, or sticky notes on a white board (we’ll talk about Kanban in the next section). The important thing is to start somewhere, with team agreement as to what the approach is, so you can move forward collaboratively.

We will discuss approaches for “discovering” user stories and product features in Chapter 4, where Product Management is formalized. For now, as an early startup of one or two people, it is sufficient that you have some basic ability to characterize your system intent; more formalized techniques come later.

3.6. Test-driven development and refactoring

Testing software and systems is a critically important part of digital product development. The earliest concepts of waterfall development called for it explicitly, and “software tester” as a role and “software quality assurance” as a practice have long histories. Evolutionary approaches to software have a potential major issue with software testing:

As a consequence of the introduction of new bugs, program maintenance requires far more system testing per statement written than any other programming. Theoretically, after each fix one must run the entire bank of test cases previously run against the system, to ensure that it has not been damaged in an obscure way. In practice, such regression testing must indeed approximate this theoretical ideal, and it is very costly.
— Fred Brooks
Mythical Man-Month

This issue was and is well known to thought leaders in Agile software development. The key response has been the concept of automated testing so that any change in the software can be immediately validated before more development along those lines continues. One pioneering tool was JUnit:

The reason JUnit is important … is that the presence of this tiny tool has been essential to a fundamental shift for many programmers. A shift where testing has moved to a front and central part of programming. People have advocated it before, but JUnit made it happen more than anything else.
— Martin Fowler

Brooks described regression testing as “very costly” in the quote above. The emergence of tools like JUnit, coupled with increasing computer power and availability, changed the face of software development, allowing the ongoing evolution of software systems in ways not previously possible.
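
Brooks’s “entire bank of test cases” becomes cheap to rerun once the cases are expressed as code. The sketch below is a toy stand-in for what a framework like JUnit automates; `RegressionBank` and its methods are hypothetical names used only for illustration, and the two sample cases are deliberately trivial.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a bank of named test cases rerun after every change.
final class RegressionBank {
    private final Map<String, BooleanSupplier> cases = new LinkedHashMap<>();

    void add(String name, BooleanSupplier check) {
        cases.put(name, check);
    }

    // Runs every case and returns the names of those that failed.
    List<String> runAll() {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> c : cases.entrySet()) {
            if (!c.getValue().getAsBoolean()) {
                failures.add(c.getKey());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        RegressionBank bank = new RegressionBank();
        bank.add("addition still works", () -> 2 + 2 == 4);
        bank.add("trimming still works", () -> " a ".trim().equals("a"));
        // An empty list means no existing behavior was damaged by the change.
        System.out.println("Failed cases: " + bank.runAll());
    }
}
```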

3.6.1. Test-driven development

In test-driven development, the essence is to write code that tests itself, and in fact to write the test before writing any code. This is done through the creation of test harnesses and the tight association of tests with requirements. The logical culmination of test-driven development was expressed by Kent Beck in eXtreme Programming: write the test first [19]. Thus:

  1. Given a “user story” (i.e., system intent), figure out a test that will demonstrate its successful implementation.

  2. Write this test using the established testing framework.

  3. Write the code that fulfills the test.

Some readers may be thinking, “I know how to write a little code, but what is this about using code to write a test?”

While we avoid much in-depth examination of source code in this book, using some simplified Java will help. Here is an example drawn from the original Calavera project, the basis for the companion labs to this book.

Just read through the example carefully. You do not need to know Java.

Let’s say we want a function that will take a string of characters (e.g., a sentence) and wrap it in some HTML “Heading 1” tags. We will name the class “H1Class” and (by convention) we will start by developing a class called TestH1Class.

We write the test first:

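import static org.junit.Assert.assertEquals;
import org.junit.*;  // JUnit 4 annotations (@Before, @Test)
public class TestH1Class {
 private H1Class a;  // the object under test
 @Before  // runs before each test method
 public void setUp() throws Exception {
  this.a = new H1Class("TestWebMessage");
 }
 @Test
 public void testTrue() {
  assertEquals("string correctly generated",
   "<h1>TestWebMessage</h1>", this.a.webMessage());  // string built correctly
 }
}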

The code above basically states,

  1. Set up the object to be tested.

  2. Pass in a message with the content “TestWebMessage”.

  3. The test passes if we get back “<h1>TestWebMessage</h1>”, that is, the original message surrounded by <h1> and </h1> “tags,” which are part of HTML.

We run the test (e.g., through JUnit and Ant, which we won’t detail here). It will fail, because the class it tests does not exist yet. Then, we write the class:

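 public class H1Class {
  String strMsg;
  public H1Class(String inString) { strMsg = inString; }  // store the message
  public String webMessage() {
      return "<h1>" + strMsg + "</h1>";  // wrap it in Heading 1 tags
  }
 }

These are simplified examples.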

When we run the test harness correctly (e.g., using a build tool such as Ant or Maven), the test class will perform the following actions:

  1. Create an instance of the class H1Class, based on a string “TestWebMessage”.

  2. Confirm that the returned string is “<h1>TestWebMessage</h1>”.

If that string is not correctly generated, or the class cannot be created, or any other error occurs, the test fails. The failure is then reported via error results at the console or, in the case of an automated build, is detected by the build manager and displayed as the build outcome. Other languages use different approaches from that shown here, but every serious platform at this point supports test-driven development.
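For readers who want to try the test-first cycle without installing JUnit or a build tool, the following is a minimal sketch in plain Java. The names H1Sketch, wrapInH1, check, and runTests are hypothetical and are not taken from the Calavera project; the hand-rolled check method only stands in for what a framework such as JUnit would normally provide.

```java
// A minimal, framework-free sketch of the test-first cycle.
// All names here are illustrative assumptions, not the book's lab code.
class H1Sketch {

    // The code under test: wraps a message in HTML "Heading 1" tags.
    static String wrapInH1(String message) {
        return "<h1>" + message + "</h1>";
    }

    // A hand-rolled stand-in for a framework assertion such as assertEquals.
    static boolean check(String label, String expected, String actual) {
        boolean passed = expected.equals(actual);
        System.out.println((passed ? "PASS: " : "FAIL: ") + label);
        return passed;
    }

    // The "test": written first, it states what success looks like.
    static boolean runTests() {
        return check("string correctly generated",
                     "<h1>TestWebMessage</h1>",
                     wrapInH1("TestWebMessage"));
    }

    public static void main(String[] args) {
        // An uncaught error ends the program with a failing exit status,
        // which a build tool could report as a broken build.
        if (!runTests()) {
            throw new AssertionError("at least one test did not pass");
        }
    }
}
```

To see the test fail first, as the practice intends, temporarily change wrapInH1 to return the message unchanged and run the program again.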

The associated course lab provides a simple but complete example of a test-driven development environment, based on lightweight virtualization.

Employing test-driven development completely and correctly requires thought and experience. But it has emerged as a practice in the largest scale systems in the world. Google runs many millions of automated tests daily [283]. It has even been successfully employed in hardware development [113].

3.6.2. Refactoring

Refactoring is a controlled technique for improving the design of an existing code base. Its essence is applying a series of small behavior-preserving transformations, each of which is “too small to be worth doing." However, the cumulative effect of each of these transformations is quite significant. By doing them in small steps, you reduce the risk of introducing errors. You also avoid having the system broken while you are carrying out the restructuring — which allows you to refactor a system over an extended period of time gradually.
— Martin Fowler
Refactoring

Test-driven development enables the next major practice, that of refactoring. Refactoring is how you address technical debt. What is technical debt? Technical debt is a term coined by Ward Cunningham and is now defined by Wikipedia as

… the eventual consequences of poor system design, software architecture, or software development within a code base. The debt can be thought of as work that needs to be done before a particular job can be considered complete or proper. If the debt is not repaid, then it will keep on accumulating interest, making it hard to implement changes later on … Analogous to monetary debt, technical debt is not necessarily a bad thing, and sometimes technical debt is required to move projects forward. [285]

Test-driven development ensures that the system’s functionality remains consistent, while refactoring provides a means to address technical debt as part of ongoing development activities. Prioritizing the relative investment of repaying technical debt versus developing new functionality will be examined in future sections, but at least you now know the tools and concepts.
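As a hedged illustration of how a test guards a refactoring, the sketch below shows one small, behavior-preserving step: duplicated tag-wrapping logic is extracted into a single helper, and a regression check confirms that the observable output is unchanged. The class and method names are hypothetical, and keeping the “before” and “after” versions side by side is done only for comparison; in practice the old code would be replaced.

```java
// Sketch: one small, behavior-preserving refactoring step, guarded by a test.
// Class and method names are hypothetical.
class RefactoringSketch {

    // Before: the tag-wrapping logic is duplicated in each method.
    static String headingBefore(String msg)   { return "<h1>" + msg + "</h1>"; }
    static String paragraphBefore(String msg) { return "<p>" + msg + "</p>"; }

    // After: the duplication is extracted into a single helper.
    static String wrap(String tag, String msg) {
        return "<" + tag + ">" + msg + "</" + tag + ">";
    }
    static String headingAfter(String msg)   { return wrap("h1", msg); }
    static String paragraphAfter(String msg) { return wrap("p", msg); }

    // The regression check: old and new versions must agree on every sample.
    static boolean behaviorPreserved() {
        String[] samples = {"", "TestWebMessage", "a < b"};
        for (String s : samples) {
            if (!headingBefore(s).equals(headingAfter(s))) return false;
            if (!paragraphBefore(s).equals(paragraphAfter(s))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("behavior preserved: " + behaviorPreserved());
    }
}
```

The design point is that the check is run after each small step, so any step that changes behavior is caught immediately rather than after a long restructuring.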

We discuss technical debt further in Chapter 12.

3.7. Continuous integration

3.7.1. Version control, again: branching and merging

Oddly enough, it seems that when you run into a painful activity, a good tip is to do it more often.
— Martin Fowler
Foreword to Paul Duvall's Continuous Integration

Figure 46. Two developers, one file

As systems engineering approaches transform to cloud and infrastructure as code, a large and increasing percentage of IT work takes the form of altering text files and tracking their versions. We have seen this in the previous chapter, with artifacts such as scripts being created to drive the provisioning and configuring of computing resources. Approaches which encourage ongoing development and evolution are increasingly recognized as less risky since systems do not respond well to big “batches” of change. An important concept is that of “continuous integration,” popularized by Paul Duvall in his book of the same name [86].

In order to understand why continuous integration is important, it is necessary to discuss further the concept of source control and how it is employed in real-world settings. Imagine Mary has been working for some time with her partner Aparna in their startup (or on a small team) and they have three code modules (see Two developers, one file). Mary is writing the web front end (file A), Aparna is writing the administrative tools and reporting (file C), and they both partner on the data access layer (file B). The conflict, of course, arises on file B that they both need to work on. A and C are mostly independent of each other, but changes to any part of B can have an impact on both their modules.

If changes are frequently needed to B, and yet they cannot split it into logically separate modules, they have a problem; they cannot both work on the same file at the same time. Each is concerned that the other might introduce changes into B that “break” the code in her own module (A or C).

Figure 47. File B being worked on by two people

In smaller environments, or under older practices, perhaps there is no conflict, or perhaps they can agree to take turns. But even if they are taking turns, Mary still needs to test her code in A to make sure it’s not been broken by changes Aparna made in B. And what if they really both need to work on B (see File B being worked on by two people) at the same time?

Now, because they took this book’s advice and didn’t start developing until they had version control in place, each of them works on a “local” copy of the file (see illustration “File B being worked on by two people”).

That way, they can move ahead on their local workstations. But when the time comes to combine their work, they may find themselves in “merge hell.” They may have chosen very different approaches to solving the same problem, and code may need massive revision to settle on one code base. For example, in the accompanying illustration, Mary’s changes to B are represented by triangles and Aparna’s are represented by circles. They each had a local version on their workstation for far too long, without talking to each other.

Breaking a system apart by “layer” (e.g., front end versus data access) does not scale well. Microservices approaches encourage keeping data access and business logic together in functionally cohesive units. More on this in future chapters. But in this example, both developers are on the same small team. It is not always possible (or worth it) to divide work to keep two people from ever needing to change the same thing.

In the diagrams, we represent the changes graphically; of course, with real code, the different graphics represent different development approaches each person took. For example, Mary had certain needs for how errors were handled, while Aparna had different needs.

Figure 48. Merge hell

In Merge hell, where triangles and circles overlap, Mary and Aparna painstakingly have to go through and put in a consolidated error handling approach, so that the code supports both of their needs. The problem, of course, is there are now three ways errors are being handled in the code. This is not good, but they did not have time to go back and fix all the cases. This is a classic example of technical debt.

Suppose instead that they had been checking in every day. They would have identified the first collision quickly (see Catching errors quickly is valuable) and had a conversation about the best error handling approach. This would have saved them both the rework of fixing the collisions and the technical debt they might otherwise have accepted:

Figure 49. Catching errors quickly is valuable
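The idea of spotting a collision early can be sketched in code by comparing each developer's copy of file B against the common starting version. This is a deliberately simplified illustration: it assumes all three versions have the same number of lines, and it treats a “collision” as a line that both developers changed in different ways. Real version control systems such as git use more sophisticated merge algorithms, and all names and sample lines below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: find lines of a shared file that two developers changed differently.
// Assumes base, mary, and aparna all have the same number of lines.
class CollisionSketch {

    // Returns the (zero-based) indexes of lines both developers changed, differently.
    static List<Integer> collisions(List<String> base, List<String> mary, List<String> aparna) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < base.size(); i++) {
            boolean maryChanged = !base.get(i).equals(mary.get(i));
            boolean aparnaChanged = !base.get(i).equals(aparna.get(i));
            if (maryChanged && aparnaChanged && !mary.get(i).equals(aparna.get(i))) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> base   = List.of("open()", "read()", "onError: ignore", "close()");
        List<String> mary   = List.of("open()", "read()", "onError: show message", "close()");
        List<String> aparna = List.of("open()", "readAll()", "onError: write log", "close()");
        // Only the error handling line was changed by both, in different ways.
        System.out.println("colliding lines: " + collisions(base, mary, aparna));
    }
}
```

Run daily, a check like this would surface the single error handling conflict while it is still one line, rather than after weeks of divergent work.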

These problems have driven the evolution of software configuration management for decades. In previous methods, to develop a new release, the code would be copied into a very long-lived “branch” (a version of the code to receive independent enhancement). Ongoing “maintenance” fixes of the existing code base would also continue, and the two code bases would inevitably diverge. Switching over to the “new” code base might mean that once-fixed bugs (bugs that had been addressed by maintenance activities) would show up again, and logically, this would not be acceptable. So, when the newer development was complete, it would need to be merged back into the older line of code, and this was rarely if ever easy (again, “merge hell”). In a worst case scenario, the new development might have to be redone.

Figure 50. Big bang versus continuous integration

Enter continuous integration (see Big bang versus continuous integration). As presented in [86], the key practices (you will notice similarities to the pipeline discussion) include:

  • Developers run private builds including their automated tests before committing to source control

  • Developers check in to source control at least daily (hopefully, we have been harping on this enough that you are taking it seriously by now)

    • Distributed version control systems such as git are especially popular, although older centralized products are starting to adopt some of their functionality

  • Integration builds happen several times a day or more on a separate, dedicated machine

  • 100% of tests must pass for each build, and fixing failed builds is the highest priority

  • A package or similar executable artifact is produced for functional testing

  • A defined package repository exists as a definitive location for the build output
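The rule that 100% of tests must pass for each build can be sketched as a simple gate. The following is a minimal illustration in plain Java, not how any particular build tool is implemented; the names BuildGateSketch, buildPasses, and the two sample tests are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Sketch: an integration build is "green" only if every test passes.
class BuildGateSketch {

    // Stand-in for the code being built and tested.
    static String wrapInH1(String message) {
        return "<h1>" + message + "</h1>";
    }

    // Runs every test (to report all failures) and returns true only if all passed.
    static boolean buildPasses(Map<String, BooleanSupplier> tests) {
        boolean allPassed = true;
        for (Map.Entry<String, BooleanSupplier> test : tests.entrySet()) {
            boolean passed = test.getValue().getAsBoolean();
            System.out.println((passed ? "PASS " : "FAIL ") + test.getKey());
            allPassed = allPassed && passed;
        }
        return allPassed;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> tests = new LinkedHashMap<>();
        tests.put("heading is wrapped", () -> "<h1>x</h1>".equals(wrapInH1("x")));
        tests.put("empty message allowed", () -> "<h1></h1>".equals(wrapInH1("")));
        System.out.println(buildPasses(tests)
                ? "BUILD OK: produce the package"
                : "BUILD BROKEN: fixing it is the highest priority");
    }
}
```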

These practices are well developed and represent a highly evolved understanding gained through the painful trial and error of many development teams over many years. Rather than locking a file so that only one person can work on it at a time, it’s been found that the best approach is to allow developers to actually make multiple copies of such a file or file set and work on them simultaneously. Wait, you say. How can that work?

This is the principle of continuous integration at work. If the developers are continually pulling each other’s work into their own working copies, and continually testing that nothing has broken, then distributed development can take place. So, if you are a developer, the day’s work might be as follows:

8 AM: You check out files from the master source repository to a local branch on your workstation. Because files are not committed unless they pass all tests, you know that you are checking out clean code. You pull a user story (requirement) that you will now develop.

8:30 AM: You define a test and start developing the code to fulfill it.

10 AM: You are closing in on wrapping up the first requirement. You check the source repository. Your partner has checked in some new code, so you pull it down to your local repository. You run all the automated tests and nothing breaks, so you’re fine.

10:30 AM: You complete your first update of the day; it passes all tests on your workstation. You commit it to the master repository. The master repository is continually monitored by the build manager, which takes the code you created and deploys it, along with all necessary configurations, to a dedicated build server (which might be just a virtual machine or transient container). All tests pass there (the test you defined as indicating success for the module, as well as a host of older tests that are routinely run whenever the code is updated).

11:00 AM: Your partner pulls your changes into their working directory. Unfortunately, some changes you made conflict with some work they are doing. You briefly consult and figure out a mutually acceptable approach.

Controlling simultaneous changes to a common file is only one benefit of continuous integration. When software is developed by teams, even if each team has its own artifacts, the system often fails to “come together” for higher-order testing to confirm that all the parts are working correctly together. Discrepancies are often found in the interfaces between components; when component A calls component B, it may receive output it did not expect and processing halts. Continuous integration ensures that such issues are caught early.

3.7.2. Build choreography

Go back to the pipeline picture and consider step 4. While we discussed version control, package management, and deployment management in Chapter 2, this is our first encounter with build choreography.

DevOps and continuous delivery call for automating everything that can be automated. This goal led to the creation of build choreography managers such as Hudson, Jenkins, Travis CI, and Bamboo. Build managers may control any or all of the following steps:

  • Detecting changes in version control repositories and building software in response

  • Alternatively, building software on a fixed (e.g., nightly) schedule

  • Compiling source code and linking it to libraries

  • Executing automated tests

  • Combining compiled artifacts with other resources into installable packages

  • Registering new and updated packages in the package management repository, for deployment into downstream environments

  • In some cases, driving deployment into downstream environments, including production. (This can be done directly by the build manager, or through the build manager sending a message to a deployment management tool.)
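The ordering of these steps can be sketched as a small pipeline that runs named stages in sequence and stops at the first failure. This is only an illustration of the choreography idea, not the design of Jenkins or any other product named above; the stage names and the always-succeeding placeholder stages are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Sketch: a build manager runs stages in order and halts on the first failure.
class PipelineSketch {

    // Returns the names of the stages that completed successfully.
    static List<String> run(Map<String, BooleanSupplier> stages) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> stage : stages.entrySet()) {
            if (!stage.getValue().getAsBoolean()) {
                System.out.println("stage failed: " + stage.getKey());
                break;  // later stages never see a broken build
            }
            completed.add(stage.getKey());
        }
        return completed;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so stages run as listed.
        Map<String, BooleanSupplier> stages = new LinkedHashMap<>();
        stages.put("compile", () -> true);           // placeholder for real work
        stages.put("test", () -> true);              // placeholder for real work
        stages.put("package", () -> true);           // placeholder for real work
        stages.put("register package", () -> true);  // placeholder for real work
        System.out.println("completed stages: " + run(stages));
    }
}
```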

Build managers play a critical, central role in the modern, automated pipeline and will likely be a center of attention for the new digital professional in their career.

3.8. Releasing software

3.8.1. Continuous deployment

(see Deployment)

Figure 51. Deployment

Once the software is compiled and built, the executable files that can be installed and run operationally should be checked into a package manager. At that point, the last-mile steps can be taken: the now tested and built software is deployed to pre-production or production environments (see Deployment). In pre-production, the software can undergo usability testing, load testing, integration testing, and so forth. Once those tests are passed, it can be deployed to production.

(What is “production,” anyway? We’ll talk about environments in Section 2. For now, you just need to know that when an IT-based product is “in production,” that means it is live and available to its intended base of end users or customers).

Moving new code into production has always been a risky procedure. Changing a running system always entails some uncertainty. However, the practice of infrastructure as code coupled with increased virtualization has reduced the risk. Often, a rolling release strategy is employed so that code is deployed to small sets of servers while other servers continue to service the load. This requires careful design to allow the new and old code to co-exist at least for a brief time.
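A rolling release can be sketched as follows: servers are divided into small batches, each batch receives the new code in turn, and the rollout halts if a batch fails its health check, leaving the remaining servers on the old code. This is a simplified illustration under stated assumptions; the server names, the batch size, and the health check are hypothetical, and real deployment tools add load balancer draining, rollback, and more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch: deploy to small batches of servers, verifying each batch before continuing.
class RollingReleaseSketch {

    // Splits the server list into batches of at most batchSize servers.
    static List<List<String>> batches(List<String> servers, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < servers.size(); i += batchSize) {
            int end = Math.min(i + batchSize, servers.size());
            result.add(new ArrayList<>(servers.subList(i, end)));
        }
        return result;
    }

    // Returns the servers that received the new code before the rollout finished or halted.
    static List<String> roll(List<String> servers, int batchSize, Predicate<String> healthy) {
        List<String> upgraded = new ArrayList<>();
        for (List<String> batch : batches(servers, batchSize)) {
            upgraded.addAll(batch);                    // deploy new code to this batch
            if (!batch.stream().allMatch(healthy)) {   // verify before moving on
                break;                                 // remaining servers keep the old code
            }
        }
        return upgraded;
    }

    public static void main(String[] args) {
        List<String> servers = List.of("web1", "web2", "web3", "web4", "web5");
        // Hypothetical health check: pretend web3 misbehaves on the new code.
        System.out.println("upgraded: " + roll(servers, 2, s -> !s.equals("web3")));
    }
}
```

While one batch is being upgraded, the other servers continue to serve the load, which is why the old and new code must be able to coexist briefly.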

Checking built packages into the package manager is important so that the versions of software used in production are well controlled and consistent. The package manager can then be associated with some kind of deploy tool that keeps track of what versions are associated with which infrastructure.

Timing varies by organization. Some strive for true “continuous deployment”, in which the new code flows seamlessly from developer commit through build, test, package and deploy. Others put gates in between the developer and check-in to mainline, or source-to-build, or package-to-deploy so that some human governance remains in the toolchain. We will go into more detail on these topics in Chapter 6.

3.8.2. The concept of “release”

Release management, and the concept of a “release,” are among the most important and widely-seen concepts in all forms of digital management. Regardless of whether you are a cutting-edge Agile startup with 2 people or one of the largest banks with a portfolio of thousands of applications, you will likely be using releases for coordination and communication.

What is a “release”? Betz defined it this way in other work: “A significant evolution in an IT service, often based on new systems development, coordinated with affected services and stakeholders.” Release management’s role is to “Coordinate the assembly of IT functionality into a coherent whole and deliver this package into a state in which the customer is getting the intended value” [23 p. 68, 23 p. 119].

Even as a startup of one, you know when your product is “ready for prime time” or not. Your first “release” should be a cause for some celebration: you now are offering digital value to some part of the world.

We will talk much more about release management in Parts II and III. At this point, you may not see it as much different from simply deploying, but even at the smallest scale, you start to notice that some changes require more thought and communication than others. Simple bugfixes, or self-explanatory changes to the product’s interfaces, can flow through your pipeline. But if you are going to change the system radically (even if you only have one customer), you need to communicate with them about this. And so, the concept of release management becomes part of your world.

3.9. Application development topics

3.9.1. Application architecture

Figure 52. Software architecture tool

The design and architecture of applications is a large topic (see, for example, [104, 96]) and this text only touches lightly on it. ISO/IEC 42010 defines architecture as “The fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution” [144].

A computer program can be as simple as Hello World. Such a program requires only one or a few files to compile and execute. However, significant applications and systems require hardware and software configurations of tremendous complexity. Specialized visual notations are used to describe this complexity (see Software architecture tool [23]). We will discuss this further in Chapter 12.

3.9.2. Applications and project management

Because the initial applications were implemented as a kind of systems engineering and were expensive to build and maintain, the technique of choice was project management. Project management will not be covered in this book until Part III, as it is not appropriate to the earlier stages of this book’s emergence model.

This history of project managed systems engineering produced any number of successes, but by the early 1990s there were significant concerns about the rate of large project failures [107], which occurred despite seemingly extensive and rigorous bureaucratic overhead, evidenced by frameworks such as CMMI and PMBOK. Both project management and CMMI have come in for significant criticism [162, 150] and will be discussed further in Sections 3 and 4.

3.10. Conclusion

Applications are why computers exist. Supporting applications is increasingly less about systems engineering, and more about quickly provisioning standard, shared infrastructure. Application development has moved decisively in the past 20 years to Agile delivery models, based on techniques such as:

  • Story mapping

  • Test-driven development

  • Refactoring

  • Continuous integration

  • Continuous deployment

Application delivery, software development, and the Agile movement are broad, complex, and evolving topics. For those of you familiar with Agile, we have only scratched the surface in this chapter. In future chapters, we will go into more detail on topics such as:

  • Product management, including behavior-driven development and continuous design (much more on requirements, user stories, etc.).

  • The importance of feedback

  • Prioritization and cost of delay

  • Scrum and Kanban

  • Tracking tasks and effort

  • Closing the loop from operations to development, and coping with interrupt-driven work

and much more.

3.10.1. Discussion questions

  • What is your exposure to application programming?

  • Can you think of examples of waterfall and agile approaches in your daily life (not necessarily related to IT)?

  • Have you been on a project that needed more planning (IT or not)? For example, have you ever gone to the hardware store 5 times in one day, and felt by the end that you should have thought a little more at first?

  • Have you ever been in a situation where planning never seemed to end? Why?

  • If you are a developer, read Things You Should Never Do, Part I. What do you think? Are you ever tempted to re-write something instead of figuring out how it works?

3.10.2. Research & practice

  • Review the debates over Agile in IEEE Software in the early 2000s and write a retrospective report on the thinking at the time.

  • Review Amazon’s AWS CodePipeline

Part I Conclusion

We are now at the end of the first section, and one quarter of the way through this book (and this course, assuming you are taking a semester-long treatment).

You are now the proud leader of a functioning startup. You have decided to provide some product that at least partially depends on an IT-based component that you need to actively develop.

You understand the value that IT brings and your own product’s needs for it. You have chosen at least a functioning platform for initial development, without falling into the trap of analysis paralysis, although at this point you should be keeping your options open if your initial platform choice doesn’t prove out.

Finally, you have implemented at least a lightweight continuous delivery pipeline. You didn’t need to spend any money doing this, as so much powerful technology is freely available. In particular, from the start, you have taken version control very seriously and have a stable, backed up source repository as a basis for your product development.

You also have at least rudimentary systems for tracking requirements, building your software, storing your packages, and deploying them to your production environment.

Congratulations, you’ve got all the basics in place. Your product is starting to attract sales and/or investors, and you’ve hired a few more people. Let’s talk about collaboration.

Staying at Level 1

In your career, many (perhaps even the majority) of the people you meet and work with will be focused on Level 1 in their thinking and approach. This is a fine thing and to be expected. Level 1 is where the real work is done.

Part II — Team

Scenario

Your startup has met with some success, and is now a team. (If you are in an enterprise, you’ve been promoted to team lead.) You’ve moved out of the garage into a more professionalized environment. The team has a single mission and a cohesive identity, and you don’t need a lot of overhead to get the job done.

Even with a few new people comes the need to more clearly establish your product direction, so people are building the right thing. You’re all in the same location, and can still communicate informally, but there is enough going on that you need a more organized approach to getting work done. Finally, this great thing you’re building doesn’t mean much if people cannot understand how best to use it, or if it’s not running and the right people can’t get to it.

Things are getting larger and more complex. You have a significant user base, and the founder is increasingly out meeting with users, customers, and investors. As a result, she isn’t in the room with the product team as much any more; in fact, she just named someone to be “product owner,” and what is that all about?

The practices and approaches established at the team level are critical to the higher levels of scale discussed in Sections 3 and 4. In this section, we discuss small, cross-functional, outcome-oriented teams. We discuss product management, work management, shared mental models, visualization, and systems monitoring. We discuss collaboration and customer intimacy, and the need to limit work in process. And we discuss blameless cultures where people are safe to fail and learn. All of these are critical foundations for future growth; scaling success starts with building a strong team level.

Special section: Systems thinking and feedback

One of the most important aspects of DevOps and Agile is “systems thinking,” and even a small team building one digital product can be viewed as a system. We talk of information systems, but what is a “system”? What is feedback? There is a rich body of knowledge describing these topics, which we will touch on in this special section.

Chapter 4: Product Management

You (as the startup leader) are spending more time with investors and customers, and maintaining alignment around your original product vision is becoming more challenging as you are pulled in various directions. You need some means of keeping the momentum here. And the concept of “product management,” you’re finding, represents a rich set of ideas for managing your team’s efforts at this stage of the game.

Chapter 5: Work Management

Even with a small team of five people (let alone eight or nine), it’s too easy for balls to get dropped as work moves between key contributors. You probably don’t need a complex software-based process management tool yet, but you do need some way of managing work in process. And you start to understand that work takes many forms and exists as a concept at different scales.

Chapter 6: Operations Management

Since Chapter 3, your application developers have been running your systems and even answering the occasional phone call from customers. You’re big enough that you need a bit more specialization. You’ve got dedicated support staff answering the phone calls, and you are finding that, even if you rotate operational responsibilities across developers, it is still a distinct kind of “interrupt-driven” work that is not compatible with heads-down, focused software development. You’ve probably seen by now that complex systems are fragile and tend to fail; how you learn (or don’t) from those failures is a critical question.

Part II, like the other parts, needs to be understood as a unified whole. In reality, startups struggle with the issues in all three chapters simultaneously.

Special section: Systems thinking and feedback

So, what is a system? A system is a set of things—people, cells, molecules, or whatever—interconnected in such a way that they produce their own pattern of behavior over time. The system may be buffeted, constricted, triggered, or driven by outside forces. But the system’s response to these forces is characteristic of itself, and that response is seldom simple in the real world.
— Donella Meadows
Thinking in Systems

Systems thinking, and systems theory, are broad topics extending far beyond IT and the digital profession. Donella Meadows defined a system as “an interconnected set of elements that is coherently organized in a way that achieves something” [184]. Systems are more than the sum of their parts; each part contributes something to the greater whole, and often the behavior of the greater whole is not obvious from examining the parts of the system.

Systems thinking is an important influence on digital management. Digital systems are complex, and when the computers and software are considered in combination with the people using them, we have a sociotechnical system. Digital systems management seeks to create, improve, and sustain these systems.

A digital management capability is itself a complex system. While the term “Information Systems (IS)” was widely replaced by “information technology (IT)” in the 1990s, do not be fooled. Enterprise IT is a complex sociotechnical system that delivers the digital services to support a myriad of other complex sociotechnical systems.

The Merriam-Webster dictionary defines a system as “a regularly interacting or interdependent group of items forming a unified whole." These interactions and relationships quickly take center stage as you move from individual work to team efforts. Consider that while a two member team only has one relationship to worry about, a ten member team has 45, and a 100 person team has 4,950!
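The relationship counts above follow from the number of distinct pairs among n people, n × (n − 1) / 2. A minimal sketch of the arithmetic, with hypothetical class and method names:

```java
// Sketch: the number of one-to-one relationships in a team of n people.
class TeamRelationships {

    // Each of the n people can pair with n - 1 others; dividing by 2
    // avoids counting the same pair twice.
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        for (long n : new long[] {2, 10, 100}) {
            System.out.println(n + " people: " + pairs(n) + " relationships");
        }
    }
}
```

The count grows roughly with the square of team size, which is why coordination effort rises so quickly as teams grow.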

A thorough discussion of systems theory is beyond the scope of this book. However, many of the ideas that follow are informed by it. Obtaining a working knowledge of systems theory will not only enhance your understanding of this book, but it can also be an essential tool for managing uncertainty in your future career, teams, and organizations. If you are interested in this topic, you might start with Thinking in Systems: A Primer by Donella Meadows [184] and then read An Introduction to General Systems Thinking by Gerald Weinberg [279].

A brief introduction to feedback

The harder you push, the harder the system pushes back.
— Peter Senge
The Fifth Discipline

As the Senge quote implies, brute force does not scale well within the context of a system. One of the reasons for systems stability is feedback. Within the bounds of the system, actions lead to outcomes, which in turn affect future actions. This is a positive thing, as it is required to keep a complex operation on course.

Feedback is a loaded term. We hear terms like positive feedback and negative feedback and quickly associate them with performance coaching and management discipline. That is not the sense of feedback in this book. The definition of feedback as used in this book is based on engineering. There is considerable related theory in general engineering and especially control theory, and the reader is encouraged to investigate some of these foundations if unfamiliar.

In Reinforcing feedback loop we see the classic illustration of a reinforcing feedback loop:

Figure 53. Reinforcing feedback loop

For example (as in Reinforcing (positive?) feedback, with rabbits), we can consider “rabbit reproduction” as a process with a reinforcing feedback loop.

Figure 54. Reinforcing (positive?) feedback, with rabbits

The more rabbits, the faster they reproduce, and the more rabbits. This is sometimes called a “positive” feedback loop, although Mr. McGregor, the local gardener, may not agree, given that they are eating all his cabbages! This is why feedback experts (e.g., [258]) prefer to call this “reinforcing” feedback: there is not necessarily anything “positive” about it.
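A reinforcing loop can be sketched as a simulation in which each step's growth is proportional to the current population, so growth feeds on itself. The growth rate and starting population below are arbitrary assumptions chosen for illustration, not figures from the text.

```java
import java.util.Arrays;

// Sketch: a reinforcing feedback loop. More rabbits lead to more births,
// which lead to more rabbits.
class ReinforcingLoopSketch {

    // Returns the population at each step, starting with the initial value.
    static double[] simulate(double initial, double growthRate, int steps) {
        double[] population = new double[steps + 1];
        population[0] = initial;
        for (int t = 1; t <= steps; t++) {
            double births = growthRate * population[t - 1];  // output feeds back as input
            population[t] = population[t - 1] + births;
        }
        return population;
    }

    public static void main(String[] args) {
        // With a growth rate of 1.0 the population doubles at every step.
        System.out.println(Arrays.toString(simulate(10, 1.0, 5)));
    }
}
```

Nothing in this loop ever slows the growth, which is the defining feature of reinforcing feedback.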

We can also consider feedback as the relationship between TWO processes (see Feedback between two processes).

Figure 55. Feedback between two processes

In our rabbit example, what if Process B is fox reproduction, that is, the birth rate of foxes (who eat rabbits)? See Balancing (negative?) feedback, with rabbits and foxes.

Figure 56. Balancing (negative?) feedback, with rabbits and foxes

More rabbits equal more foxes (notice the “+” symbol on the line) because there are more rabbits to eat! But what does this do to the rabbits? It means FEWER rabbits (the “--” on the line). Which, ultimately, means fewer foxes, and at some point, the populations balance. This is classic negative feedback. However, the local foxes don’t see it as negative (nor do the local gardeners)! That is why feedback experts prefer to call this “balancing” feedback. Balancing feedback can be an important part of a system’s overall stability.
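The difference between the two kinds of loop can be sketched numerically. The following is only an illustration under stated assumptions, not a population model taken from the cited texts: the function names, the birth, predation, and fox-response rates, and the starting populations are all hypothetical values chosen to show the shape of the behavior.

```python
def reinforcing(rabbits, birth_rate, steps):
    """Reinforcing loop: more rabbits -> more births -> more rabbits."""
    history = [rabbits]
    for _ in range(steps):
        rabbits += birth_rate * rabbits
        history.append(rabbits)
    return history


def balancing(rabbits, foxes, steps, birth_rate=0.1, predation=0.01,
              foxes_per_rabbit=0.1, fox_response=0.5):
    """Rabbits and foxes joined by a '+' link and a '-' link.

    '+' link: foxes drift toward a level proportional to the rabbits.
    '-' link: each fox removes a share of the rabbits every step.
    All rates are made-up illustration values (assumptions).
    """
    history = [(rabbits, foxes)]
    for _ in range(steps):
        eaten = predation * foxes * rabbits
        supported_foxes = foxes_per_rabbit * rabbits
        rabbits = rabbits + birth_rate * rabbits - eaten
        foxes = foxes + fox_response * (supported_foxes - foxes)
        history.append((rabbits, foxes))
    return history


if __name__ == "__main__":
    print(reinforcing(50.0, 0.1, 50)[-1])   # keeps growing without limit
    print(balancing(50.0, 2.0, 500)[-1])    # settles at a steady level
```

With these assumed rates, the rabbits-only run grows without bound, while the rabbits-and-foxes run settles near 100 rabbits and 10 foxes, the point at which births and predation cancel out.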

Wikipedia has good articles on Causal Loop Diagramming and System Dynamics (with cool dynamic visuals). [258] is the definitive text with applications.

Still confused? Think about the last time you saw a “reply-all” email storm. The first accidental mass send generates feedback (emails saying “take me off this list”), which generate more emails (“stop emailing the list”), and so on. This does not continue indefinitely; management intervention, common sense, and fatigue eventually damp the storm down.

What does systems thinking have to do with IT?

In an engineering sense, positive feedback is often dangerous and a topic of concern. A recent example of bad positive feedback in engineering is the London Millennium Bridge (see London Millennium Bridge [24]). On opening, the Millennium Bridge started to sway alarmingly, due to resonance and feedback which caused pedestrians to walk in cadence, increasing the resonance issues. The bridge had to be shut down immediately and retrofitted with $9 million worth of tuned dampers [72].

Figure 57. London Millennium Bridge

As with bridges, at a technical level, reinforcing feedback can be a very bad thing in IT systems. In general, any process that is self-amplified without any balancing feedback will eventually consume all available resources, just like rabbits will eat all the food available to them. So, if you create a process (e.g., write and run a computer program) that recursively spawns itself, it will sooner or later crash the computer as it devours memory and CPU. See Runaway processes.
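The runaway behavior described above can be shown safely as a simulation rather than with real processes. In this minimal sketch, the doubling rule, the capacity figure, and the optional `limit` parameter are assumptions made for illustration; the limit stands in for any balancing mechanism, such as a per-user process quota.

```python
def simulate_spawning(capacity, ticks, limit=None):
    """Simulate a task that spawns a copy of itself on every tick.

    The process count doubles each tick (reinforcing feedback). If `limit`
    is given, spawning is throttled so the count never exceeds it
    (balancing feedback). Returns (history, crashed).
    """
    processes = 1
    history = [processes]
    for _ in range(ticks):
        wanted = processes * 2
        if limit is not None:
            wanted = min(wanted, limit)   # the balancing loop
        processes = wanted
        history.append(processes)
        if processes > capacity:          # resources are exhausted
            return history, True
    return history, False


if __name__ == "__main__":
    print(simulate_spawning(capacity=1000, ticks=20))
    print(simulate_spawning(capacity=1000, ticks=20, limit=800))
```

Without a limit, the simulated machine is exhausted after ten doublings; with one, the count levels off below capacity and the system keeps running.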

Balancing feedback, on the other hand, is critical to making sure you are “staying on track.” Engineers use concepts of control theory, for example, damping, to keep bridges from falling down.

Remember in Chapter 1 we talked of the user’s value experience, and also how services evolve over time in a lifecycle? In terms of the dual-axis value chain, there are two primary digital value experiences:

  • The value the user derives from the service (e.g., account lookups, or a flawless navigational experience)

  • The value the investor derives from monetizing the product, or comparable incentives (e.g., non-profit missions)

Additionally, the product team derives career value. This becomes more of a factor later in the game. We will discuss this further in Chapter 7 (on organization) and Part IV (on architecture lifecycles and technical debt).

The product team receives feedback from both value experiences. Day-to-day interactions with the service (e.g., through the help desk and operations) provide feedback from users, and (typically on a more intermittent basis) the portfolio investor also feeds information back to the product team (the boss’s boss comes for a visit).

Balancing feedback in a business and IT context takes a wide variety of forms:

  • The results of a product test in the marketplace; for example, users' preference for a drop down box versus check boxes on a form

  • The product owner clarifying for developers their user experience vision for the product, based on a demonstration of developer work in process

  • The end users calling to tell you the “system is slow” (or down)

  • The product owner or portfolio sponsor calling to tell you they are not satisfied with the system’s value

In short, we see these two basic kinds of feedback:

  • Positive/Reinforcing, “do more of that”

  • Negative/Balancing, “stop doing that,” “fix that”

You should consider:

  • How you are accepting and executing on feedback signals

  • How the feedback relationship with your investors is evolving, in terms of your product direction

  • How the feedback relationship with your users is evolving, in terms of both operational criteria and product direction

One of the most important concepts related to feedback, one we will keep returning to, is that product value is based on feedback. We’ve discussed Lean Startup, which represents a feedback loop intended to discover product value. Don Reinertsen, whose work we will discuss in this chapter, has written extensively on the importance of fast feedback to the product discovery process.

Reinforcing feedback: the special case investors want

At a business level, there is a special kind of reinforcing feedback that defines the successful business (see The reinforcing feedback businesses want).

Figure 58. The reinforcing feedback businesses want

This is reinforcing feedback and positive for most people involved: investors, customers, employees. At some point, if the cycle continues, it will run into balancing feedback:

  • Competition

  • Market saturation

  • Negative externalities (regulation, pollution, etc.).

But those are problems the business wants to have.

Open versus closed-loop systems

Finally, we should talk briefly about open-loop versus closed-loop systems.

  • Open-loop systems have no regulation, no balancing feedback

  • Closed-loop systems have some form of balancing feedback

In navigation terminology, the open-loop attempt to stick to a course without external information (e.g., navigating in the fog, without radar or communications) is known as dead reckoning, in part because it can easily get you dead!

A good example of an open-loop system is the children’s game “pin the tail on the donkey” (see Pin the tail on the donkey [25]). In “pin the tail on the donkey,” a person has to execute a process (pinning a paper or cloth “tail” onto a poster of a donkey — no live donkeys are involved!) while blindfolded, based on their memory of their location (and perhaps after being deliberately disoriented by spinning in circles). Since they are blindfolded, they have to move across the room and pin the tail without the ongoing corrective feedback of their eyes. (Perhaps they are getting feedback from their friends, but perhaps their friends are not reliable).

Figure 59. Pin the tail on the donkey

Without the blindfold, it would be a closed-loop system. The person would rise from their chair and, through the ongoing feedback of their eyes to their central nervous system, would move towards the donkey and pin the tail in the correct location. In the context of a children’s game, the challenges of open-loop may seem obvious, but an important aspect of IT management over the past decades has been the struggle to overcome open-loop practices. Reliance on open-loop practices is arguably an indication of a dysfunctional culture. An IT team that is designing and delivering without sufficient corrective feedback from its stakeholders is an ineffective, open-loop system. Mark Kennaley [151] applies these principles to software development in much greater depth, and is recommended.

No system can ever be fully “open-loop” indefinitely. Sooner or later, you take off the blindfold or wind up on the rocks.

Engineers of complex systems use feedback techniques extensively. Complex systems do not work without them.
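The contrast between open-loop and closed-loop operation can be sketched in a few lines. This is an illustration under simple assumptions rather than a real control design: the walker moves along one dimension, `drift` is a hypothetical constant error added to every step (the disorientation from spinning), and the closed-loop version uses a plain proportional correction with an assumed `gain`.

```python
def open_loop(target, steps, drift):
    """Plan once, then walk blindfolded: the plan is never revised."""
    position = 0.0
    planned_step = target / steps
    for _ in range(steps):
        position += planned_step + drift   # drift is never observed
    return position


def closed_loop(target, steps, drift, gain=0.5):
    """Look at the remaining distance before every step and correct."""
    position = 0.0
    for _ in range(steps):
        error = target - position          # feedback: observe the gap
        position += gain * error + drift   # act on what was observed
    return position


if __name__ == "__main__":
    print(open_loop(target=10.0, steps=20, drift=0.2))
    print(closed_loop(target=10.0, steps=20, drift=0.2))
```

In the open-loop run the drift accumulates with every step, so the walker ends about 4 units past the target. In the closed-loop run the error stays bounded at about 0.4 units, because each observation lets the walker cancel most of the drift picked up so far.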


After the Korean War, the U.S. Air Force wished to clarify why its pilots had outperformed opposing pilots who were flying aircraft viewed as more capable. A colonel named John Boyd was tasked with researching the problem. His conclusions are based on the concept of feedback cycles, and how fast humans can execute them. Boyd determined that humans go through a defined process in building their mental model of complex and dynamic situations. This has been formalized in the concept of the OODA loop (see OODA loop [26]).

Figure 60. OODA loop

OODA stands for:

  • Observe

  • Orient

  • Decide

  • Act

Because the U.S. fighters were lighter, more maneuverable, and had better visibility, their pilots were able to execute the OODA loop more quickly than their opponents, leading to victory. Boyd and others have extended this concept into various other domains including business strategy. The concept of the OODA feedback loop is frequently mentioned in presentations on Agile methods. Tightening the OODA loop accelerates the discovery of product value and is highly desirable.

The DevOps consensus as systems thinking

We covered continuous delivery and introduced DevOps in the previous chapter. Systems theory provides us with powerful tools to understand these topics more deeply.

Figure 61. Change versus stability

One of the assumptions we encounter throughout digital management is the idea that change and stability are opposing forces. In systems terms, we might use a diagram like Change versus stability (see [26] for the original exploration). As a Causal Loop Diagram (CLD), it says that Change and Stability are opposed: the more we have of one, the less we have of the other. This is true, as far as it goes: most systems issues occur as a consequence of change; systems that are not changed in general do not crash as much.

Figure 62. Change vicious cycle

The trouble with viewing change and stability as diametrically opposed is that change is inevitable. If simple delaying tactics are put in, these can have a negative impact on stability, as in Change vicious cycle. What is this diagram telling us? If the owner of the system tries to prevent change, a larger and larger backlog will accumulate. This usually results in larger- and larger-scale attempts to clear the backlog (e.g., large releases or major version updates). These are riskier activities which increase the likelihood of change failure. When changes fail, the backlog is not cleared and continues to increase, leading to further temptation for even larger changes.

How do we solve this? Decades of thought and experimentation have resulted in continuous delivery and DevOps, which can be shown in terms of systems thinking in The DevOps consensus.

Figure 63. The DevOps consensus

To summarize a complex set of relationships:

  • As change occurs more frequently, it enables smaller change sizes.

  • Smaller change sizes are more likely to succeed (as change size goes up, change success likelihood goes down; hence, it is a balancing relationship).

  • As change occurs more frequently, organizational learning happens (change capability). This enables more frequent change to occur, as the organization learns. This has been summarized as, “if it hurts, do it more” (Martin Fowler in [86]).

  • The improved change capability, coupled with the smaller perturbations of smaller changes, together result in improved change success rates.

  • Improved change success, in turn, results in improved system stability and availability, even with frequent changes. Evidence supporting this de facto theory is emerging across the industry and can be seen in cases presented at the DevOps Enterprise Summit and discussed in The DevOps Handbook [154].

Notice the reinforcing feedback loop (the “R” in the looped arrow) between change frequency and change capability. Like all diagrams, this one is incomplete. Just making changes more frequently will not necessarily improve the change capability; a commitment to improving practices such as monitoring, automation, and so on is required, as the organization seeking to release more quickly will discover.
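The link between change size and change success can be made concrete with a deliberately crude model. Everything in this sketch is an assumption for illustration: each unit of change is taken to carry an independent `defect_rate` chance of breaking the release, a release is all-or-nothing (one defect rolls back the whole batch), and the learning effect on change capability is not modeled at all.

```python
def release_success_probability(change_size, defect_rate):
    """Chance that a release succeeds if every unit in it must be clean."""
    return (1.0 - defect_rate) ** change_size


def expected_delivered(total_changes, releases, defect_rate):
    """Expected units delivered when the work is split into equal releases.

    A failed release is assumed to deliver nothing, so its changes
    return to the backlog.
    """
    size = total_changes / releases
    p = release_success_probability(size, defect_rate)
    return releases * size * p


if __name__ == "__main__":
    # 120 units of change over a quarter, at an assumed 1% defect rate.
    print(expected_delivered(120, releases=12, defect_rate=0.01))  # weekly
    print(expected_delivered(120, releases=1, defect_rate=0.01))   # one big release
```

Under these assumptions a weekly release of 10 units succeeds roughly 90% of the time, while a single quarterly release of 120 units succeeds roughly 30% of the time, so the frequent-release strategy is expected to deliver about three times as much of the backlog.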

4. Product Management

4.1. Introduction

Product management?

Why product management in a book on IT management? Those of you with industry experience, especially backgrounds in project-based enterprise software development, may be unfamiliar with the term. However, a focus on product development is one of the distinguishing features of Agile development, even if that development is taking place in a larger enterprise context.

As you grow your company, you are bringing more people in. You become concerned that they need to share the same vision that inspired you to create this company. This is the goal of product management as a formalized practice.

Product strategy was largely tacit in Part I. As the founder, you used product management and product discovery practices, and may well be familiar with the ideas here, but the assumption is that you did not explicitly formalize your approach to them. Now you need a more prescriptive and consistent approach to discovering, defining, designing, communicating, and executing a product vision across a diverse team.

In this chapter, we will define and discuss product management, and distinguish it from project and process management. We will cover how product teams are formed and what practices and attitudes you should establish quickly.

We will discuss a number of specific schools of thought and practices, including Gothelf’s Lean UX, Scrum, and more specific techniques for product “discovery.” Finally, we will discuss the concepts of design and design thinking.

4.1.1. Chapter 4 outline

  • Why product management?

    • The product vision

    • Defining product management

    • Process, project, and product management

    • Productization as a strategy at Amazon

  • Organizing the product team

    • The concept of collaboration

    • Lean UX

    • Scrum

    • More on product team roles

  • Product discovery

    • Formalizing product discovery

    • Product discovery techniques

    • Discovery and design

    • Design

  • Assorted topics in Product Management

4.1.2. Chapter 4 learning objectives

  • Define and distinguish product versus project and process management

  • Identify the key concerns of forming a collaborative product team

  • Describe current product-oriented practices, such as Lean UX and Scrum

  • Describe product design and discovery practices and concerns

4.2. Why product management?

4.2.1. The product vision

You should review the digital context material in Chapter 1.

Before work, before operations, there must be a vision of the product. You already established a preliminary vision in Chapter 1, but now as your organization grows, you need to consider further how you will sustain that vision and establish an ongoing capability for realizing it. Like many other topics in this book, product management is a significant field in and of itself. Historically, product management has not been a major theme in enterprise IT management. Digital changes this.

IT systems started by serving narrow purposes, often “back office” functions such as accounting or materials planning. Mostly, such systems were managed as projects assembled on a temporary basis, resulting in the creation of a system to be “thrown over the wall” to operations. Product management, on the other hand, is concerned with the entire lifecycle. The product manager (or owner, in Scrum terms) cares about the vision, its execution, the market reaction to the vision (even if an internal market), the health, care, and feeding of the product, and the product’s eventual sunset or replacement.

In the enterprise IT world, “third party” vendors (e.g., IBM) providing the back-office systems had product management approaches, but these were external to the IT operations. Nor were IT-based product companies as numerous 40 years ago as they are today; as noted in Chapter 1, with digital transformation, the digital component of modern products continues to increase to the point where it’s often not clear whether a product is “IT” or not.

Figure 64. Product design session

Reacting to market feedback and adapting product direction is an essential role of the product owner. In the older model, feedback was often unwelcome, as the project manager typically was committed to the open-loop dead reckoning of the project plan and changing scope or direction was seen as a failure, more often than not.

Now, it’s accepted that systems evolve, perhaps in unexpected directions. Rapidly testing, failing fast, learning, and pivoting direction are all part of the lexicon, at least for market-facing IT-based products. And even back-office IT systems with better understood scope are being managed more as systems (or products) with lifecycles, as opposed to transient projects. (See the Amazon discussion, below).

So, what is product management and what does it mean for your team? [27]

4.2.2. Defining product management

In order to define product management, we first need to define the product. In Chapter 1, we established that products are goods, services, or some combination, with some feature that provides value to some consumer. One source defines it thus:

[A Product is] A good, idea, method, information, object, or service created as a result of a process and serves a need or satisfies a want. It has a combination of tangible and intangible attributes (benefits, features, functions, uses) that a seller offers a buyer for purchase. For example, a seller of a toothbrush offers the physical product and also the idea that the consumer will be improving the health of their teeth. A good or service [must] closely meet the requirements of a particular market and yield enough profit to justify its continued existence.

Product management, according to the same source, is:

The organizational structure within a business that manages the development, marketing, and sale of a product or set of products throughout the product lifecycle. It encompasses the broad set of activities required to get the product to market and to support it thereafter.

Product management in the general sense often reports to the Chief Marketing Officer (CMO). It represents the fundamental strategy of the firm, in terms of its value proposition and viability. The product needs to reflect the enterprise’s strategy for creating and maintaining customers.

Product strategy for internally-facing products is usually not defined by the enterprise CMO. If it is a back-office product, then “business within a business” thinking may be appropriate. (Even the payroll system run by IT for HR is a “product,” in this view). In such cases, there still is a need for someone to function as an “internal CMO” to the external “customers.”

As a field, product management has a professional association, the Product Development and Management Association, which publishes an extensive and continuously refined handbook, and supports local chapters, training and certification, and other activities typical of a mature professional organization.

With digital transformation, all kinds of products have increasing amounts of “IT” in them. This means that an understanding of IT, and ready access to any needed IT specialty skills, is increasingly important to the general field of product management. Product management includes research and development, which means that there is considerable uncertainty. This is of course also true of IT systems development.

Perhaps the most important aspect of product design is focusing on the user, and what they need. The concept of outcome is key. This is easier said than done. The general problem area is considered Marketing, a core business school topic. Entire books have been written about the various tools and techniques for doing this, from focus groups to ethnographic analysis.

However, Marty Cagan recommends distinguishing Product Management from Product Marketing. He defines the two as follows:

The product manager is responsible for defining—in detail—the product to be built and validating that product with real customers and users. The product marketing person is responsible for telling the world about that product, including positioning, messaging and pricing, managing the product launch, providing tools for the sales channel to market and sell the product, and for leading key programs such as online marketing and influencer marketing programs [50 pp. 10-11].

We discuss some criticisms of overly marketing-driven approaches below.

4.2.3. Process, project, and product management

In the remainder of this book, we will continually encounter three major topics:

  • Product Management (this chapter)

  • Process Management (covered in Chapter 7)

  • Project Management (covered in Chapters 7 and 8)

They have an important commonality: all of them are concepts for driving results across organizations. Here are some of the key differences between process, project, and product management in the context of digital services and systems:

Table 4. Process, project, and product management
| Process | Project | Product |
|---|---|---|
| | Deliverable oriented | Outcome oriented |
| Repeatable with a high degree of certainty | Executable with a medium degree of certainty | Significant component of research and development, less certain of outcome; empirical approaches required |
| Fixed time duration, relatively brief (weeks/months) | Limited time duration, often scoped to a year or less | No specific time duration; lasts as long as there is a need |
| Fixed in form, no changes usually tolerated | Difficult to change scope or direction, unless specifically set up to accommodate | Must accommodate market feedback and directional change |
| Used to deliver service value and operate system (the “Ops” in DevOps) | Often concerned with system design and construction, but typically not with operation (the “Dev” in DevOps) | Includes service concept and system design, construction, operations, and retirement (both “Dev” and “Ops”) |
| Process owners are concerned with adherence and continuous improvement of the process; otherwise can be narrow in perspective | Project managers are trained in resource and timeline management, dependencies & scheduling; they are not typically incented to adopt a long-term perspective | Product managers need to have project management skills as well as understanding market dynamics, feedback, building long-term organizational capability |
| Resource availability and fungibility is assumed | Resources are specifically planned for, but their commitment is temporary (team is “brought to the work”) | Resources are assigned long-term to the product (work is “brought to the team”) |

The above distinctions are deliberately exaggerated, and there are of course exceptions (short projects, processes that take years). However, it is in the friction between these perspectives that we see some of the major problems in modern IT management. For example, an activity which may be a one-time task or a repeatable process results in some work product, perhaps an artifact (see Activities create work products).

Figure 65. Activities create work products

The consumer or stakeholder of that work product might be a Project Manager.

Project management includes concern for both the activities and the resources (people, assets, software) required to produce some deliverable (see Projects create deliverables with resources and activities).

Figure 66. Projects create deliverables with resources and activities

The consumer of that deliverable might be a product manager. Product management includes concern for projects and their deliverables, and their ultimate outcomes, either in the external market or internally (see Product management may use projects).

Figure 67. Product management may use projects

Notice that product management may directly access activities and resources. In fact, earlier-stage companies often do not formalize project management (see Product management sometimes does not use projects).

Figure 68. Product management sometimes does not use projects

In our scenario, you are now on a tight-knit, collaborative team. You should think in terms of developing and sustaining a product. However, projects still exist, and sometimes you may find yourself on a team that is funded and operated on that basis. You also will encounter the concept of “process” even on a single team; more on that in Chapter 5. We will go further into projects and process management in Part III.

4.2.4. Productization as a strategy at Amazon

Amazon (the online retailer) is an important influence in the modern trend towards product-centric IT management. First, the founder Jeff Bezos mandated that all software development should be service-oriented. That means that some form of standard Application Programming Interface (API) was required for all applications to communicate with each other. By some accounts, Bezos threatened to fire anyone who did not do this. Second, all teams are to assume that the functionality being built might at some point be offered to external customers [165].

Figure 69. Two pizzas, one team

Third, a widely reported practice at Amazon is the limitation of product teams to between five and seven people, the number that can be fed by “two pizzas” (depending on how hungry they are) [106] (see Two pizzas, one team [28]). It has long been recognized in software and IT management that larger teams do not necessarily result in higher productivity. The best-known statement of this is “Brooks' Law” from The Mythical Man-Month, that “adding people to a late project will make it later” [36].

Fred Brooks' The Mythical Man-Month, derived in part from his experiences leading the IBM OS/360 project, is one of the timeless classics in software engineering and IT management writing. Serious IT professionals, whether or not they are actually programmers, should have it on their bookshelves.

The reasons for “Brooks' Law” have been studied and analyzed (see e.g., [176, 59]) but in general, it is due to the increased communication overhead of expanded teams. Product design work (of which software development is one form) is creative and highly dependent on tacit knowledge, interpersonal interactions, organizational culture, and other “soft” factors. Products, especially those with a significant IT component, can be understood as socio-technical systems, often complex. This means that small changes to their components or interactions can have major effects on their overall behavior and value.
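One common way to illustrate this communication overhead is to count the possible pairwise channels in a team, which grows as n(n-1)/2 for n people. The short sketch below assumes, as a simplification, that every pair of team members may need to coordinate; the function name is hypothetical.

```python
def communication_channels(team_size):
    """Number of distinct pairs in a team of `team_size` people."""
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    for size in (5, 7, 14, 50):
        print(size, communication_channels(size))
```

A two-pizza team of 7 has 21 possible channels. Doubling it to 14 people gives 91, more than four times as many, and a 50-person group has 1,225.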

This, in turn, means that newcomers to a product development organization can have a profound impact on the product. Getting them “up to speed” with the culture, mental models, and tacit assumptions of the existing team can be challenging and rarely is simple. And the bigger the team, the bigger the problem. The net result of these two practices at Amazon (and now General Electric and many other companies) is the creation of multiple nimble services that are decoupled from each other, constructed and supported by teams appropriately sized for optimal high-value interactions.

4.3. Organizing the product team

Figure 70. Psychological safety supports collaboration

As mentioned at the chapter outset, you are a team now. Your founder and co-founder have found enough interest to sustain a larger organization.

How are you going to organize? How are you going to work? Events move quickly, and you don’t have much time to think about these things. But getting things right at the team level is essential as your organization scales up. Bad habits (like accepting too much work in the system, or tolerating toxic individuals) will be more and more difficult to overcome as you grow.

4.3.1. The concept of collaboration

Individuals and interactions over processes and tools.

The most efficient and effective method of conveying information to and within a development team is a face-to-face conversation.
— Agile Manifesto

We will discuss culture in more depth in future chapters. But this chapter is the first discussion of “how are we with each other.” Culture requires attention at the earliest stages, as it can be very difficult to change later.

Team collaboration is one of the key values of Agile. The Agile Alliance states that:

A “team” in the Agile sense is a small group of people, assigned to the same project or effort, nearly all of them on a full-time basis.

Teams are multi-skilled, share accountability, and individuals on the team may play multiple roles [7].

Face-to-face interactions, usually enabled by giving the team its own space, are seen as essential for collaboration. While there are various approaches to Agile, all concur that tight-knit, collaborative teams deliver the highest value outcomes. However, collaboration does not happen just because people are fed pizzas and work in a room together. Google has established that the most significant predictor of team performance is a sense of psychological safety (see sidebar). Research by Anita Woolley and colleagues suggests that three factors drive team performance [287]:

  • Equal contribution to team discussions (no dominant individuals)

  • Emotional awareness — being able to infer other team members' emotional states

  • Proportion of women on the team; teams with more women tend to perform better (the researchers inferred this was due to women generally having higher emotional awareness)

Other research shows that diverse teams and organizations are more innovative and deliver better results; such teams may tend to focus more on facts (as opposed to groupthink) [225]. Certainly, a sense of psychological safety (Psychological safety supports collaboration [29]) is critical to the success of diverse teams, who may come from different cultures and backgrounds that don’t inherently trust each other.

The collective problem-solving talent of a diverse group of individuals who are given space to self-organize and solve problems creatively is immense, and very possibly the highest value resource known to the modern organization.

Google’s Project Aristotle

Around 2012, Google became interested in answering the question:

What makes a Google team effective?

Based on 200+ interviews across 180+ teams, they determined that “Who is on a team matters less than how the team members interact, structure their work, and view their contributions.”

They identified five “key dynamics”:

  1. Psychological safety: team members feel safe to take risks with each other

  2. Dependability: team members can be counted on

  3. Structure and clarity: roles, plans, and goals are clear

  4. Meaning of work: work is personally important

  5. Impact of work: the work matters

Of the five, psychological safety was the most significant. Teams that cultivate this enable collaboration and creativity, which lead to product value and improved organizational performance [232].

We turn to two current schools of thought with much to say about collaboration: Lean UX and Scrum.

4.3.2. Lean UX

Lean UX is the practice of bringing the true nature of a product to light faster, in a collaborative, cross-functional way that reduces the emphasis on thorough documentation while increasing the focus on building a shared understanding of the actual product experience being designed.
— Jeff Gothelf
Lean UX

Lean UX is a term coined by author and consultant Jeff Gothelf [111], which draws on three major influences:

  • Design thinking

  • Agile software development

  • Lean Startup

We briefly discussed Lean Startup in Chapter 1, and the history and motivations for Agile software development in Chapter 3. We’ll look in more depth at product discovery techniques, and design and design thinking in the next chapter section. However, Lean UX has much to say about forming the product team, suggesting (among others) the following principles for forming and sustaining teams:

  • Dedicated, cross-functional teams

  • Outcome (not deliverable/output) focus

  • Cultivating a sense of shared understanding

  • Avoiding toxic individuals (so-called “rockstars, gurus, and ninjas”)

  • Permission to fail

(Other Lean UX principles such as small batch sizes and visualizing work will be discussed elsewhere; there is significant overlap between Lean UX and other schools of thought covered in this book).

Lean UX is an influential work among digital firms and summarizes modern development practices well, especially for small, team-based organizations with minimal external dependencies. It is a broad and conceptual, principles-based framework open for interpretation in multiple ways. We continue with more “prescriptive” methods and techniques, such as Scrum.

4.3.3. Scrum

Scrum is a lightweight framework designed to help small, close-knit teams of people develop complex products.
— Chris Sims/Hillary L. Johnson
Scrum: A Breathtakingly Brief and Agile Introduction
There Are No Tasks; There Are Only Stories.
— Jeff Sutherland
Scrum: The Art of Doing Twice the Work in Half the Time

One of the first prescriptive Agile methodologies you are likely to encounter as a practitioner is Scrum. There are many books, classes, and websites where you can learn more about this framework; [252] is a good brief introduction, and [233] is well suited for more in-depth study.

“Prescriptive” means detailed and precise. A doctor’s prescription is specific as to what medicine to take, how much, and when. A prescriptive method is similarly specific. “Agile software development” is not prescriptive, as currently published by the Agile Alliance; it is a collection of principles and ideas you may or may not choose to use.

By comparison, Scrum is prescriptive; it states roles and activities specifically, and trainers and practitioners, in general, seek to follow the method completely and accurately.

Scrum is appropriate to this chapter, as it is product-focused. It calls for the roles of:

  • Product owner

  • Scrum master

  • Team member

and avoids further elaboration of roles.

The Scrum product owner is responsible for holding the product vision and seeing that the team executes the highest value work. As such, the potential features of the product are maintained in a “backlog” that can be re-prioritized as necessary (rather than a large, fixed-scope project). The product owner also defines acceptance criteria for the backlog items. The Scrum master, on the other hand, acts as a team coach, “guiding the team to ever-higher levels of cohesiveness, self-organization, and performance” [252]. To quote Roman Pichler:

The product owner and Scrum master roles complement each other: The product owner is primarily responsible for the “what”—creating the right product. The Scrum master is primarily responsible for the “how”—using Scrum the right way [211 p. 9].

Scrum uses specific practices and artifacts such as sprints, standups, reviews, the above-mentioned concept of backlog, burndown charts, and so forth. We will discuss some of these further in Chapter 5 (Work Management) and Chapter 7 (Coordination) along with Kanban, another popular approach for executing work.

In Scrum, there are three roles:

  • The product owner sets the overall direction

  • The Scrum Master coaches and advocates for the team

  • The development team is defined as those who are committed to the development work

There are seven activities:

  • The “Sprint” is a defined time period, typically two to four weeks, in which the development team executes on an agreed scope

  • Backlog Grooming is when the product backlog is examined and refined into increments that can be moved into the sprint backlog

  • Sprint Planning is where the scope is agreed

  • The Daily Scrum is traditionally held standing up, to maintain focus and ensure brevity

  • Sprint Execution is the development activity within the sprint

  • Sprint Review is the “public end of the sprint” when the stakeholders are invited to view the completed work

  • The Sprint Retrospective is held to identify lessons learned from the sprint and how to apply them in future work

There are a number of artifacts:

  • The product backlog is the overall “to-do” list for the product.

  • The sprint backlog is the to-do list for the current sprint.

  • The Potentially Shippable Product Increment (PSI) is an important concept used to decouple the team’s development activity from downstream business planning. A PSI is a cohesive unit of functionality that could be delivered to the customer, but whether to deliver it is the decision of the product owner.

Scrum is well grounded in various theories (process control, human factors), although Scrum team members do not need to understand theory to succeed with it. Like Lean UX, Scrum emphasizes high-bandwidth collaboration, dedicated multi-skilled teams, a product focus, and so forth.

The concept of having an empowered product owner readily available to the team is attractive, especially for digital professionals who may have worked on teams where the direction was unclear. Roman Pichler identifies a number of common mistakes, however, that diminish the value of this approach [211 pp. 17-20]:

  • Product owner lacks authority

  • Product owner is overworked

  • Product ownership is split across individuals

  • Product owner is “distant” — not co-located or readily available to team

Sidebar: Scrum and shu-ha-ri

In the Japanese martial art of aikido, there is the concept of shu-ha-ri, a form of learning progression.

  • Shu: The student follows the rules of a given method precisely, without addition or alteration

  • Ha: The student learns the theory and principles of the technique

  • Ri: The student creates their own approaches and adapts technique to circumstance

Scrum at its most prescriptive can be seen as a shu-level practice; it gives detailed guidance that has been shown to work.

(See [99] and [64 pp. 17-18])

4.3.4. More on product team roles

Boundaries are provided by the product owner and often come in the form of constraints, such as: "I need it by June", "We need to reduce the per-unit cost by half", "It needs to run at twice the speed", or "It can use only half the memory of the current version".
— Mike Cohn
Succeeding with Agile Software Development Using Scrum

Marty Cagan suggests that the product team has three primary concerns, requiring three critical roles [50]:

  • Value: Product Owner/Manager

  • Feasibility: Engineering

  • Usability: User Experience Design

Jeff Patton represents these concepts as a Venn diagram (see The three views of the product team [30]).

venn diagram
Figure 71. The three views of the product team

Finally, a word on the product manager. Scrum is prescriptive around the product owner role, but does not identify a role for product manager. This can lead to two people performing product management: a marketing-aligned “manager” responsible for high-level requirements, with the Scrum “product owner” attempting to translate them for the team. Marty Cagan warns against this approach, recommending instead that the product manager and owner be the same person, separate from marketing [50 pp. 7-8].

In the next chapter section, we will consider the challenge of product discovery: at a product level, what practices do we follow to generate the creative insights that will result in customer value?

4.4. Product discovery

What is a product manager? A product manager is the one person in the whole organization who owns the product requirements effort. Requirements focus on the WHAT, which means it isn’t Development, which focuses on the HOW. And Marketing traditionally talks about the WHAT, but cannot necessarily decide what the WHAT should be. At least not at any useful level of detail [194].
— Jacques Murphy
Pragmatic Marketing

Now that we have discussed the overall concept of product management and why it is important, and how product teams are formed, we can turn more specifically to the topic of product discovery and design. We have previously discussed the overall digital business context, as a startup founder might think of the problem. But the process of discovery continues as the product idea is refined, new business problems are identified, and solutions (such as specific feature ideas) are designed and tested for outcomes.

In this book, we favor the idea that products are “discovered” as much or more than they are “designed.” But you will see both terms used throughout this chapter. See the parable the Flower and the Cog for an illustration of the difference.

The presence of a section entitled “product discovery” in this book is a departure from other IT management textbooks. “Traditional” models of IT delivery focus on projects and deliverables, concepts we touched on in the last chapter section but that we will not explore in depth until later in the book. However, the idea of “product discovery” within the large company is receiving more and more attention. Even large companies benefit when products are developed with tight-knit teams using fast feedback.

The term “intrapreneurship,” credited to Gifford Pinchot, means “entrepreneurship inside a large company.”

For our discussion here, the challenge with the ideas of projects and deliverables is that they represent approaches that are more open loop, or at least delayed in feedback. Design processes do not perform well when feedback is delayed. System intent, captured as a user story or requirement, is only a hypothesis until tested via implementation and user confirmation.

4.4.1. Formalizing product discovery

Case study: Amazon shopping cart recommendations:

A well-known story of the power of experimentation is told by Greg Linden, who was a product developer for early versions of the Amazon shopping cart. Linden had an idea of making recommendations to people based on what was already in their shopping cart. (While this is common across e-commerce sites now, at one point it was a new idea.) While grocery stores “recommend” impulse purchases (candy, gum) in the checkout lane, an e-commerce provider can recommend anything in the store, so the idea is even more powerful. Linden developed a prototype, and while it got some favorable reactions, one Senior Vice President (SVP) was against it; his view was that it might distract people and lead them to abandon the cart.

As Linden says, “I was forbidden to work on this any further.” But he went ahead and prepared the feature anyway. The SVP was furious, but Amazon already had a data-driven culture, and even senior executives couldn’t block tests. The feature was then pushed out to a small set of Amazon customers. In this way, they could compare the behavior of customers who did receive shopping cart recommendations to those who didn’t (otherwise known as a controlled experiment). The results were dramatic: the feature outperformed the control of not having it by such a large margin that, as Linden says, “not having it live was costing Amazon a notable chunk of change.”

It’s unknown what happened to the SVP. Challenging senior executives can be bad for your career, but if you find yourself in a place run by HiPPOs who don’t want to experiment, you might want to consider how long that organization will be in business [170].

It is humbling to see how bad experts are at estimating the value of features (us included).
— Ronny Kohavi
Online Experimentation at Microsoft

In Chapter 3, we needed to consider the means for describing system intent. Even as a bare-bones startup, some formalization of this starts to emerge, at the very least in the form of test-driven development (see Product discovery tacit).

Figure 72. Product discovery tacit

But, the assumption in our emergence model is that more formalized product management emerges with the formation of a team. As a team, we now need to expand “upstream” of the core delivery pipeline, so that we can collaborate and discover more effectively. Notice the grey box in Product discovery explicit.

pipeline w product discovery
Figure 73. Product discovery explicit

The most important foundation for your newly formalized product discovery capability is that it must be empirical and hypothesis-driven. Too often, product strategy is based on the Highest Paid Person’s Opinion (HiPPO) (see Beware of HiPPO-based product discovery [31]).

Figure 74. Beware of HiPPO-based product discovery

The problem with relying on “gut feeling” or personal opinions is that people, regardless of experience or seniority, perform poorly in assessing the likely outcome of their product ideas. Some well-known research on this topic was conducted by Microsoft’s Ronny Kohavi. In this research, Kohavi and team determined that “only about 1/3 of ideas improve the metrics they were designed to improve” [160]. As background, the same report cites that:

  • “Netflix considers 90% of what they try to be wrong”

  • “75% of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance” according to Qualpro (a consultancy specializing in controlled experiments)

It is, therefore, critical to establish a strong practice of data-driven experimentation when forming a product team and avoid any cultural acceptance of “gut feel” or deferring to HiPPOs. This can be a difficult transition for the company founder, who has until now served as the de facto product manager.

A useful framework, similar to Lean Startup, is proposed by Spotify in the “DIBB” model:

  • Data

  • Insight

  • Belief

  • Bet

Data leads to insight, which leads to a belief: a hypothesis that can be tested. Testing it is a “bet” because testing hypotheses is not free. We discuss issues of prioritization further in Chapter 5, in the section on cost of delay.

Don Reinertsen (whom we will read more about in the next chapter) emphasizes that such experimentation is inherently variable. We can’t develop experiments with any sort of expectation that they will always succeed. We might run 50 experiments, and only have 2 succeed. But if the cost of each experiment is $10,000, and the two that succeeded earned us $1 million each, we gained:

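  $ 2,000,000   earned by the 2 successful experiments (2 × $1,000,000)
- $   500,000   spent on all 50 experiments (50 × $10,000)
= $ 1,500,000   net gain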
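The same arithmetic can be written as a short Python sketch, which makes it easy to vary the inputs. The function name `portfolio_net_gain` and its parameters are illustrative assumptions and do not come from Reinertsen; the figures are the ones used in the text.

```python
def portfolio_net_gain(num_experiments, cost_per_experiment,
                       num_successes, payoff_per_success):
    """Net gain of a set of experiments: total payoff minus total cost.

    Every experiment costs money, but only the successful ones pay off.
    """
    total_cost = num_experiments * cost_per_experiment
    total_payoff = num_successes * payoff_per_success
    return total_payoff - total_cost


# Figures from the text: 50 experiments at $10,000 each, 2 successes at $1 million each.
net = portfolio_net_gain(50, 10_000, 2, 1_000_000)
roi = net / (50 * 10_000)  # net gain relative to the total spent on experiments

print(f"Net gain: ${net:,}")               # Net gain: $1,500,000
print(f"Return on investment: {roi:.0%}")  # Return on investment: 300%
```

Under these assumptions, even a single success (`num_successes=1`) would still leave a net gain of $500,000, which is why a low success rate can be acceptable when experiments are cheap relative to the payoff.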

Not a bad return on investment! (See [221], Chapter 4, for a detailed, mathematical discussion, based on options and information theory). Roman Pichler, in Agile Product Management with Scrum, describes “old-school” versus “new-school” product management as in Old school versus new school product management (summarized from [211], p. xii).

Table 5. Old school versus new school product management
Old school                                  | New school
Shared responsibility                       | Single product owner
Detached/distant product management         | Product owner on the Scrum team
Extensive up-front research                 | Minimal up-front work to define rough vision
Requirements frozen early                   | Dynamic backlog
Late feedback due to lengthy release cycle  | Early & frequent releases drive fast feedback, resulting in customer value

4.4.2. Product discovery techniques

There are a wide variety of techniques and even “schools” of product discovery and design; we will consider a few representatives in this chapter section. Of course, when you first started your journey in Chapter 1, you might also have used some of these techniques. But now that you are a team, you are formalizing and relying on these techniques. These techniques are not mutually exclusive; they may be complementary. But at the more detailed, digital product level, how do we develop hypotheses for testing, in terms of our products/services? We briefly mentioned User Story Mapping in our discussion of system intent. In product discovery terms, User Story Mapping is a form of persona analysis. But that is only one of many techniques. Roman Pichler mentions “Vision Box and Trade Journal Review” and the “Kano Model” [211 p. 39]. Here, let’s discuss:

  • “Jobs to be Done” analysis

  • Impact mapping

  • Business analysis & architecture

Jobs to be Done

Customers don’t want a quarter-inch drill. They want a quarter-inch hole.
— Theodore Levitt
If I’d asked the customer what they wanted, they would have said “faster horses.”
— Henry Ford

The “Jobs to be Done” framework was created by noted Harvard professor Clayton Christensen, in part as a reaction against conventional marketing techniques that:

"frame customers by attributes—using age ranges, race, marital status, and other categories that ultimately create products and entire categories too focused on what companies want to sell, rather than on what customers actually need" [62].

“Some products are better defined by the job they do than the customers they serve,” in other words [268]. This is in contrast to many kinds of business and requirements analysis that focus on identifying different user personas (e.g., a married Black woman aged 45-55 with children in the house). Jobs to be Done advocates argue that “The job, not the customer, is the fundamental unit of analysis” and that customers “hire” products to do a certain job [60].

To apply the Jobs to be Done approach, Des Traynor suggests filling in the blanks in the following [268]:

Why do people hire your product?

People hire your product to do the job of ________ every ________ when ________. The other applicants for this job are ________, ________, and ________, but your product will always get the job because of ________.

Understanding the alternatives people have is key. It’s possible that the job can be fulfilled in multiple different ways. For example, people may want certain software run. This job can be undertaken through owning a computer (e.g., having a data center). It can also be managed by hiring someone else’s computer (e.g., using a cloud provider). Not being attentive and creative in thinking about the diverse ways jobs can be done places you at risk for disruption.

Impact mapping

Understanding the relationship of a given feature or component to business objectives is critical. Too often, technologists (e.g., software professionals) are accused of wanting “technology for technology’s sake.”

Showing the “line of sight” from technology to a business objective is, therefore, critical. Ideally, this starts by identifying the business objective. Gojko Adzic’s Impact Mapping: Making a big impact with software products and projects [5] describes a technique for doing so:

An impact map is a visualization of scope and underlying assumptions, created collaboratively by senior technical and business people.

Starting with some general goal or hypothesis (e.g., generated through Lean Startup thinking), you build a “map” of how the goal can be achieved, or how the hypothesis can be measured. A simple graphical approach can be used, as in Impact map.

impact map
Figure 75. Impact map
Impact mapping is similar to mind mapping, and some drawing tools such as Microsoft Visio come with “Mind Mapping” templates.

The most important part of the impact map is to answer the question “Why are we doing this?” The impact map is intended to help keep the team focused on the most important objectives, and avoid less valuable activities and investments.

For example, in the above diagram, we see that a bank may have an overall business goal of customer retention. (It is much more expensive to gain a new customer than to retain an existing one, and retention is a metric carefully measured and tracked at the highest levels of the business).

Through focus groups and surveys, the bank may determine that staying current with online services is important to retaining customers. Some of these services are accessed by home PCs, but increasingly customers want access via mobile devices.

These business drivers lead to the decision to invest in online banking applications for both the Apple and Android mobile platforms. This decision, in turn, will lead to further discovery, analysis, and design of the mobile applications.
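The bank example can also be captured as a small tree, which is one way to make each “line of sight” explicit. The following Python sketch is an illustration only: the `Node` class, the `lines_of_sight` function, and the Goal/Actor/Impact/Deliverable labels are assumptions made here (the labels follow the levels commonly used in impact mapping), and the wording of each box is paraphrased from the example above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One box on the impact map; children are the boxes it leads to."""
    label: str
    children: List["Node"] = field(default_factory=list)


def lines_of_sight(node: Node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield each root-to-leaf path, i.e. one line of sight from goal to deliverable."""
    path = path + (node.label,)
    if not node.children:
        yield path
    for child in node.children:
        yield from lines_of_sight(child, path)


# Hypothetical encoding of the bank example from the text.
impact_map = Node("Goal: customer retention", [
    Node("Actor: existing bank customers", [
        Node("Impact: stay current with online services on mobile devices", [
            Node("Deliverable: online banking app for Apple"),
            Node("Deliverable: online banking app for Android"),
        ]),
    ]),
])

for path in lines_of_sight(impact_map):
    print(" -> ".join(path))
```

Each printed path answers “Why are we doing this?” for one deliverable. A proposed feature that cannot be attached anywhere on the tree is a candidate for the less valuable work the map is meant to filter out.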

The Business Analysis Body of Knowledge

One well-established method for product discovery is that of business analysis, formalized in the Business Analysis Body of Knowledge (BABOK), from the International Institute of Business Analysis [134].

The Business Analysis Body of Knowledge (BABOK) defines business analysis as (p. 442):

The practice of enabling change in the context of an enterprise by defining needs and recommending solutions that deliver value to stakeholders.

BABOK is centrally concerned with the concept of requirements, and classifies them thus:

  • Business requirements

  • Stakeholder requirements

  • Solution requirements

    • Functional requirements

    • Non-functional requirements

  • Transition requirements

BABOK also provides a framework for understanding and managing the work of business analysts; in general, it assumes that a Business Analyst capability will be established and that maturing such a capability is a desirable thing. This may run counter to the Scrum ideal of cross-functional, multi-skilled teams. Also, as noted above, the term “requirements” has fallen out of favor with some Agile thought leaders.

4.5. Product design

Design sign
Figure 76. Design
Everyone designs who devises courses of action aimed at changing existing situations into preferred ones [251].
— Herbert Simon
The art of making useful things beautiful and beautiful things useful.
— unknown

Once we have discovered at least a direction for the product’s value proposition, and have started to understand and prioritize the functions it must perform, we begin the activity of design. Design, like most other topics in this book, is a broad and complex area with varying definitions and schools of thought. The Herbert Simon quote at the beginning of this section is frequently cited.

Design is an ongoing theme throughout the humanities, encountered in architecture (the non-IT variety), art, graphics, fashion, and commerce. It can be narrowly focused, such as the question of what color scheme to use on an app or web page. Or it can be much more expansive, as suggested by the field of design thinking. We’ll start with the expansive vision and drill down into a few interesting topics [32].

4.5.1. Design thinking

Design thinking is essentially a human-centered innovation process that emphasizes observation, collaboration, fast learning, visualization of ideas, rapid concept prototyping, and concurrent business analysis, which ultimately influences innovation and business strategy. [172]
— Thomas Lockwood
Design Thinking

Design thinking is a recent trend with various definitions, but in general it combines a design sensibility with problem solving at significant scale. It usually is understood to include a significant component of systems thinking. As Tom Fisher, author of Designing Our Way to a Better World [91], notes:

We’ve been doing a lot of work in this area of “design thinking,” which takes the thought process and the methods that have been developed for millennia around the design of physical things — products, buildings, cities — and applies that to the so-called invisible world of design, which is all of the systems and organizations that are designed, but we don’t think of them as being designed. And we’re seeing a lot of these systems not working very well. [210]

Design thinking is the logical evolution of disciplines such as user interface design when such designs encounter constraints and issues beyond their usual span of concern. Although it has been influential on Lean UX and related works, it is not an explicitly digital discipline.

There are many design failures in digital product delivery. What is often overlooked is that the entire customer experience of the product is a form of design.

Consider for example Apple. Their products are admired worldwide and cited as examples of “good design.” Often, however, this is only understood in terms of the physical product, for example, an iPhone or a MacBook Air. But there is more to the experience. Suppose you have technical difficulties with your iPhone, or you just want to get more value out of it. Apple created its popular Genius Bar support service (see Apple Genius Bar [33]), where you can get support and instruction in using the technology.

Genius Bar
Figure 77. Apple Genius Bar

Notice that the product you are using is no longer just the phone or computer. It is the combination of the device PLUS your support experience. This is essential to understanding the modern practices of design thinking and Lean UX. As Jeff Sussna, author of Designing Delivery, notes, “In order to provide high-quality, digitally infused service, the entire delivery organization must function as an integrated whole” [262 p. 18].

4.5.2. Hypothesis testing

The concept of hypothesis testing is key to product discovery and design. The power of scalable cloud architectures and fast continuous delivery pipelines has made it possible to test product hypotheses against real-world customers at scale and in real time. Companies like Netflix and Facebook have pioneered techniques like “canary deployments” and “A/B testing.”

In these approaches, two different features are tried out simultaneously, and the business results are measured. For example, are customers more likely to click on a green button or a yellow one? Testing such questions in the era of packaged software would have required lab-based usability engineering approaches, which risked being invalid because of their small sample size. Testing against larger numbers is possible, now that software is increasingly delivered as a service.
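To see why sample size matters, consider one common way of evaluating such a test: a two-proportion z-test on click-through rates. This is a minimal sketch, not a method prescribed by the sources cited here. The function name, the button counts, and the choice of statistical test are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Compare two click-through rates; return (z statistic, two-sided p-value)."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: green button (A) at 2.0% versus yellow button (B) at 2.6%.
z_small, p_small = two_proportion_z_test(20, 1_000, 26, 1_000)      # lab-sized sample
z_large, p_large = two_proportion_z_test(200, 10_000, 260, 10_000)  # service-sized sample

print(f"1,000 visitors per variant:  z = {z_small:.2f}, p = {p_small:.3f}")
print(f"10,000 visitors per variant: z = {z_large:.2f}, p = {p_large:.3f}")
```

With identical click-through rates in both runs, the smaller sample cannot distinguish the difference from chance at the conventional 5% threshold, while the larger sample can. This is the practical advantage of testing against production traffic.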

4.5.3. Usability and interaction

At a lower level than the holistic concerns of design thinking, we have practices such as usability engineering. These take many forms. There are many systematic and well-researched approaches to usability, interaction, and related topics. All such approaches, however, should be used in the overall Lean Startup/Lean UX framework of hypothesis generation and testing. If we subscribe to design thinking and take a whole-systems view, designing for ease of operations is also part of the design process. We will discuss this further in Chapter 6. Developing documentation of the product’s characteristics, from the perspective of those who will run it on a day-to-day basis, is also an aspect of product delivery.

4.5.4. Parable: The Flower and the Cog


Hello. Where did you come from?

I fell. From that machine.


Yes, that big loud thing that just passed by. And is now stopped over there.

Why is it stopped?

Because I am no longer with it. The machine needs me to function. I am called a “cog.” Where did you come from?

I am a flower. I grew from a seed.

You …​ grew?


You mean, no-one planned or designed you?

Not that I know of. What does it mean to be “designed” or “planned"?

I am part of a greater whole. The need for me was understood when that greater whole was conceived. I was designed to fit a very particular place.

They had to try making me out of different metals, and different ways to make me. This took some time and effort (longer than was planned, in fact). But it was always understood that there would need to be a cog in a certain place in the machine.

Interesting. So you will never be more than you are?

No. I will always be a cog. They might make a different machine, with different cogs, but they will not be me. Are you part of a machine?

No. I grew here because it suited me. I have continued to grow for a couple years. Eventually, I may grow 20 feet tall if the conditions remain good. I can adapt to other plants, and find my way around them to the sunlight and the water I need. Or I may stay smaller if I can’t get the sunlight I need. Or I may die.

Aren’t you part of a system that defines your purpose?

I don’t know. Sometimes I think I am a system myself, made up of my roots, stem, leaves, and flower. There are insects living on me who rely on me for food and shelter. And I have the freedom to grow into one of the largest trees in this area. That is worth it to me.

Interesting. Well, it is good you are growing where you are, and not 20 feet further in that direction.


Because when they find me, or replace me and fix the machine, it will continue to clear all the land over there.


VOICES: “Hey Joe, here’s that gear the tractor must have thrown.”

“Good, grab it, and I’ll see if I can’t get it back in place at least temporarily until we can figure out why it happened.”


Goodbye. Nice talking to you. Good luck.

Thanks. You too.

4.5.5. Product discovery versus design

Some of the most contentious discussions related to IT management and Agile come at the intersection of software and systems engineering, especially when large investments are at stake. We call this the “discovery versus design” problem.

Frequent criticisms of Lean Startup and its related digital practices are:

  • They are relevant only for non-critical Internet-based products (e.g., Facebook and Netflix)

  • Some IT products must fit much tighter specifications and do not have the freedom to “pivot” (e.g., control software written for aerospace and defense systems)

The parable in the previous section is meant to illustrate two very different product development worlds. Some product development is constrained by the overall system it takes place within. Other product development has more freedom to grow in different directions, to “discover” the customer.

The cog represents the world of classic systems engineering: a larger objective frames the effort, and the component occupies a certain defined place within it. And yet, it may still be challenging to design and build the component, which can be understood as a product in and of itself. Fast feedback is still required for the design and development process, even when the product is only a small component with a very specific set of requirements.

The flower represents the market-facing digital product that may “pivot,” grow and adapt according to conditions. It also is constrained, by available space and energy, but within certain boundaries has greater adaptability.

Neither is better than the other, but they do require different approaches. In general, we are coming from a world that viewed digital systems strictly as cogs. We are now moving towards a world in which digital systems are more flexible, dynamic, and adaptable.

And, when digital components have very well understood requirements, usually we purchase them from specialist providers (increasingly “as a service”). This results in increasing attention to the “flowers” of digital product design since acquiring the “cogs” is relatively straightforward (more on this in the Chapter 8 section on sourcing).

4.6. Product planning

4.6.1. Product roadmapping and release planning

Some companies may notice that every specification they have ever prepared in the history of their company has had to change during product development, but they attribute this to human error and lack of discipline. They deeply believe that they can produce a good specification if only they try hard enough. You may recognize this as the classic thinking of an open-loop control system [220 p. 177] …
— Don Reinertsen
Managing the Design Factory

Creating effective plans in complex situations is hard. Planning a new product is one of the most challenging endeavors, one in which failure is common. As the Don Reinertsen quote above reflects, the historically failed approach (call it the “planning fallacy”) is to develop overly detailed (sometimes called “rigorous”) plans and then assume that achieving them is simply a matter of “correct execution” (see Planning fallacy).

Figure 78. Planning fallacy

Contrast the planning fallacy with Lean Startup, which emphasizes ongoing confirmation of product direction through experimentation. In complex efforts, ongoing validation of assumptions and direction is essential, which is why overly plan-driven approaches are falling out of favor. However, some understanding of timeframes and mapping goals against the calendar is still important. Exactly how much effort to devote to such forecasting remains a controversial topic with digital product management professionals, one we will return to throughout this book.

Minimally, a high-level product roadmap is usually called for: without at least this, it may be difficult to secure the needed investment to start product work. Roman Pichler recommends the product roadmap contain:

  • Major versions

  • Their projected launch dates

  • Target customers and needs

  • Top three to five features for each version [211 p. 41].

More detailed understanding is left to the product backlog, which is subject to ongoing “grooming”, that is, re-evaluation in light of feedback.

4.6.2. Backlog, estimation, and prioritization

… companies often create complex prioritization algorithms that produce precise priorities based on very imprecise input data. I prefer the simple approach. To select between two almost equivalent choices creates great difficulty and little payoff. Instead, we should try to prevent the big, stupid mistakes. This does not require precision [221 p. 192].
— Don Reinertsen
Principles of Product Development Flow

The product discovery and roadmapping activity ultimately generates a more detailed view or list of the work to be done. As we previously mentioned, in Scrum and other Agile methods this is often termed a backlog. Both Mike Cohn and Roman Pichler use the DEEP acronym to describe backlog qualities [67 p. 243, 211 p. 48]:

  • Detailed appropriately

  • Estimated

  • Emergent (feedback such as new or changed stories are readily accepted)

  • Prioritized

Figure 79. Backlog granularity and priority

The backlog should receive ongoing “grooming” to support these qualities, which means several things:

  • Addition of new items

  • Re-prioritization of items

  • Elaboration (decomposition, estimation, and refinement)

When “detailed appropriately,” items in the backlog are not all the same scale. Scrum and Agile thinkers generally agree on the core concept of “story,” but stories vary in size (see Backlog granularity and priority [34]), with the largest stories often termed “epics.” The backlog is ordered in terms of priority (what will be done next) but, critically, it is also understood that the lower-priority items, in general, can be larger grained. In other words, if we visualize the backlog as a stack, with the highest priority on the top, the size of the stories increases as we go down. (Reinertsen terms this progressive specification; see [220 pp. 176-177] for a detailed discussion).

Estimating user stories is a standard practice in Scrum and Agile methods more generally. Agile approaches are wary of false precision and accept the fact that estimation is an uncertain practice at best. However, without some overall estimate or roadmap for when a product might be ready for use, it is unlikely that the investment will be made to create it. It is difficult to establish the economic value of developing a product feature at a particular time if you have no idea of the cost and/or effort involved to bring it to market.

At a more detailed level, it is common practice for product teams to estimate detailed stories using “points.” Mike Cohn emphasizes, “Estimate size, derive duration” [66 p. xxvii]. Points are a relative form of estimation, valid within the boundary of one team. Story point estimating strives to avoid false precision, often restricting the team’s estimate of the effort to a modified Fibonacci sequence, or even T-shirt or dog sizes [66 p. 37], as shown in Agile estimating scales [35].

Mike Cohn emphasizes that estimates are best done by the teams performing the work [66 p. 51]. We’ll discuss the mechanics of maintaining backlogs in Chapter 5, Work Management.
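
Cohn’s advice to “estimate size, derive duration” can be illustrated with a minimal sketch. The scale values, the function names snap_to_scale and sprints_needed, the rounding-up convention, and the velocity figure are all assumptions chosen for illustration; they are not taken from the cited sources.

```python
import math

# A commonly used modified Fibonacci scale (assumed here for illustration).
STORY_POINT_SCALE = [1, 2, 3, 5, 8, 13, 20, 40, 100]


def snap_to_scale(raw_estimate, scale=STORY_POINT_SCALE):
    """Round a raw relative-size guess up to the next value on the scale.

    Rounding up (rather than to the nearest value) is a conservative
    choice made for this sketch; teams may prefer other conventions.
    """
    for value in scale:
        if raw_estimate <= value:
            return value
    return scale[-1]  # anything larger is capped at the top of the scale


def sprints_needed(story_points, velocity):
    """Derive duration from size: total points / points completed per sprint."""
    if velocity <= 0:
        raise ValueError("velocity must be positive")
    return math.ceil(sum(story_points) / velocity)


backlog_guesses = [4, 1, 9, 2.5, 13]   # hypothetical raw guesses from a team
backlog_points = [snap_to_scale(g) for g in backlog_guesses]
print(backlog_points)                  # sizes expressed on the shared scale
print(sprints_needed(backlog_points, velocity=10))
```

Restricting estimates to a coarse scale keeps the team from debating whether a story is a 6 or a 7; the duration is then derived from the team’s own observed velocity rather than estimated directly.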

Table 6. Agile estimating scales
Story point   T-shirt   Dog
…             …         Border Collie
…             …         Labrador Retriever
…             …         Great Dane

Backlogs require prioritization. In order to prioritize, we must have some kind of common understanding of what we are prioritizing for. Mike Cohn, in Agile Estimating and Planning, proposes that there are four major factors in understanding product value:

  • The financial value of having the features

  • The cost of developing and supporting the features

  • The value of the learning created by developing the features

  • The amount of risk reduced by developing the features [66 p. 80]

In Chapter 5 we will discuss additional tools for managing and prioritizing work, and we will return to the topic of estimation in Chapter 8.

4.7. Conclusion

As with most other chapters in this book, this brief survey cannot do justice to such a broad field.

Product-centric thinking is at the core of delivering digital value. Without a clear understanding of the product and its context, investments of time and resources are at risk.

Product management requires close collaboration between individuals of different skills. Team structures, practices, and culture are critical factors for success. Establishing an experimental mindset is essential; research shows that product ideas must be validated, because expert opinion is of minimal value.

Product discovery and design is a broad field; design thinking is an important influence, as design spans more than just the user interface.

4.7.1. Discussion questions

  • Can you think of examples from your own experience where you or someone else confused “output” (e.g., deliverable) with “outcome”?

  • What is your experience with being on teams? Have you ever been on a team where people felt the psychological safety to take risks? Or the opposite?

  • Can you identify two or three distinct features (as defined above) for a commercial product such as Facebook or Netflix? Try to “reverse engineer” features that might have been individual areas of investment.

  • Can you think of any components (as defined above) you will need?

4.7.2. Research & practice

  • If you have team product ideas from previous discussions, consider them in terms of the Cagan/Patton questions and prepare an analysis:

    • Is it valuable?

    • Is it usable?

    • Is it feasible?

  • Identify and analyze a situation where “design” requires a broad, design-thinking understanding of both a physical product and the service interactions surrounding it. Present to your class.

  • Review the Scaled Agile Framework’s descriptions of features and user stories. Document one feature consisting of several stories for your product.

  • Research Behavior-Driven Development and consider its potential value for you.

4.7.3. Further reading





5. Work Management

5.1. Introduction

As two more assistants were added, however, coordination problems did arise. One day, Miss Bisque tripped over a pail of glaze and broke five pots; another day, Ms. Raku opened the kiln to find that the hanging planters had all been glazed fuchsia by mistake. At this point, she realized that seven people in a small pottery studio could not coordinate all their work through the simple mechanism of informal communication.
— Henry Mintzberg
Structure in Fives

You’re a team. You need to get work done. When your startup hired its first employees, and you went from two people to three or four, you had to think about this. You are now getting feedback from both your users and your product team calling for prioritization, the allocation of resources, and the tracking of effort and completion. These are the critical day-to-day questions for any business larger than a couple of co-founders:

  • What do we need to do?

  • In what order?

  • Who is doing it?

  • Do they need help?

  • Do they need direction?

  • When will they be done?

  • What do we mean by done?

People have different responsibilities and specialties, yet there is a common vision for delivering an IT-based product of some value. How do you keep track of the work? Mostly, you are in the same location, but people sometimes are off-site or keeping different hours. Beyond your product strategy, you have support calls increasingly coming in that result in fixes and new features, and it’s all getting complicated.

You already have a product owner. You now institute Scrum practices of a managed backlog, daily standups, and sprints. You may also use Kanban-style task boards or card walls (to be described in this chapter), which are essential for things like support or other interrupt-driven work. The relationship of these to your Scrum practices is a matter of ongoing debate. You don’t think you need full-blown project management, at least not yet. You also may become aware of the idea of “ticketing,” if you have people with previous service desk experience. How this relates to your Scrum/Kanban approach is an open question.

Furthermore, while we covered Agile principles and practices in some detail in Chapters 3 and 4, we did not discuss why they work. In this chapter, we will cover the Lean theory of product management that provides a basis for Agile practices, in particular, the work of Don Reinertsen.

The chapter title “Work Management” reflects earlier stages of organizational growth. At this point, neither formal project management, nor a fully realized process framework is needed, and the organization may not see a need to distinguish precisely between types of work processes. “It’s all just work” at this stage.

5.1.1. Chapter 5 outline

  • Introduction

  • Task management

  • Learning from manufacturing

    • Kanban and its Lean origins

    • The Theory of Constraints

    • Queues and limiting work in process

    • Scrum, Kanban, or both?

  • The shared mental model of the work to be done

    • Visualization

    • Andon, and the Andon Cord

    • Definition of done

    • Time and space shifting

  • The service desk and related tools

  • Lean product development and Don Reinertsen

    • Lean product development

    • Cost of delay

  • Advanced topics

    • Essay: Kanban, process, and respect for people

  • Conclusion

5.1.2. Chapter 5 learning objectives

  • Understand the basic concept of task management

  • Describe how to apply insights from manufacturing to digital product management correctly

  • Define Kanban and identify its key practices

  • Compare and contrast Scrum and Kanban

  • Understand concepts from Lean Product Management

5.2. Task management

Task management is the process of managing a task through its lifecycle. It involves planning, testing, tracking, and reporting. Task management can help individuals achieve goals, or groups of individuals collaborate and share knowledge for the accomplishment of collective goals.
— Wikipedia
Task Management

Product development drives a wide variety of work activities. As your product matures, you encounter both routine and non-routine work. Some of the work depends on other work getting done. Sometimes you do not realize this immediately. All of this work needs to be tracked.

Work management may start with verbal requests, emails, even postal mail. If you ask your colleague to do one thing, and she doesn’t have anything else to do, it’s likely that the two of you will remember. If you ask her to do four things over a few days, you might both still remember. But if you are asking for new things every day, it is likely that some things will get missed. You each might start keeping your own “to do” list, and this mechanism can handle a dozen or two dozen tasks. Consider an example of three people, each with their own to do list (see Work flowing across three to-do lists).

Figure 80. Work flowing across three to-do lists

In this situation, each person has their own “mental model” of what needs to be done, and their own tracking mechanism. We don’t know how the work is being transmitted: emails, phone calls, hallway conversations. (“Say, Joe, there is an issue with Customer X’s bill, can you please look into it?”)

Figure 81. Task handoffs present risk

But what happens when there are three of you? Mary asks Aparna to do something, and in order to complete it, she needs something from Joe, whom Mary is also asking to complete other tasks. As an organization scales, this can easily lead to confusion and “dropped balls” (see Task handoffs present risk [36]).

At some point, you need to formalize your model of the work, how it is described, and how it flows. This is such a fundamental problem in human society that many different systems, tools, and processes have been developed over the centuries to address it.

Probably the most important is the shared task reference point. What does this mean? The “task” is made “real” by associating it with a common, agreed artifact.

For example, a “ticket” may be created, or a "work order.” Or a “story,” written down on a sticky note. At our current level of understanding, there is little difference between these concepts. The important thing they have in common is an independent existence. That is, Mary, Aparna, and Joe might all change jobs, but the artifact persists independently of them. Notice also that the artifact — the ticket, the post-it note — is not the actual task, which is an intangible, consensus concept. It is a representation of this intangible “intent to perform.” We will discuss these issues of representation further in Chapter 11.
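
The idea of a task artifact with an independent existence can be made concrete with a small sketch. The Ticket class, its fields, and the in-memory ID counter below are hypothetical and chosen only for illustration; real ticketing or story-tracking tools differ in detail.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_ticket_ids = count(1)  # simple in-memory ID generator for this sketch


@dataclass
class Ticket:
    """A shared artifact representing an intangible 'intent to perform'."""
    description: str
    requested_by: str
    assigned_to: Optional[str] = None
    status: str = "open"
    ticket_id: int = field(default_factory=lambda: next(_ticket_ids))

    def reassign(self, person):
        # The artifact outlives any one person's involvement with it.
        self.assigned_to = person


ticket = Ticket("Investigate Customer X's bill", requested_by="Mary",
                assigned_to="Joe")
ticket.reassign("Aparna")   # Joe moves on; the ticket persists
print(ticket)
```

The point of the sketch is that the description, requester, and status live in the artifact rather than in anyone’s memory or private to-do list.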

A complex IT-based system is not needed if you are all in the same room! (Nor for that matter a complex process framework, such as ITIL or CObIT. There is a risk in using such frameworks at this stage of evolution — they add too much overhead for your level of growth). It’s also still too early for formal project management. The “project team” would be most or all of the organization, so what would be the point? A shared white board in a public location might be all that is needed (see Common list). This gives the team a “shared mental model” of who is doing what.

Figure 82. Common list

The design of the task board above has some issues, however. After the team gets tired of erasing and rewriting the tasks and their current assignments, they might adopt something more like this:

Figure 83. Simple task board

The board itself might be a white board or a cork bulletin board with push pins (see Simple task board). The notes could be sticky, or index cards. There are automated solutions as well. The tool doesn’t really matter. The important thing is that, at a glance, the entire team can see its flow of work and who is doing what.

This is sometimes called a “Kanban board,” although David Anderson (originator of the Kanban software method [14]) himself terms the basic technique a “card wall.” It also has been called a “Scrum Board.” The board at its most basic is not specific to either methodology. The term “Kanban” itself derives from Lean manufacturing principles; we will cover this in depth in the next section. The basic board is widely used because it is a powerful artifact. Behind its deceptive simplicity are considerable industrial experience and relevant theory from operations management and human factors. However, it has scalability limitations. What if the team is not all in the same room? We will cover this and related issues in Part III.

The card wall or Kanban board is the first channel we have for demand management. Demand management is a term meaning “understanding and planning for required or anticipated services or work.” Managing day-to-day incoming work is a form of demand management. Capturing and assessing ideas for next year’s project portfolio (if you use projects) is also demand management at a larger scale.

5.3. Learning from manufacturing

Instructor’s note

The concepts of queuing and work in process are critical to the rest of this book. Recommend classroom exercises and additional reading to ensure that they are well understood by students. The Phoenix Project and The Goal are excellent, entertaining books that use novelization to illustrate these principles.

Manufacturing? What do digital professionals have to learn from that? “Our work is not an assembly line!” is one frequently heard response. It is true that we need to be careful in drawing the right lessons from manufacturing, but there is a growing consensus on how to do this. Please keep an open mind as you read this chapter.

5.3.1. Kanban and its Lean origins

Figure 84. Lean pioneer Taiichi Ohno

To understand Kanban let’s discuss Lean briefly. We’ve had a passing mention of Lean already in this book. But what is it?

Lean is important. Regardless of your intended career path, it is advisable to read the great Lean classics, including The Machine That Changed the World, Lean Thinking, The Toyota Way, and Ohno’s own Toyota Production System. Toyota Kata is a more recent, in-depth analysis of Toyota’s culture.

Lean is a term invented by American researchers who investigated Japanese industrial practices and their success in the 20th century. After the end of World War II, no one expected the Japanese economy to recover the way it did. The recovery is credited to practices developed and promoted by Taiichi Ohno and Shigeo Shingo at Toyota (see Lean pioneer Taiichi Ohno [37]). These practices included:

  • Respect for people

  • Limiting work in process

  • Small batch sizes (driving towards “single piece flow”)

  • Just-in-time production

  • Decreased cycle time

Credit for Lean is also sometimes given to U.S. thinkers such as W. Edwards Deming, Joseph Juran, and the theorists behind the Training Within Industry methodology, each of whom played influential roles in shaping the industrial practices of post-war Japan.

Kanban is a term originating from Lean and the Toyota Production System. Originally, it signified a “pull” technique in which materials would only be transferred to a given workstation on a definite signal that the workstation required the materials. This was in contrast to “push” approaches where work was allowed to accumulate on the shop floor, on the (now discredited) idea that it was more “efficient” to operate workstations at maximum capacity.

Factories operating on a “push” model found themselves with massive amounts of inventory (work in process) in their facilities. This tied up operating capital and resulted in long delays in shipment. Japanese companies did not have the luxury of large amounts of operating capital, so they started experimenting with “single-piece flow.” This led to a number of related innovations, such as the ability to re-configure manufacturing machinery much more quickly than U.S. factories were capable of.

David J. Anderson was a product manager at Microsoft who was seeking a more effective approach to managing software development. In consultation with Don Reinertsen (introduced below) he applied the original concept of Kanban to his software development activities [14].

Scrum (covered in the previous chapter) is based on a rhythm with its scheduled sprints, for example, every two weeks. In contrast, Kanban is a continuous process with no specified rhythm (also known as cadence). Work is “pulled” from the backlog into active attention as resources are freed from previous work. This is perhaps the most important aspect of Kanban — the idea that work is not accepted until there is capacity to perform it.

You may have a white board covered with sticky notes, but if they are stacked on top of each other with no concern for worker availability, you are not doing Kanban. You are accepting too much work in process, and you are likely to encounter a “high-queue state” in which work becomes slower and slower to get done. (More on queues below).
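
The pull principle described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular Kanban tool; the KanbanBoard class, its method names, and the WIP limit of two are assumptions made for the example.

```python
from collections import deque


class KanbanBoard:
    """Minimal pull-based board: work enters 'doing' only when capacity exists."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.backlog = deque()
        self.doing = []
        self.done = []

    def add(self, item):
        self.backlog.append(item)      # demand may always be recorded...

    def pull(self):
        """Pull the next item only if the WIP limit allows it."""
        if len(self.doing) >= self.wip_limit or not self.backlog:
            return None                # ...but it is not accepted as active work
        item = self.backlog.popleft()
        self.doing.append(item)
        return item

    def finish(self, item):
        self.doing.remove(item)
        self.done.append(item)


board = KanbanBoard(wip_limit=2)
for story in ["login page", "billing fix", "search", "reports"]:
    board.add(story)

board.pull()
board.pull()
print(board.pull())          # None: at the WIP limit, nothing is pulled
board.finish("login page")   # capacity is freed...
print(board.pull())          # ...so "search" is pulled next
```

A board without the check in pull() would still look like a card wall, but it would accept unlimited work in process, which is exactly the situation the preceding paragraph warns against.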

5.3.2. The Theory of Constraints

Eliyahu Moshe Goldratt was an Israeli physicist and management consultant, best known for his pioneering work in management theory, including The Goal, which is a best-selling business novel frequently assigned in MBA programs. It and Goldratt’s other novels have had a tremendous effect on industrial theory, and now, digital management. One of the best known stories in The Goal centers around a Boy Scout march. Alex, the protagonist struggling to save his manufacturing plant, takes a troop of Scouts on a ten-mile hike. The troop has hikers of various speeds, yet the goal is to arrive simultaneously. As Alex tries to keep the Scouts together, he discovers that the slowest, most overweight Scout (Herbie) also has packed an unusually heavy backpack. The contents of Herbie’s pack are redistributed, speeding up both Herbie and the troop.

Charles Betz note: Gene Kim and The Phoenix Project
Figure 85. Gene Kim

Between 2005 and 2012, I was a lead Enterprise Architect at Wells Fargo Bank, primarily concerned with IT delivery capabilities such as portfolio and service management. One day around 2007, I arrived at my office to find an envelope from my friend Gene Kim, then CTO of Tripwire. Gene and I had been corresponding for some years on high-performing IT, IT process improvement, and related topics. In the envelope was a copy of a book called The Goal, by Eli Goldratt [109]. I was a little mystified, but after reading the book, I began to understand.

Gene saw the potential of the Theory of Constraints in understanding certain aspects of IT management and used it as a template to write another remarkable and influential book, The Phoenix Project [153]. Rather than a manufacturing plant, the Phoenix Project centers on the struggles of the IT team at a medium-sized automotive parts manufacturer and retailer. From a state of chaos, uncontrolled work in process and resource constraints, the team applies Lean, Agile, and DevOps techniques to great effect. In my view, The Phoenix Project is one of the most important works in the history of IT and digital management – in addition to being an enjoyable novel. I am honored to have been one of the original reviewers. If you are considering a career in IT or digital, it is essential reading. See especially Chapter 30 for an interesting discussion of manufacturing lessons in an IT context.

This story summarizes the Goldratt approach: finding the “constraint” to production (his work as a whole is called the Theory of Constraints). In Goldratt’s view, a system is only as productive as its constraint. At Alex’s factory, the “constraint” behind the overall productivity issues turns out to be the newest computer-controlled machine tool, one that could (in theory) perform the work of several older models but was now jeopardizing the entire plant’s survival. The story in this novelization draws important parallels with actual Lean case studies on the often-negative impact of such capital-intensive approaches to production.
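
The claim that a system is only as productive as its constraint can be illustrated with a simple sketch of a serial workflow. The pipeline steps, their daily capacities, and the function name system_throughput are invented for illustration; the model assumes each unit of work must pass through every step in turn.

```python
def system_throughput(capacities):
    """Return (constraint name, throughput) for a serial chain of steps.

    `capacities` maps each step to the units it can process per day.
    """
    constraint = min(capacities, key=capacities.get)
    return constraint, capacities[constraint]


# Hypothetical delivery pipeline; the step names and numbers are invented.
pipeline = {"design": 12, "build": 9, "test": 4, "deploy": 20}
print(system_throughput(pipeline))          # ('test', 4)

# Speeding up a non-constraint step does not change the overall result.
pipeline["deploy"] = 40
print(system_throughput(pipeline))          # still ('test', 4)

# Relieving the constraint does, and a new constraint appears.
pipeline["test"] = 10
print(system_throughput(pipeline))          # ('build', 9)
```

As in the hiking story, improvement effort pays off only when it is applied to the slowest step; once that step is relieved, the constraint moves elsewhere.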

There is a tremendous wealth of material available on Lean history and theory, and the IT student is urged to become familiar with it. Often, IT professionals resist drawing lessons from non-IT fields, because of a perception that these other fields (especially manufacturing) are “deterministic” while IT systems development is too “uncertain.” In reality, manufacturing is less deterministic than IT professionals often perceive, while software development is, at its most uncertain, just another form of new product development and/or research and development, and therefore can be managed on that basis. Furthermore, much IT work is not R&D (e.g., infrastructure provisioning), and that kind of work is even more suited for the application of manufacturing insights.

5.3.3. Queues and limiting work in process

Queues matter because they are economically important, they are poorly managed, and they have the potential to be much better managed. Queues profoundly affect the economics of product development. They cause valuable work products to sit idle, waiting to access busy resources. This idle time increases inventory, which is the root cause of many other economic problems; queues hurt cycle time, quality, and efficiency.
— Don Reinertsen
Principles of Product Development Flow
Figure 86. A queue

Even at this stage of our evolution, with just one co-located collaborative team, it’s important to consider work in process and how to limit it. One topic we will emphasize throughout the rest of this book is queuing. What is a queue? (see A queue [38]). A queue, intuitively, is a collection of tasks to be done, being serviced by some worker or resource in some sequence, for example:

  • Feature “stories” being developed by a product team

  • Customer requests coming into a service desk

  • Requests from a development team to an infrastructure team for services (e.g., network or server configuration, consultations, etc.)

Queuing theory is an important branch of mathematics used extensively in computing, operations research, networking, and other fields. It’s a topic getting much attention of late in the Agile and related movements, especially as it relates to digital product team productivity.

The amount of time that any given work item spends in the queue depends heavily on how busy the servicing resource is. A simple rule of thumb, discussed in The Phoenix Project [153], is shown below. (It is sometimes mislabeled as Little’s Law, which instead relates average work in process to throughput and time in the system.)

Wait time = (% Busy) / (% Idle)

In other words, if you divide the percentage of busy time for the resource by the percentage of idle time, you get the average wait time. So, if a resource is busy 40% of the days, but idle 60% of the days, the average time you wait for the resource is:

0.4/0.6 = 0.67 (2/3 of a day)

Conversely, if a resource is busy 95% of the time, the average time you’ll wait is:

0.95/0.05 = 19 (19 days!)

If you use a graphing calculator, you see the results in Time in queue increases nonlinearly with load.

Figure 87. Time in queue increases nonlinearly with load
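
The busy-to-idle ratio can also be tabulated in a few lines of Python instead of on a graphing calculator. This is a minimal sketch; the function name relative_wait and the sample utilization levels are assumptions for illustration, and the result is read in the same units as the example above (days).

```python
def relative_wait(utilization):
    """Busy fraction divided by idle fraction, as in the rule of thumb above."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


# Sample utilization levels chosen for illustration.
for busy in (0.40, 0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{busy:.0%} busy -> wait factor {relative_wait(busy):.2f}")
```

Moving from 50% to 80% utilization quadruples the wait, and moving from 90% to 95% roughly doubles it again, which is why small increases in load near full utilization are so damaging.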

Notice how the wait time approaches infinity as the queue utilization approaches 100%. And yet, full utilization of resources is often sought by managers in the name of “efficiency.” These basic principles are discussed by Gene Kim et al. in The Phoenix Project [153], Chapter 23, and more rigorously by Don Reinertsen in The Principles of Product Development Flow [221], Chapter 3. A further complication arises when work must pass through multiple queues; wait times for work easily expand to weeks or months. Such scenarios are not hypothetical; they are often seen in the real world and are a fundamental cause of IT organizations getting a bad name for being slow and unresponsive. Fortunately, digital professionals are gaining insight into these dynamics and (as of 2016) matters are improving across the industry.

Understanding queuing behavior is critical to productivity. Reinertsen suggests that poorly managed queues contribute to:

  • Longer cycle time

  • Increased risk

  • More variability

  • More overhead

  • Lower quality

  • Reduced motivation

These issues were understood by the pioneers of Lean manufacturing, an important movement throughout the 20th century. One of its central principles is to limit work in process. Work in process is obvious on a shop floor because physical raw materials (inventory) are quite visible (see Physical work in process [39]).

Figure 88. Physical work in process

Don Reinertsen developed the insight that product design and development had an invisible inventory of “work in process” that he called Design in Process (DIP). Just as managing physical work in process on the factory floor is key to a factory’s success, so correctly understanding and managing design in process is essential to all kinds of research and development organizations, including digital product development (e.g., building software). In fact, because digital systems are largely invisible even when finished, understanding their work in process is even more challenging.

It is easy and tempting for a product development team to accumulate excessive amounts of work in process. And, to some degree, having a rich backlog of ideas is an asset. But, just as some inventory (e.g., groceries) is perishable, so are design ideas. They have a limited time in which they might be relevant to a customer or a market. Therefore, accumulating too many of them at any point in time can be wasteful.

What does this have to do with queuing? DIP is one form of queue seen in the digital organization. Other forms include unplanned work (incidents and defects), implementation work, and many other concepts we’ll discuss in this chapter.

Regardless of whether it is a “Requirement,” a “User Story,” an “Epic,” “Defect,” “Issue,” or “Service Request,” you should remember it’s all just work. It needs to be logged, prioritized, assigned, and tracked to completion. Queues are the fundamental concept for doing this, and it’s critical that digital management specialists understand this.

These concepts of work in process and queuing are the basis for much of the rest of this book. Be sure you are completely comfortable with them.

Finally, some rules of thumb:

  • Finish what you start, if you can, before starting anything else. When you work on three things at once, the multi-tasking wastes time, and it takes you three times longer to get any one of the things done. (More on multi-tasking later in this chapter).

  • Infinitely long to-do lists (backlog) sap motivation. Consider limiting backlog as well as work in process.

  • Visibility into work in process is important for the collective mental model of the team.
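
The first rule of thumb above can be checked with a small simulation. This is a sketch under simplifying assumptions: tasks take a whole number of days, the worker switches tasks daily when multi-tasking, and the optional switch_cost parameter is an invented figure rather than a measured one. The function names are hypothetical.

```python
def sequential_finish_days(durations):
    """Finish each task completely before starting the next."""
    finish, elapsed = [], 0
    for d in durations:
        elapsed += d
        finish.append(elapsed)
    return finish


def round_robin_finish_days(durations, switch_cost=0):
    """Work one day on each unfinished task in turn (whole-day durations).

    `switch_cost` (in days) is added on each turn while more than one
    task is still open; it is an assumed parameter, not a measured value.
    """
    remaining = list(durations)
    finish = [None] * len(durations)
    elapsed = 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r <= 0:
                continue
            open_tasks = sum(1 for x in remaining if x > 0)
            if open_tasks > 1:
                elapsed += switch_cost
            elapsed += 1
            remaining[i] -= 1
            if remaining[i] == 0:
                finish[i] = elapsed
    return finish


tasks = [3, 3, 3]  # three tasks of three working days each (hypothetical)
print(sequential_finish_days(tasks))     # [3, 6, 9]
print(round_robin_finish_days(tasks))    # [7, 8, 9]
```

Even with no switching cost, the first task is delivered on day 7 instead of day 3 and nothing finishes earlier; any positive switch_cost pushes every completion date later still.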

There are deeper philosophical and cultural qualities to Kanban beyond workflow and queuing. Anderson and his colleagues continue to evolve Kanban into a more ambitious framework. Mike Burrows [45] identifies the following key principles:

  • Start with what you do now

  • Agree to pursue evolutionary change

  • Initially, respect current processes, roles, responsibilities, and job titles

  • Encourage acts of leadership at every level in your organization, from individual contributor to senior management

  • Visualize

  • Limit Work-in-Progress (WIP)

  • Manage flow

  • Make policies explicit

  • Implement feedback loops

  • Improve collaboratively, evolve experimentally (using models and the scientific method)

5.3.4. Multi-tasking

Figure 89. Multi-tasking destroys productivity

Multi-tasking (in this context) is when a human attempts to work on diverse activities simultaneously (for example, developing code for a new application while also handling support calls). There is broad agreement that multi-tasking destroys productivity, and even mental health [57]. Therefore, minimize multi-tasking. Multi-tasking in part emerges as a natural response when one activity becomes blocked (e.g., due to needing another team’s contribution). Approaches that enable teams to work without depending on outside resources are less likely to promote multi-tasking. Queuing and work in process thus become even more critical topics for management concern as activities scale up [40].

5.3.5. Scrum, Kanban, or both?

So, do you choose Scrum, Kanban, both, or neither? We can see in comparing Scrum and Kanban that their areas of focus are somewhat different:

  • Scrum is widely adopted in industry and has achieved a level of formalization, which is why Scrum training is widespread and generally consistent in content.

  • Kanban is more flexible but this comes at the cost of more management overhead. It requires more interpretation to translate to a given organization’s culture and practices.

  • As Scrum author Ken Rubin notes, “Scrum is not well suited to highly interrupt-driven work” [233]. Scrum on the service desk doesn’t work. (But if your company is too small, it may be difficult to separate out interrupt-driven work! We will discuss the issues around interrupt-driven work further in Chapter 6).

  • Finally, hybrids exist (Ladas' “Scrumban” [164]).

Ultimately, instead of talking too much about “Scrum” or “Kanban,” the student is encouraged to look more deeply into their fundamental differences. We will return to this topic in the section on Lean Product Development.

5.4. The shared mental model of the work to be done

Joint activity depends on interpredictability of the participants’ attitudes and actions. Such interpredictability is based on common ground — pertinent knowledge, beliefs, and assumptions that are shared among the involved parties. [155]
— Gary Klein et al.
“Common Ground and Coordination in Joint Activity”

The above quote reflects one of the most critical foundations of team collaboration: a common ground, a base of “knowledge, beliefs, and assumptions” enabling collaboration and coordination. Common ground is an essential quality of successful teamwork, and we will revisit it throughout the book. There are many ways in which common ground is important, and we will discuss some of the deeper aspects in terms of information in Chapter 11. Whether you choose Scrum, Kanban, or choose not to label your work management at all, the important thing is that you are creating a shared mental model of the work: its envisioned form and content, and your progress towards it.

In this section, we’ll discuss:

  • Visualization of work

  • The concept of Andon

  • The definition of done

  • Time and space shifting

Visualization is a good place to introduce the idea of common ground.

5.4.1. Visualization

As simple as the white board is, it makes WIP continuously visible, it enforces WIP constraints, it creates synchronized daily interaction, and it promotes interactive problem solving. Furthermore, teams evolve methods of using white boards continuously, and they have high ownership in their solution. In theory, all this can be replicated by a computer system. In practice, I have not yet seen an automated system that replicates the simple elegance and flexibility of a manual system.
— Don Reinertsen
Principles of Product Development Flow

Why are shared visual representations important? Depending on how you measure, between 40% and as much as 80% of the human cortex is devoted to visual processing. Visual processing dominates mental activity, consuming more neurons than the other four senses combined [243]. Visual representations are powerful communication mechanisms, well suited to our cognitive abilities.

This idea of common ground, a shared visual reference point informing the mental model of the team, is an essential foundation for coordinating activity (see Two people and a Kanban board [41]).

two people card wall
Figure 90. Two people and a Kanban board

This is why card walls or Kanban boards located in the same room are so prevalent. They communicate and sustain the shared mental model of a human team. A shared card wall, with its two dimensions and tasks on cards or sticky notes, is more informative than a simple to-do list (e.g., in a spreadsheet). The cards occupy two-dimensional space and are moved over time to signify activity; both are powerful cues to the human visual processing system.

Similarly, monitoring tools for systems operation make use of various visual clues. Large monitors may be displayed prominently on walls so that everyone can understand operational status. Human visual orientation is also why Enterprise Architecture persists. People will always draw to communicate. (More on visualization and Enterprise Architecture in Chapter 12).

Card walls and publicly displayed monitors are both examples of information radiators. The information radiator concept derives from the Japanese concept of Andon, important in Lean thinking.

5.4.2. Andon, and the Andon cord

Andon is a manufacturing term referring to a system to notify management, maintenance, and other workers of a particular quality or process problem. The centrepiece is a signboard incorporating signal lights to indicate which workstation has the problem. The alert can be activated manually by a worker using a pullcord or button or may be activated automatically by the production equipment itself. The system may include a means to stop production so the issue can be corrected. Some modern alert systems incorporate audio alarms, text, or other displays.
— Wikipedia

The Andon Cord (not to be confused with Andon in the general sense) is another well known concept in Lean Manufacturing. It originated with Toyota, where line workers were empowered to stop the production line if any defective materials or assemblies were encountered. Instead of attempting to work with the defective input, the entire line would shut down, and all concerned would establish what had happened and how to prevent it. The concept of the Andon cord concisely summarizes the Lean philosophy of employee responsibility for quality at all levels [200]. Where Andon is a general term for information radiator, the Andon cord implies a dramatic response to the problems of flow: all progress is stopped, everywhere along the line, and the entire resources of the production line are marshalled to collaboratively solve the issue so that it does not happen again. As Toyota thought leader Taiichi Ohno states:

Stopping the machine when there is trouble forces awareness on everyone. When the problem is clearly understood, improvement is possible. Expanding this thought, we establish a rule that even in a manually operated production line, the workers themselves should push the stop button to halt production if any abnormality appears.
— Taiichi Ohno

Andon and information radiators provide an important stimulus for product teams, informing priorities and prompting responses. They do not prescribe what is to be done; they simply indicate an operational status that may require attention.

5.4.3. Definition of done

As work flows through the system performing it, understanding its status is key to managing it. One of the most important mechanisms for doing this is simply to define what is meant by “done.” The Agile Alliance states:

“The team agrees on, and displays prominently somewhere in the team room, a list of criteria which must be met before a product increment, often a user story, is considered ‘done.’ Failure to meet these criteria at the end of a sprint normally implies that the work should not be counted toward that sprint’s velocity.” [7]

There are various patterns for defining “done.” For example, Thoughtworks recommends that the business analyst and developer both must agree that some task is complete (it is not up to just one person). Other companies may require peer code reviews [196]. The important point is that the team must agree on the criteria.

This idea of defining “done” can be extended by the team to other concepts such as “blocked.” The important thing is that this is all part of the team’s shared mental model, and is best defined by the team and its customers. (However, governance and consistency concerns may arise if teams are too diverse in such definitions).
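As a rough illustration of how an agreed definition of done affects what counts toward velocity, the following sketch checks each story against a team-defined list of criteria. The criteria names, the Story class, and the point values are all hypothetical, chosen only for illustration; a real team would substitute whatever it and its customers agree on.

```python
from dataclasses import dataclass, field

# Hypothetical criteria; each team agrees on its own list.
DEFINITION_OF_DONE = (
    "analyst_and_developer_agree",
    "peer_code_review",
    "tests_pass",
)


@dataclass
class Story:
    title: str
    points: int
    criteria_met: set = field(default_factory=set)

    def is_done(self) -> bool:
        # A story is done only when every agreed criterion is satisfied.
        return all(c in self.criteria_met for c in DEFINITION_OF_DONE)


def sprint_velocity(stories) -> int:
    # Only stories meeting the full definition of done count toward velocity.
    return sum(s.points for s in stories if s.is_done())


stories = [
    Story("Login page", 5, set(DEFINITION_OF_DONE)),
    Story("Password reset", 3, {"tests_pass"}),  # not yet reviewed or agreed
]
print(sprint_velocity(stories))  # prints 5
```

The same pattern could be reused for other shared definitions such as “blocked,” by swapping in a different list of criteria.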

5.4.4. Time and space shifting

At some point, your team will be faced with the problems of time and/or space shifting. People will be on different schedules, or in different locations, or both. There are two things we know about such working relationships. First, they lead to sub-optimal team communications and performance. Second, they are inevitable.

The need for time and space shifting is one of the major drivers for more formalized IT systems. It is difficult to effectively use a physical Kanban board if people aren’t in the office. The outcome of the daily standup needs to be captured for the benefit of those who could not be there.

However, acceptance of time and space shifting may lead to more of it, even when it is not absolutely required. Constant pressure and questioning are recommended, given the superior bandwidth of face-to-face communication in the context of team collaboration.

But not all work requires the same degree of collaboration. While we are still not ready for full-scale process management at this point in our evolution, we will likely encounter increasing needs to track customer or user service interactions, which can become quite numerous even for small, single-team organizations. Such work is often more individualized and routine, not requiring the full bandwidth of team collaboration. We’ll discuss this further with the topic of the Help or Service Desk, later in this chapter.

5.5. Lean product development and Don Reinertsen

We introduce Lean Product Development here and not in Chapter 4 because it is more about how the work is done, rather than what the content is. In particular, it requires an understanding of queuing.

5.5.1. Product development versus production

The idea of software development as an assembly line manned by semi-skilled interchangeable workers is fundamentally flawed and wasteful [261].
— Bjarne Stroustrup
C++ inventor

One of the challenges with applying Lean to IT (as noted previously) is that many IT professionals (especially software developers) believe that manufacturing (see Production [42]) is a “deterministic” field, whose lessons don’t apply to developing technical products. “Creating software is like creating art, not being on an assembly line,” is one line of argument.

assembly line
Figure 91. Production

The root cause of this debate is the distinction between product development and production. It is true that an industrial production line, for example producing forklifts by the thousands, may be repetitive. But how did the production line come to be? How was the forklift invented, or developed? It was created as part of a process of product development. It took mechanical engineering, electrical engineering, chemistry, materials science, and more. Combining fundamental engineering principles and techniques into a new, marketable product is not a repetitive process; it is a highly variable, creative process, and always has been.

Never confuse production with product development. The approaches, measures, and concerns are radically different. This book considers confusing the two an “original sin.”
Figure 92. Research and development

One dead end that organizations keep pursuing is the desire to make research and development (R&D) more “predictable”; that is, to reduce variation and predictably create innovations (Research and development [43]). This never works well; game-changing innovations are usually complex responses to complex market systems dynamics, including customer psychology, current trends, and many other factors. The process of innovating cannot, by its very nature, be made repeatable.

The concept of a repeatable “software process” is at best a risky one, and an idea that has caused much waste in the industry.

In IT, simply developing software for a new problem (or even new software for an old problem) is an R&D problem, not a production line problem. It is iterative, uncertain, and risky, just like other forms of product development. That does not mean it is completely unmanageable, or that its creation is a mysterious, artistic process. It is just a more variable process with a higher chance of failure, and with a need to incorporate feedback quickly to reduce the risk of open-loop control failure. These ideas are well known to the Agile community and its authors. However, there is one thought leader who stands out in this field: ex-Naval officer and nuclear engineer Donald Reinertsen, who was introduced in our previous discussions on beneficial variability in product discovery and queuing.

There are many books on Agile development and management. Consider the following books, all well known and influential in the fields of software development and Agile:

What do they all have in common? They all cite Don Reinertsen. Reinertsen’s work dates back to 1991, and (originally as a co-author with Preston G. Smith) presaged important principles of the Agile movement [254], from the general perspective of product development. Reinertsen’s influence is well documented and notable. He was partnering with David Anderson when Anderson created the “software Kanban” approach. He wrote the introduction to Leffingwell’s Agile Software Requirements, the initial statement of the Scaled Agile Framework. His influence is pervasive in the Agile community. His work is deep and based on fundamental mathematical principles such as queueing theory. His work can be understood as a series of interdependent principles:

  1. The flow or throughput of product innovation is the primary driver of financial success. (Notice that innovation must be accepted by the market — simply producing a new product is not enough).

  2. Product development is essentially the creation of information.

  3. The creation of information requires fast feedback.

  4. Feedback requires limiting work in process.

  5. Limiting work in process in product design contexts requires rigorous prioritization capabilities.

  6. Effective, economical prioritization requires understanding the cost of delay for individual product features.

  7. Understanding cost of delay requires smaller batch sizes, consisting of cohesive features, not large projects.

These can be summarized as a hierarchy of concerns (see Lean Product Development hierarchy of concerns).

Figure 93. Lean Product Development hierarchy of concerns

If a company wishes to innovate faster than competitors, it requires fast feedback on its experiments (whether traditionally understood, laboratory-based experiments, or market-facing validation as in Lean Startup). In order to achieve fast feedback, work in process must be reduced in the system; otherwise, high-queue states will slow feedback down.
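One standard queueing result that makes the link between work in process and feedback speed concrete is Little's Law: in a stable system, the long-run average time an item spends in the system equals average work in process divided by average throughput. The sketch below applies it to a hypothetical team; the function name and the throughput figure are assumptions made for illustration, and the law describes averages only, not any individual item.

```python
def average_time_in_system(wip: float, throughput_per_week: float) -> float:
    """Little's Law: average time in system = average WIP / average throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_week


# Hypothetical team that finishes 5 work items per week.
for wip in (30, 10, 5):
    weeks = average_time_in_system(wip, 5)
    print(f"WIP {wip:>2}: about {weeks:.0f} weeks before feedback on a new item")
```

With throughput held constant, cutting work in process from 30 items to 5 shortens the average wait for feedback from about six weeks to about one.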

But how do we reduce work in process? We have to prioritize. Do we rely on the Highest Paid Person’s Opinion, or do we try something more rational? This brings us to the critical concept of cost of delay.

5.5.2. Cost of delay

If you measure only one thing, measure cost of delay.
— Don Reinertsen
Principles of Product Development Flow

Don Reinertsen is well known for advocating the concept of “cost of delay” in understanding product economics. The term is intuitive; it represents the loss experienced by delaying the delivery of some value. For example, if a delayed product misses a key trade show, and therefore its opportunity for a competitive release, the cost of delay might be the entire addressable market. Understanding cost of delay is part of a broader economic emphasis that Reinertsen brings to the general question of product development. He suggests that product developers, in general, do not understand the fundamental economics of their decisions regarding resources and work in process.

In order to understand the cost of delay, it is first necessary to think in terms of a market-facing product (such as a smartphone application). Any market-facing product can be represented in terms of its lifecycle revenues and profits (see Product lifecycle economics by year and Product lifecycle economics, charted).

cost of delay graph
Figure 94. Product lifecycle economics by year
cost of delay graph
Figure 95. Product lifecycle economics, charted

The numbers above represent a product lifecycle, from R&D through production to retirement. The first year is all cost, as the product is being developed, and net profits are negative. In year 2, a small net profit is shown, but cumulative profit is still negative, as it remains in year 3. Only into year 3 does the product break even, ultimately achieving lifecycle net earnings of 175. But what if the product’s introduction into the market is delayed? The consequences can be severe.

Simply delaying delivery by a year, all things being equal in our example, will reduce lifecycle profits by 30% (see Product lifecycle, simple delay and Product lifecycle, simple delay, charted).

cost of delay table
Figure 96. Product lifecycle, simple delay
cost of delay graph
Figure 97. Product lifecycle, simple delay, charted
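The arithmetic behind a simple delay can be sketched in a few lines. The yearly figures below are invented for illustration and are not the numbers from the tables above; the sketch also assumes that the market window closes on the same date regardless of when the product launches, so a late launch loses the final year of sales.

```python
from itertools import accumulate


def delay_launch(yearly_net, years):
    """Shift sales later by `years`, assuming the market window closes on the same date."""
    development, *sales = yearly_net
    years = min(years, len(sales))
    return [development] + [0] * years + sales[: len(sales) - years]


def break_even_year(yearly_net):
    """Return the first year (1-based) in which cumulative profit is non-negative."""
    for year, total in enumerate(accumulate(yearly_net), start=1):
        if total >= 0:
            return year
    return None


# Hypothetical net profit per year: one development year, then five sales years.
on_time = [-100, 20, 60, 80, 60, 30]
late = delay_launch(on_time, 1)

cost_of_delay = sum(on_time) - sum(late)
print(sum(on_time), sum(late), cost_of_delay)            # prints 150 120 30
print(break_even_year(on_time), break_even_year(late))  # prints 4 5
```

In this invented example, a one-year delay costs 30 of 150 in lifecycle profit and pushes break-even back by a year, even though nothing else about the product has changed.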

But all things are not equal. What if, in delaying the product for a year, we allow a competitor to gain a superior market position? That could depress our sales and increase our per-unit costs, both bad outcomes (see Product lifecycle, aggravated delay and Product lifecycle, aggravated delay, charted):

cost of delay table
Figure 98. Product lifecycle, aggravated delay
cost of delay graph
Figure 99. Product lifecycle, aggravated delay, charted

More advanced cost of delay analysis argues that different product lifecycles have different characteristics. Josh Arnold of Black Swan Farming has visualized these as a set of profiles [17]. See Simple cost of delay [44] for the simple delay profile.

simple delay curve
Figure 100. Simple cost of delay

In this delay curve, while profits and revenues are lost due to late entry, it’s assumed that the product will still enjoy its expected market share. We can think of this as the “iPhone versus Android” profile, as Android was later but still achieved market parity. The aggravated cost of delay profile, however, looks like Aggravated cost of delay [45].

aggravated delay curve
Figure 101. Aggravated cost of delay

In this version, the failure to enter the market in a timely way results in long-term loss of market share. We can think of this as the “Amazon Kindle versus Barnes & Noble Nook” profile, as the Nook has not achieved parity, and does not appear likely to. There are other delay curves imaginable, such as delay curves for tightly time-limited products (e.g., those found in the fashion industry) or cost of delay that’s only incurred after a specific date (such as in complying with a regulation).

Reinertsen observes that product managers may think that they intuitively understand cost of delay, but when he asks them to estimate the aggregate cost of (for example) delaying their product’s delivery by a given period of time, the estimates provided by product team participants in a position to delay delivery may vary by up to 50:1. This is powerful evidence that a more quantitative approach is essential, as opposed to relying on “gut feel” or the Highest Paid Person’s Opinion.

Finally, Josh Arnold notes that cost of delay is much easier to assess on small batches of work. Large projects tend to attract many ideas for features, some of which have stronger economic justifications than others. When all these features are lumped together, understanding the cost of delay becomes a challenging process, because it then becomes an average across the various features. But since features, ideally, can be worked on individually, understanding the cost of delay at that level helps with the prioritization of the work.
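A small sketch may help show why a project-level figure hides useful information. The feature names and cost of delay estimates below are hypothetical, and the ranking deliberately considers only cost of delay; it ignores the effort each feature requires, which a fuller prioritization approach would presumably also weigh.

```python
# Hypothetical estimates of cost of delay, in currency units per week of delay.
features = {
    "one-click checkout": 9000,
    "gift wrapping option": 500,
    "new logo animation": 100,
    "regulatory report": 4000,
}

# Lumped into one project, only the average is visible.
project_average = sum(features.values()) / len(features)
print(f"Average across the project: {project_average:.0f} per week")

# Assessed individually, the features can be ranked against each other.
ranked = sorted(features.items(), key=lambda item: item[1], reverse=True)
for name, cost in ranked:
    print(f"{cost:>6} per week  {name}")
```

The project average of 3400 per week describes none of the four features well: the most urgent costs 90 times as much per week of delay as the least urgent.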

The combination of product roadmapping, a high quality DEEP backlog, and cost of delay is a solid foundation for digital product development. It’s essential to have an economic basis for making the prioritization decision. Clarifying the economic basis is a critical function of the product roadmap. Through estimation of story points, we can understand the team’s velocity. Estimating velocity is key to planning, which we’ll discuss further in Chapter 8. Through understanding the economics of product availability to the market or internal users, the cost of delay can drive backlog prioritization.

5.6. The service desk and related tools

As a digital product starts to gain a user base, and as a company matures and grows, there emerges a need for human-to-human support. This is typically handled by a help desk or service desk, serving as the human face of IT when the IT systems are not meeting people’s expectations. We were first briefly introduced to the concept in our Service Lifecycle (see The essential states of the digital service (or product)).

The service desk is an interrupt-driven, task-oriented capability. It serves as the first point of contact for IT services that require some human support or intervention. As such, its role can become broad: from provisioning access, to assisting users with navigation and usage, to serving as an alert channel for outage reporting. The service desk ideally answers each user’s request immediately, requiring no follow-up. If follow-up is required, a “ticket” is “issued.”

As a “help desk,” it may be focused on end user assistance and reporting incidents. As a “service desk,” it may expand its purview to accepting provisioning or other requests of various types (and referring and tracking those requests). Note that in some approaches, Service Request and Incident are considered to be distinct processes.

The term “ticket” dates to paper-based industrial processes, where the “help desk” might actually be a physical desk, where a user seeking services might be issued a paper ticket. Such “tickets” were also used in field services (see A “ticket” from early industry).

old ticket
Figure 102. A “ticket” from early industry

In IT-centric domains, tickets are virtual; they are records in databases, not paper. The user is given a ticket “ID” or “number” for tracking (e.g., so they can inquire about the request’s status). The ticket may be “routed” to someone to handle, but again in a virtual world what really happens is that the person it’s routed to is directed to look at the record in the database. (In paper based processes, the ticket might well be moved physically to various parties to perform the needed work).

A service desk capability needs:

  • Channels for accepting contacts (e.g., telephone, email, chat)

  • Staffing appropriate to the volume and nature of those requests

  • Robust workflow capabilities to track their progress, and

  • Routing and escalation mechanisms, since clarifying the true nature of contacts and getting them serviced by the most appropriate means are nontrivial challenges

There is extensive material on managing Service Desks; the reader is referred first to the Help Desk Institute.

Work management in practice has divided between development and operations practices and tools. However, DevOps and Kanban are forcing a reconsideration and consolidation. Historically, here are some of the major tools and channels through which tasks and work are managed on both sides:

Table 7. Dev versus Ops tooling

Development                    Operations
User story tracking system     Service or help desk ticketing system
Issue/risk/action item log     Incident management system
Defect tracker                 Change management system

All of these systems have common characteristics. All can (or should) be able to:

  • Register a new task

  • Describe the work to be done (development or break/fix/remediate)

  • Represent the current status of the work

  • Track who is currently accountable for it (individual and/or team)

  • Indicate the priority of the work, at least in terms of a simple categorization such as high/medium/low

More advanced systems may also be able to:

  • Link one unit of work to another (either as parent/child or peer-to-peer)

  • Track the effort spent on the work

  • Prioritize and order work

  • Track the referral or escalation trail of the work, if it is routed to various parties

  • Link to communication channels such as conference bridges and paging systems
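The common characteristics listed above suggest a shared underlying record, whatever the tool is called. The following is a minimal sketch of such a record; the class, field, and status names are hypothetical and do not correspond to any particular product's data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    DONE = "done"


class Priority(Enum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class WorkItem:
    item_id: int                      # the ticket "number" given to the user
    description: str                  # the work to be done
    status: Status = Status.OPEN
    priority: Priority = Priority.MEDIUM
    assignee: Optional[str] = None    # individual or team currently accountable
    parent_id: Optional[int] = None   # link to a parent unit of work, if any
    hours_spent: float = 0.0          # effort tracking
    routing_trail: List[str] = field(default_factory=list)

    def route_to(self, party: str) -> None:
        """Reassign the item and record the hop in the referral/escalation trail."""
        self.routing_trail.append(party)
        self.assignee = party


ticket = WorkItem(1001, "User cannot log in", priority=Priority.HIGH)
ticket.route_to("service desk")
ticket.route_to("identity team")  # escalation
ticket.status = Status.IN_PROGRESS
print(ticket.assignee, ticket.routing_trail)
```

Note that “routing” here only changes fields on the record, which matches the earlier observation that in a virtual world the ticket itself does not move.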

The first automated system (computer-based) you may find yourself acquiring along these lines is a help desk system. You may be a small company, but when you start to build a large customer base, keeping them all happy requires more than a manual, paper-based card wall or Kanban board.

5.7. Towards process management

… one of the principal purposes of processes [is] to manage work without needing a manager and to reduce the cost and increase the value of outcomes of repetitive tasks. Good processes supplement management, augment its reach, and ensure consistency of quality outcomes.
— Abbott and Fisher
The Art of Scalability

For a manufacturer, reducing variability always improves manufacturing economics. This is not true in product development.
— Don Reinertsen
Principles of Product Development Flow
complex Kanban
Figure 103. Medium-complex Kanban board

The Kanban board has started to get complicated (see Medium-complex Kanban board [46]). We’re witnessing an increasing amount of work that needs to follow a sequence, or checklist, for the sake of consistency. Process management is when we need to start managing:

  • Multiple

  • Repeatable

  • Measurable sequences of activity

  • Considering their interdependencies

  • Perhaps using common methods to define them

  • And even common tooling to support multiple processes

5.7.1. Process basics

We’ve discussed some of the factors leading to the need for process management, but we haven’t yet come to grips with what it is. To start, think of a repeatable series of activities, such as when a new employee joins (see Simple process flow).

flow steps
Figure 104. Simple process flow

Process management can represent conditional logic (see Conditionality).

flow steps
Figure 105. Conditionality
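A repeatable sequence with conditional steps can also be expressed directly in code. The sketch below is only an illustration: the step names and employee attributes are hypothetical and are not taken from the figures.

```python
def onboard(employee: dict) -> list:
    """Return the ordered steps performed for one new employee."""
    steps = ["create HR record", "issue badge"]
    # Conditional branch: only some employees need this step.
    if employee.get("needs_laptop", False):
        steps.append("provision laptop")
    steps.append("create email account")
    # A second condition, based on the employee's role.
    if employee.get("role") == "engineer":
        steps.append("grant source control access")
    steps.append("schedule orientation")
    return steps


print(onboard({"name": "Pat", "role": "engineer", "needs_laptop": True}))
print(onboard({"name": "Sam", "role": "analyst"}))
```

Every run follows the same defined sequence, and the conditions determine which optional steps are included for a given case.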

Process models can become extremely intricate, and can describe both human and automated activity. Sometimes, the process simply becomes too complicated for humans to follow. Notice how different the process models are from the card wall or Kanban board. In Kanban, everything is a work item, and the overall flow is some simple version of “to do, doing, done.” This can become complex when the flow gets more elaborate (e.g., various forms of testing, deployment checks, etc.). In a process model, the activity is explicitly specified on the assumption it will be repeated. The boxes representing steps are essentially equivalent to the columns on a Kanban board, but since sticky notes are not being used, process models can become very complex, like a Kanban board with dozens or hundreds of columns! Process modeling is discussed in detail in the appendix. Process management as a practice is discussed extensively in Part III. However, before we move on, two simple variations on process management are:

  • Checklists

  • Case Management

5.7.2. The Checklist Manifesto

Figure 106. A Boeing 747 checklist

The Checklist Manifesto is the name of a notable book by author/surgeon Atul Gawande [105]. The title can be misleading; the book in no way suggests that all work can be reduced to repeatable checklists. Instead, it is an in-depth examination of the relationship between standardization and complexity. Like case management, it addresses the problem of complex activities requiring professional judgment.

Unlike case management (discussed below), it explores more time-limited and often urgent activities such as flight operations (A Boeing 747 checklist [47]), large-scale construction, and surgery. These activities, as a whole, cannot be reduced to one master process; there is too much variation and complexity. However, within the overall bounds of flight operations, or construction, or surgery, there are critical sequences of events that MUST be executed, often in a specific order. Gawande discusses the airline industry as a key exemplar of this. Instead of one “master checklist” there are specific, clear, brief checklists for a wide variety of scenarios, such as a cargo hold door becoming unlatched.

There are similarities and differences between core Business Process Management (BPM) approaches and checklists. Often, BPM is employed to describe processes that are automated and whose progress is tracked in a database. Checklists, on the other hand, may be more manual, intended for use in a closely collaborative environment (such as an aircraft cockpit or operating room), and may represent a briefer period of time.

Full process management specifies tasks and their flow in precise detail. We have not yet got to that point with our Kanban board, but when we start adding checklists, we are beginning to differentiate the various processes at a detailed level. We will revisit Gawande’s work in Part III with the coordination technique of the submittal schedule.

5.7.3. Case management

case management
Figure 107. Process management versus case management
Do not confuse “Case” here with Computer Assisted Software Engineering.

Case management is a concept used in medicine, law, and social services. Case management can be thought of as a high-level process supporting the skilled knowledge worker applying their professional expertise. Cases are another way of thinking about the relationship between the Kanban board and process management (see Process management versus case management).

Workflow Management Coalition on Case Management

[Business Process Modeling] and [Case Management] are useful for different kinds of business situations.

  • Highly predictable and highly repeatable business situations are best supported with BPM.

    • For example signing up for cell phone service: it happens thousands of times a day, and the process is essentially fixed.

  • Unpredictable and unrepeatable business situations are best handled with Case Management.

    • For example, investigation of a crime will require following up on various clues, down various paths, which are not predictable beforehand. There are various tests and procedures to use, but they will be called only when needed.

[282], via [88]

Noted IT consultant and author Rob England contrasts “case management” with “standard process” in his book Plus! The Standard+Case Approach: See Service Response in a New Light [88]. Some processes are repeatable and can be precisely standardized, but it is critical for anyone working in complex environments to understand the limits of a standardized process. Sometimes, a large “case” concept is sufficient to track the work. The downside may be that there is less visibility into the progress of the case: the person in charge of it needs to provide a status that can’t be represented as a simple report. We will see process management again in Chapter 6 in our discussion of operational process emergence.

5.8. Conclusion

“Work management” may seem like an oddly generic title, but it’s important to distinguish the fundamental problem from more elaborate constructs such as process and project management. At the end of the day, “it’s all just work.”

Understanding work requires understanding tasks and their lifecycle, the characteristics of queues, work in progress, and the cost of delay.

5.8.1. Discussion questions

  • When have you been on a team that needed to have a shared mental model? How did you achieve this?

  • Where do you see queues in your daily life? Are they frequently in high-queue states? What could be done to improve them?

  • How do you keep track of tasks? Compare approaches with your group.

  • Do you choose Scrum, Kanban, both, or neither?

5.8.2. Research & practice

  • Think of a queue you encounter on a regular basis. How can you understand its cost of delay?

  • Research an automated, online Kanban tool (Trello, LeanKit, or similar). Compare it to a physical white board, ideally as a team. Which do you prefer and why?

  • Recall your product idea from earlier discussions. How would you estimate its cost of delay? Can you estimate the cost of delay by feature?

  • (Advanced) Read some of Reinertsen’s The Principles of Product Development Flow and write up or present two or three key insights you gained.

6. Operations Management

Instructor’s note

Although this chapter is titled “operations management,” it also brings in infrastructure engineering at a higher level, assuming that the product is continuing to scale up.

6.1. Introduction

Operations center
Figure 108. Olympic operations center

As your product gains more use, you inevitably find that running it becomes a distinct concern from building it. For all their logic, computers are still surprisingly unreliable. Servers running well tested software may remain “up” for weeks, and then all of a sudden hang and have to be rebooted. Sometimes it’s clear why (for example, a log file filled up that no-one expected) and in other cases, there just is no explanation.

Engineering and operating complex IT-based distributed systems is a significant challenge. Even with infrastructure as code and automated continuous delivery pipelines, operations as a class of work is distinct from software development per se. The work is relatively more interrupt-driven, as compared to the “heads-down” focus on developing new features. Questions about scalability, performance, caching, load balancing, and so forth usually become apparent first through feedback from the operations perspective, whether or not you have a formal operations “team.”

We are still just one team with one product, but as we enter this last chapter of Part II, the assumption is that we have considerable use of our product. With today’s technology, correctly deployed and operated, even a small team can support large workloads. This does not come easily, however. Systems must be designed for scale and ease of operations. They need to be monitored and managed for performance and capacity. We will also revisit the topic of configuration management at a more advanced level.

We discussed the evolution of infrastructure in Chapter 2 and applications development in Chapter 3, and will continue to build on those foundations. The practices of Change, Incident, and Problem Management have been employed in the industry for decades and are important foundations for thinking about operations. The concept of Site Reliability Engineering is an important new discipline emerging from the practices of companies such as Google and Facebook. We will examine its principles and practices [48].

6.1.1. Chapter 6 outline

  • Introduction

  • Introducing operations management

    • Defining operations

    • The concept of “service level”

    • A day in the life

  • Monitoring

    • Monitoring techniques

    • Designing operations into products

    • Aggregation and operations centers

    • Understanding business impact

    • Capacity and performance management

  • Operational response

    • Communication channels

    • Operational process emergence

  • Designing for operations and scale

    • The CAP principle

    • The AKF scaling cube

  • Configuration management and operations

    • State, configuration, and discovery

    • Environments and the fierce god of “Production”

    • “Development is production”

  • Advanced topics in operations

    • Post-mortems, blamelessness, and operational demand

    • Site reliability engineering at Google

  • Conclusion

6.1.2. Chapter 6 learning objectives

  • Distinguish kinds of work, especially operational versus development

  • Understand basics of monitoring, event management, and impact

  • Describe the basics of change and incident management

  • Describe operational feedback into product design and the kinds of concerns it raises

  • Describe impact and dependency analysis and why it is important

6.2. An overview of operations management

6.2.1. Defining operations

…​ those unpredictable but required parts of many teams' lives—supporting a production website or database, taking support calls from key customers or first-tier technical support, and so on …​
— Mike Cohn
Agile Estimating
Figure 109. Call center operators

What do we mean by operations? Operations management is a broad topic in management theory, with whole programs dedicated to it in both business and engineering schools. Companies frequently hire Chief Operations Officers to run the organization. We started to cover operations management in Chapter 5, as we examined the topic of “work management”; in traditional operations management, the question of work and who is doing it is critical. For the digital professional, “operations” tends to have a more technical meaning than the classic business definition, being focused on the immediate questions of systems integrity, availability and performance, and feedback from the user community (i.e., the service or help desk). We see such a definition from Limoncelli et al.:

… operations is the work done to keep a system running in a way that meets or exceeds operating parameters specified by a Service Level Agreement (SLA). Operations includes all aspects of a service’s life cycle: from initial launch to the final decommissioning and everything in between [169 p. 147].

Operations often can mean “everything but development” in a digital context. In the classic model, developers built systems and “threw them over the wall” to operations. Each side had specialized processes and technology supporting their particular concerns. However, recall our discussion of design thinking: the entire experience is part of the product. This applies both to those consuming it and to those running it. Companies undergoing digital transformation are experimenting with many different models, up to and including the complete merging of Development and Operations-oriented skills under common product management, as we will see in Part III.

In a digitally transformed enterprise, Operations is part of the Product.
Figure 110. Operations supports the digital moment of truth

Since we have a somewhat broader point of view covering all of digital management, for this chapter we’ll use the following definition of operations:

Operations is the direct facilitation and support of the digital value experience. It tends to be less variable, more repeatable, yet more interrupt-driven than product development work. It is more about restoring a system to a known state, and less about creating new functionality.

What do we mean by this? In terms of our dual-axis value chain, operations supports the day-to-day delivery of the digital “moment of truth” (see Operations supports the digital moment of truth). Consider the following various examples of “operations” in an IT context. Some are relevant to a “two pizza product team” scenario, some might be more applicable to larger environments:

  • Systems operators sitting in 24x7 operations centers, monitoring system status and responding to alerts.

  • Help desk representatives answering phone calls from users requiring support (see Call center operators [49]). They may be calling because a system or service they need is malfunctioning. They may also be calling because they do not understand how to use the system for the value experience they have been led to expect from it. Again, this is part of their product experience.

  • Developers and engineers serving “on call” on a rotating basis to respond to systems outages referred to them by the operations center.

  • Data center staff performing routine work, such as installing hardware, granting access, or running or testing backups. Such routine work may be scheduled, or it may be on request (e.g., ticketed).

  • Field technicians physically dispatched to a campus or remote site to evaluate and if necessary update or fix IT hardware and/or software: install a new PC, fix a printer, service a cell tower antenna (see Field technician [50]).

  • Security personnel ensuring security protocols are followed, e.g., access controls.

Figure 111. Field technician

As above, the primary thing that operations does NOT do is develop new systems functionality. Operations is process-driven and systematic and tends to be interrupt-driven, whereas research and development fail the “systematic” part of the definition (review the definitions in process, product, and project management). However, new functionality usually has operational impacts. In manufacturing and other traditional industries, product development was a minority of work, while operations was where the bulk of work happened. Yet when an operational task involving information becomes well defined and repetitive, it can be automated with a computer.

This continuous cycle of innovation and commoditization has driven closer and closer ties between “development” and “operations.” This cycle also has driven confusion around exactly what is meant by “operations.” In many organizations, there is an “Infrastructure and Operations” (I&O) function. Pay close attention to the naming. A matrix may help because we have two dimensions to consider here (see Application, infrastructure, development, operations).

Table 8. Application, infrastructure, development, operations
Layer | Development phase | Operations phase
Application layer | Application developers. Handle demand, proactive and reactive, from product and operations. Never under I&O. | Help desk. Application support and maintenance (provisioning, fixes not requiring software development). Often under I&O.
Infrastructure layer | Engineering team. Infrastructure platform engineering and development (design and build typically of externally sourced products). Often under I&O. | Operations center. Operational support, including monitoring system status. May monitor both infrastructure and application layers. Often under I&O.

Notice that we distinguish carefully between the application and infrastructure layers. Review our pragmatic definitions:

  • Applications are consumed by people who are NOT primarily concerned with IT

  • Infrastructure is consumed by people who ARE primarily concerned with IT

Infrastructure services and/or products, as we discussed in Chapter 2, need to be designed and developed before they are operated, just like applications. This may all seem obvious, but there is an industry tendency to lump three of the four cells in the table into the I&O function when, in fact, each represents a distinct set of concerns.

6.2.2. The concept of “service level”

Either a digital system is available and providing a service, or it isn’t. The concept of “service level” was mentioned above by Limoncelli. A level of service is typically defined in terms of criteria such as:

  • What % of the time will the service be available?

  • If the service suffers an outage, how long until it will be restored?

  • How fast will the service respond to requests?

A service level agreement, or SLA, is a form of contract between the service consumer and service provider, stating the above criteria in terms of a business agreement. When a service’s performance does not meet the agreement, this is sometimes called a “breach” and the service provider may have to pay a penalty (e.g., the customer gets a 5% discount on that month’s services). If the service provider exceeds the SLA, perhaps a credit will be issued.

SLAs drive much operational behavior. They help prioritize incidents and problems, and the risk of proposed changes is understood in terms of the SLAs.
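
To make the availability criterion concrete, here is a minimal sketch of the arithmetic. The 99.9% target, the 30-day month, the 90 minutes of outage, and the function names are all illustrative assumptions, not terms from any particular SLA.

```python
def availability_pct(total_minutes, outage_minutes):
    """Percentage of the period during which the service was available."""
    return 100.0 * (total_minutes - outage_minutes) / total_minutes


def is_breach(total_minutes, outage_minutes, target_pct):
    """True when measured availability falls below the agreed target."""
    return availability_pct(total_minutes, outage_minutes) < target_pct


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Hypothetical month: outages totalling 90 minutes, against a 99.9% target.
measured = availability_pct(MINUTES_IN_30_DAYS, 90)
print(f"availability: {measured:.3f}%")
print("breach" if is_breach(MINUTES_IN_30_DAYS, 90, 99.9) else "within SLA")
```

Under these assumptions, a 99.9% target over 30 days leaves only about 43 minutes of allowable downtime, so 90 minutes of outage would count as a breach.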

6.3. Monitoring

Telemetry is an automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. The word is derived from Greek roots: 'tele' = remote, and 'metron' = measure.
— Wikipedia

Computers run in large data centers, where physical access to them is tightly controlled. Therefore, we need telemetry to manage them. The practice of collecting and initiating responses to telemetry is called monitoring.

6.3.1. Monitoring techniques

Limoncelli et al. define monitoring this way:

Monitoring is the primary way we gain visibility into the systems we run. It is the process of observing information about the state of things for use in both short-term and long-term decision-making [169].

Figure 112. Simple monitoring

But how does one “observe” computing infrastructure? Clearly, sitting in the data center (assuming you could get in) and looking at the lights on the faces of servers will not convey much useful information, beyond whether they are off or on. Monitoring tools are the software that watches the software (and systems more broadly).

A variety of techniques are used to monitor computing infrastructure. Typically these involve communication over a network with the device being managed. Often, the network traffic is on the same network carrying the primary traffic of the computers. Sometimes, however, there is a distinct “out of band” network for management traffic. A simple monitoring tool will interact on a regular basis with a computing node, perhaps by “pinging” it periodically, and will raise an alert if the node does not respond within an expected timeframe (see Simple monitoring).
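
As a rough illustration of the simple case, the sketch below probes each node and reports those that do not respond in time. It substitutes a TCP connection attempt for a true ICMP “ping” (which usually needs elevated privileges), and the function names and example addresses are hypothetical.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Try to open a TCP connection; treat any socket error as 'no response'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_once(nodes, probe=is_reachable):
    """Probe every node once and return an alert message for each failure."""
    alerts = []
    for name, (host, port) in nodes.items():
        if not probe(host, port):
            alerts.append(f"ALERT: {name} ({host}:{port}) did not respond")
    return alerts


# Demonstration with a stand-in probe, so that no real network is needed.
def fake_probe(host, port):
    return host != "10.0.0.2"  # pretend this one node is down


nodes = {"app-server-1": ("10.0.0.1", 22), "app-server-2": ("10.0.0.2", 22)}
for alert in poll_once(nodes, probe=fake_probe):
    print(alert)
```

A real monitor would call poll_once on a schedule with the default probe; passing the probe as a parameter simply makes the logic easy to exercise without live servers.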

Figure 113. Extended monitoring

More broadly, these tools provide a variety of mechanisms for monitoring and controlling operational IT systems; they may monitor:

  • Computing processes and their return codes

  • Performance metrics (e.g., memory and CPU utilization)

  • Events raised through various channels

  • Network availability

  • Log file contents (searching the files for messages indicating problems)

  • A given component’s interactions with other elements in the IT infrastructure

  • and more (see Extended monitoring)

Figure 114. User experience monitoring

Some monitoring covers low-level system indicators not usually of direct interest to the end user. Other monitoring simulates the end user experience; SLAs are often defined in terms of the response time as experienced by the end user (see User experience monitoring). See [169], Chapters 16-17.

All of this data may then be forwarded to a central console and be integrated, with the objective of supporting the organization’s service level agreements in priority order. Enterprise monitoring tools are notorious for requiring agents (small, continuously-running programs) on servers; while some things can be detected without such agents, having software running on a given computer still provides the richest data. Since licensing is often agent-based, this gets expensive.

Monitoring systems are similar to source control systems in that they are a critical point at which metadata diverges from the actual system under management.
Figure 115. Configuration, monitoring, and element managers

Related to monitoring tools is the concept of an element manager (see Configuration, monitoring, and element managers). Element managers are low-level tools for managing various classes of digital or IT infrastructure. For example, Cisco provides software for managing network infrastructure, and EMC provides software for managing its storage arrays. Microsoft provides a variety of tools for managing various Windows components. Notice that such tools often play a dual role, in that they can both change the infrastructure configuration and report on its status. Many, however, rely on graphical user interfaces, which are falling out of favor as a basis for configuring infrastructure.

6.3.2. Designing operations into products

Just as Agile software development methods attempt to solve the problem associated with not knowing all of your requirements before you develop a piece of software, so must we have an agile and evolutionary development mindset for our monitoring platforms and systems.
— Abbott and Fisher
The Art of Scalability

Monitoring tools, out of the box, can provide ongoing visibility to well understood aspects of the digital product: the performance of infrastructure, the capacity utilized, and well understood, common failure modes (such as a network link being down). However, the digital product or application also needs to provide its own specific telemetry in various ways (see Custom software requires custom monitoring). This can be done through logging to output files, or in some cases through raising alerts via the network.

Figure 116. Custom software requires custom monitoring

A typical way to enable custom monitoring is to first use a standard logging library as part of the software development process. The logging library provides a consistent interface for the developer to create informational and error messages. Often, multiple “levels” of logging are seen, some more verbose than others. If an application is being troublesome, a more verbose level of logging may be turned on. The monitoring tool is configured to scan the logs for certain information. For example, if the application writes:

APP-ERR-SEV1-946: Unresolvable database consistency issues detected, terminating application.

into the log, the monitoring tool can be configured to recognize the severity of the message and immediately raise an alert.
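
A minimal sketch of such a scan follows. It assumes, based only on the sample message above, that error lines follow the pattern APP-ERR-SEV<n>-<code> and that lower severity numbers are more urgent. The timestamps and the SEV3 line in the sample log are invented for illustration.

```python
import re

# Assumed convention, inferred from the sample message: APP-ERR-SEV<n>-<code>: <text>
ERROR_PATTERN = re.compile(r"APP-ERR-SEV(\d)-(\d+): (.*)")


def scan_log(lines, alert_at_or_below=1):
    """Return (severity, code, text) for each line severe enough to alert on.

    Lower numbers are treated as more severe, so SEV1 outranks SEV3.
    """
    alerts = []
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match and int(match.group(1)) <= alert_at_or_below:
            alerts.append((int(match.group(1)), match.group(2), match.group(3)))
    return alerts


# Hypothetical log contents; only the SEV1 message comes from the text.
sample_log = [
    "2024-01-01 10:00:00 INFO Application started",
    "2024-01-01 10:05:00 APP-ERR-SEV3-101: Cache miss rate above normal",
    "2024-01-01 10:06:00 APP-ERR-SEV1-946: Unresolvable database consistency "
    "issues detected, terminating application.",
]

for severity, code, text in scan_log(sample_log):
    print(f"ALERT (SEV{severity}, code {code}): {text}")
```

Raising alert_at_or_below to 3 would also surface the less severe line, which is one way a team might temporarily widen what it watches while investigating a troublesome application.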

Finally, as the quote at the beginning of this section suggests, it is critical that the monitoring discipline is based on continuous improvement. (More to come on continuous improvement in Chapter 7.) Keeping monitoring techniques current with your operational challenges is a never-ending task. Approaches that worked well yesterday may generate too many false positives today, overloading your operations team with noise. Ongoing questioning and improvement of your approaches are essential to keeping your monitoring system optimized for managing business impact as efficiently and effectively as possible.

6.3.3. Aggregation and operations centers

Figure 117. Aggregated monitoring

It is not possible for a 24x7 operations team to access and understand the myriad element managers and specialized monitoring tools present in a large IT environment. Instead, these teams rely on aggregators of various kinds to provide an integrated view of the complexity (see Aggregated monitoring). These aggregators may focus on status events, or specifically on performance aspects related either to the elements or to logical transactions flowing across them. They may incorporate dependencies from configuration management to provide a true “business view” into the event streams. This is directly analogous to the concept of the Andon board from Lean practices or the idea of the “information radiator” from Agile principles.

24x7 operations means operations conducted 24 hours a day, 7 days a week.

A monitoring console may present a rich and detailed set of information to an operator. Too detailed, in fact, as systems become large. Raw event streams must be filtered for specific events or patterns of concern. Event de-duplication starts to become an essential capability, which leads to distinguishing the monitoring system from the event management system. Also, for this reason, monitoring tools are often linked directly to ticketing systems; on certain conditions, a ticket (e.g., an Incident) is created and assigned to a team or individual.

Enabling a monitoring console to auto-create tickets, however, needs to be carefully considered and designed. A notorious scenario is the “ticket storm,” where a monitoring system creates multiple (perhaps thousands) of tickets, all essentially in response to the same condition.
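
One common defense is to de-duplicate events before they become tickets. The sketch below is one possible approach under simple assumptions: events arrive in time order as (timestamp, source, condition) tuples, and repeats of the same source and condition within a five-minute window are counted on the existing ticket instead of opening a new one. The window length, field names, and sample events are hypothetical.

```python
def deduplicate(events, window_seconds=300):
    """Collapse repeated events into one ticket per (source, condition) per window."""
    tickets = []
    open_ticket = {}  # (source, condition) -> most recent ticket for that key
    for timestamp, source, condition in events:
        key = (source, condition)
        ticket = open_ticket.get(key)
        if ticket is not None and timestamp - ticket["opened_at"] < window_seconds:
            # Same condition inside the window: count it on the open ticket.
            ticket["event_count"] += 1
        else:
            ticket = {"opened_at": timestamp, "source": source,
                      "condition": condition, "event_count": 1}
            tickets.append(ticket)
            open_ticket[key] = ticket
    return tickets


# Hypothetical storm: the same link-down event repeated once per second.
storm = [(t, "router-7", "link down") for t in range(1000)]
storm.append((1000, "SRV0001", "disk 95% full"))
tickets = deduplicate(storm)
print(len(storm), "events ->", len(tickets), "tickets")
```

The event count kept on each ticket preserves a sense of how noisy the condition was, without creating a separate ticket for every repetition.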

6.3.4. Understanding business impact

At the intersection of event aggregation and operations centers is the need to understand business impact. It is not, for example, always obvious what a server is being used for. This may be surprising to new students, and perhaps those with experience in smaller organizations. However, in many large “traditional” IT environments, where the operations team is distant from the development organization, it is not necessarily easy to determine what a given hardware or software resource is doing or why it is there. Clearly, this is unacceptable in terms of security, value management, and any number of other concerns. However, from the start of distributed computing, the question “what is on that server?” has been all too frequent in large IT shops.

In mature organizations, this may be documented in a Configuration Management Database or System (CMDB/CMS). Such a system might start by simply listing the servers and their applications:

Table 9. Applications and servers
Application Server













(Imagine the above list, 25,000 rows long).

This is a start, but still doesn’t tell us enough. A more elaborate mapping might include business unit and contact:

Table 10. Business units, contacts, applications, servers
BU Contact Application Server


Mary Smith




Aparna Chaudry




Mary Smith



Human Resources

William Jones



Human Resources

William Jones







The above lists are very simple examples of what can be extensive record-keeping. But the key user story is implied: if we can’t ping SRV0001, we know that the Quadrex application supporting Logistics is at risk, and we should contact Mary Smith ASAP if she hasn’t already contacted us. (Sometimes, the user community calls right away; in other cases, they may not, and proactively contacting them is a positive and important step).
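
That user story can be sketched as a simple lookup. The in-memory list below is a stand-in for a real CMDB, which would normally be a database with its own query interface. Only the SRV0001 row reflects the example in the text; the second row's application and server names are invented.

```python
# Minimal in-memory stand-in for a CMDB. Only the SRV0001 row reflects the
# example in the text; the second row's server and application are invented.
CMDB = [
    {"bu": "Logistics", "contact": "Mary Smith",
     "application": "Quadrex", "server": "SRV0001"},
    {"bu": "Human Resources", "contact": "William Jones",
     "application": "ExampleHRApp", "server": "SRV9999"},
]


def impact_of(server, records=CMDB):
    """List the applications, business units, and contacts that depend on a server."""
    return [
        (row["application"], row["bu"], row["contact"])
        for row in records
        if row["server"] == server
    ]


for application, bu, contact in impact_of("SRV0001"):
    print(f"{application} ({bu}) is at risk; contact {contact}")
```

An empty result is itself informative: it means nobody has recorded what the server is for, which is exactly the “what is on that server?” problem described above.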

The above approach is relevant to older models still reliant on servers (whether physical or virtual) as primary units of processing. The trend to containers and serverless computing is challenging these traditional practices, and what will replace them is currently unclear.

6.3.5. Capacity and performance management

Capacity management on Star Trek

"The tank can’t handle that much pressure."

"Where’d you get that idea?"

"What do you mean, where did I get that idea? It’s in the impulse engine specifications."

"Regulations 42/15: 'Pressure Variances in IRC Tank Storage'?"


"Forget it. I wrote it …​ A good engineer is always a wee bit conservative, at least on paper."

Conversation between Geordi LaForge and Montgomery Scott, Star Trek: The Next Generation, “Relics.”

Capacity and performance management are closely related, but not identical, terms encountered as IT systems scale up and encounter significant load.

A capacity management system may include large quantities of data harvested from monitoring and event management systems, stored for long periods of time so that history of the system utilization is understood and some degree of prediction can be ventured for upcoming utilization.
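
As a small illustration of venturing a prediction from history, the sketch below fits a straight line to past utilization samples and estimates how many days remain until a resource is full. A least-squares line is just one simple choice made here for illustration, since real utilization is rarely this linear. The sample figures are invented, and the function assumes at least two samples taken on different days.

```python
def days_until_full(samples, capacity):
    """Estimate days until usage reaches capacity from (day, used) samples.

    Fits a straight line by least squares; returns None if usage is not growing.
    """
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    sxy = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    # Day on which the fitted line crosses capacity, relative to the last sample.
    return (capacity - intercept) / slope - last_day


# Hypothetical history: storage used (GB) on five consecutive days.
history = [(0, 500), (1, 510), (2, 520), (3, 530), (4, 540)]
print(days_until_full(history, capacity=1000))
```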

Figure 118. Black Friday at Macy’s

The classic example of significant capacity utilization is the Black Friday/Cyber Monday experience of retailers (see Black Friday at Macy’s [51]). Both physical store and online e-commerce systems are placed under great strain annually around this time, with the year’s profits potentially on the line.

Performance management focuses on the responsiveness (e.g., speed) of the systems being used. Responsiveness may be related to capacity utilization, but some capacity issues don’t immediately affect responsiveness. For example, a disk drive may be approaching full. When it fills, the system will immediately crash, and performance is severely affected. But until then, the system performs fine. The disk needs to be replaced on the basis of capacity reporting, not performance trending. On the other hand, some performance issues are not related to capacity. A mis-configured router might badly affect a website’s performance, but the configuration simply needs to be fixed; there is no need to handle it as a capacity-related issue.

At a simpler level, capacity and performance management may consist of monitoring CPU, memory, and storage utilization across a given set of nodes, and raising alerts if certain thresholds are approached. For example, if a critical server is frequently approaching 50% CPU utilization (leaving 50% “headroom”), engineers might identify that another server should be added. Abbott and Fisher suggest, “As a general rule of thumb, we like to start at 50% as the ideal usage percentage and work up from there as the arguments dictate” [2 p. 204].
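
A threshold check of this kind might look like the following sketch. The 50% threshold comes from the rule of thumb above, but the definition of “frequently” as a quarter of recent samples, the function name, and the sample readings are assumptions made for illustration.

```python
def needs_more_capacity(cpu_samples, threshold_pct=50.0, frequent_fraction=0.25):
    """True when the share of samples at or above the threshold is 'frequent'."""
    if not cpu_samples:
        return False
    over = sum(1 for sample in cpu_samples if sample >= threshold_pct)
    return over / len(cpu_samples) >= frequent_fraction


# Hypothetical CPU utilization readings (percent) for one critical server.
readings = [22, 35, 51, 48, 63, 30, 55, 41]
if needs_more_capacity(readings):
    print("Capacity alert: consider adding a server")
```

Requiring a fraction of samples, not a single reading, to cross the threshold is one way to avoid alerting on a brief spike.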

So, what do we do when a capacity alert is raised, either through an automated system or through the manual efforts of a capacity analyst? There are a number of responses that may follow:

  • Acquire more capacity

  • Seek to use existing capacity more efficiently

  • Throttle demand somehow

Capacity analytics at its most advanced (i.e., across hundreds or thousands of servers and services) is a true Big Data problem domain and starts to overlap with IT asset management, capital planning, and budgeting in significant ways. As your organization scales up and you find yourself responding more frequently to the kinds of operational issues described in this section, you might start asking yourself whether you can be more proactive. What steps can you take when developing or enhancing your systems, so that operational issues are minimized? You want systems that are stable, easily upgraded, and that can scale quickly on-demand. Fortunately, there is a rich body of experience on how to build such systems, which we will discuss in a subsequent section.

6.4. Operational response

Monitoring communicates the state of the digital systems to the professionals in charge of them. Acting on that telemetry involves additional tools and practices, some of which we’ll review in this section.

Margaret Hamilton and the origins of digital operations
Figure 119. Margaret Hamilton

Margaret Hamilton, who wrote the software for the Apollo lunar missions, is credited with coining the term software engineering. She also created the concept of priority displays, software that alerts astronauts to information that requires their attention in real time (see Margaret Hamilton [52]).

6.4.1. Communication channels

When signals emerge from the lower levels of the digital infrastructure, they pass through a variety of layers and cause assorted, related behavior among the responsible digital professionals. The accompanying illustration shows a typical hierarchy, brought into action as an event becomes apparently more significant (see Layered communications channels).

Figure 120. Layered communications channels

The digital components send events to the monitoring layer, which filters them for significant concerns, for example, a serious application failure. The monitoring tool might automatically create a ticket, or perhaps it first provides an alert to the systems operators, who might instant message each other, or perhaps join a chat room.

If the issue can’t be resolved operationally before it starts to impact users, an Incident ticket might be created, which has several effects:

  • First, the situation is now a matter of record, and management may start to pay attention.

  • Accountability for managing the incident is defined, and expectations are that responsible parties will start to resolve it.

  • If assistance is needed, the Incident provides a common point of reference in terms of Work Management.

Depending on the seriousness of the incident, further communications by instant messaging, chat, cell phone, email, and/or conference bridge may continue. Severe incidents in regulated industries may require recording of conference bridges.

What is ChatOps?

ChatOps is the tight integration of instant communications with operational execution. As Eric Sigler of PagerDuty describes it, “While in a chat room, team members type commands that the chat bot is configured to execute through custom scripts and plugins. These can range from code deployments to security event responses to team member notifications. The entire team collaborates in real-time as commands are executed” [247].

Properly configured ChatOps provides a low-friction collaborative environment, enabling a powerful and immediate collective mental model of the situation and what is being done. It also provides a rich audit trail of who did what, when, and who else was involved. Fundamental governance objectives of accountability can be considered fulfilled in this way, on par with paper or digital forms routed for approval (and without their corresponding delays).
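
To show the basic mechanism, here is a toy sketch of a chat bot that maps typed commands to functions and records who ran what. It is not tied to any real chat platform or bot framework. The “!” command prefix, the class and handler names, and the placeholder deploy function are all hypothetical.

```python
import datetime


class ChatBot:
    """Toy chat bot: maps typed commands to functions and keeps an audit trail."""

    def __init__(self):
        self.commands = {}   # command name -> handler function
        self.audit_log = []  # (timestamp, user, message, reply) tuples

    def register(self, name, handler):
        self.commands[name] = handler

    def handle(self, user, message):
        """Run '!command arg ...' messages; ignore ordinary conversation."""
        parts = message[1:].split() if message.startswith("!") else []
        if not parts:
            return None
        name, args = parts[0], parts[1:]
        handler = self.commands.get(name)
        reply = handler(*args) if handler else f"unknown command: {name}"
        timestamp = datetime.datetime.now(datetime.timezone.utc)
        self.audit_log.append((timestamp, user, message, reply))
        return reply


def deploy(service, version):
    # Placeholder: a real handler would call a deployment pipeline here.
    return f"deploying {service} {version}"


bot = ChatBot()
bot.register("deploy", deploy)
print(bot.handle("alice", "!deploy web 1.4.2"))
```

Because every command passes through one handler, the audit trail of who did what, and when, falls out of the design with no extra effort from the team.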

6.4.2. Operational process emergence

Process is what makes it possible for teams to do the right thing, again and again.
— Limoncelli/Chalup/Hogan

Limoncelli, Chalup, and Hogan, in their excellent Cloud Systems Administration, emphasize the role of the “oncall” and “onduty” staff in the service of operations [169]. Oncall staff have a primary responsibility of emergency response, and the term oncall refers to their continuous availability, even if they are not otherwise working (e.g., they are expected to pick up phone calls and alerts at home and dial into emergency communications channels). Onduty staff are responsible for responding to non-critical incidents and maintaining current operations.

What is an emergency? It’s all a matter of expectations. If a system (by its SLA) is supposed to be available 24 hours a day, 7 days a week, an outage at 3 AM Saturday morning is an emergency. If it is only supposed to be available from Monday through Friday, the outage may not be as critical (although it still needs to be fixed in short order, otherwise there will soon be an SLA breach!).
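
The same reasoning can be written down as a small check. This sketch assumes the SLA's coverage can be expressed as a set of weekdays plus a daily hour range, which is a simplification, and the function and constant names are hypothetical.

```python
import datetime


def is_emergency(outage_start, service_days, service_hours=(0, 24)):
    """True if the outage begins inside the hours the SLA promises availability.

    service_days uses Python's weekday numbering: Monday is 0, Sunday is 6.
    """
    start_hour, end_hour = service_hours
    return (outage_start.weekday() in service_days
            and start_hour <= outage_start.hour < end_hour)


ALL_WEEK = set(range(7))  # 24x7 service
WEEKDAYS = set(range(5))  # Monday through Friday only

# 3 AM on a Saturday (6 January 2024 was a Saturday).
outage = datetime.datetime(2024, 1, 6, 3, 0)
print("24x7 service:", is_emergency(outage, ALL_WEEK))
print("weekday-only service:", is_emergency(outage, WEEKDAYS))
```

The check only classifies the moment the outage begins; as noted above, a weekend outage on a weekday-only service still has to be fixed before the service window reopens.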

IT systems have always been fragile and prone to malfunction. “Emergency” management is documented as a practice in “data processing” as early as 1971 [82 pp. 188-189]. In the last chapter, we discussed how simple task management starts to develop into process management. Certainly, there is a concern for predictability when the objective is to keep a system running, and so process management gains strength as a vehicle for structuring work. By the 1990s, a process terminology was increasingly formalized, by vendors such as IBM (in their “Yellow Book” series), the United Kingdom’s IT Infrastructure Library (ITIL), and other guidance such as the Harris Kern library (popular in the United States before ITIL gained dominance). These processes include:

  • Request management

  • Incident management

  • Problem management

  • Change management

Even for a single-product team, these processes are a useful framework to keep in mind as operational work increases. See Basic operational processes for definitions of the core processes usually first implemented.

Table 11. Basic operational processes
Process | Definition
Request management | Respond to routine requests such as providing systems access.
Incident management | Identify service outages and situations that could potentially lead to them, and restore service and/or mitigate immediate risk.
Problem management | Identify the causes of one or more Incidents and remedy them (on a longer-term basis).
Change management | Record and track proposed alterations to critical IT components. Notify potentially affected parties and assess changes for risk; ensure key stakeholders exercise approval rights.

These processes have a rough sequence to them:

  1. Give the user access to the system.

  2. If the system is not functioning as expected, identify the issue and restore service by any means necessary. Don’t worry about why it happened yet.

  3. Once service is restored, investigate why the issue happened (sometimes called a post-mortem) and propose longer-term solutions.

  4. Inform affected parties of the proposed changes, collect their feedback and approvals, and track the progress of the proposed change through successful completion.

At the end of the day, we need to remember that operational work is just one form of work. In a single-team organization, these processes might still be handled through basic task management (although user provisioning would need to be automated if the system is scaling significantly). It might be that the simple task management is supplemented with checklists since repeatable aspects of the work become more obvious. We’ll continue on the assumption of basic task management for the remainder of this chapter, and go deeper into the idea of defined, repeatable processes as we scale to a “team of teams” in Part III.

6.4.3. Post-mortems, blamelessness, and operational demand

We briefly mentioned problem management as a common operational process. After an incident is resolved and services are restored, further investigation (sometimes called “root cause analysis”) is undertaken as to the causes and long-term solutions to the problem. This kind of investigation can be stressful for the individuals concerned, and human factors become critical.

The term “root cause analysis” is viewed by some as misleading, as complex systems failures often have multiple causes. Other terms are post-mortems or simply causal analysis.

We have discussed psychological safety previously. Psychological safety takes on an additional and even more serious aspect when we consider major systems outages, many of which are caused by human error. There has been a long history of management seeking individuals to “hold accountable” when complex systems fail. This is an unfortunate approach, as complex systems are always prone to failure. Cultures that seek to blame do not promote a sense of psychological safety.

The definition of “counterfactual” is important. A “counterfactual” is seen in statements of the form, “If only Joe had not re-indexed the database, then the outage would not have happened.” It may be true that if Joe had not done so, the outcome would have been different. But there might be other such counterfactuals. They are not helpful in developing a continual improvement response. The primary concern in assessing such a failure is, how was Joe put in a position to fail? Put differently, how is it that the system was designed to be vulnerable to such behavior on Joe’s part? How could it be designed differently, and in a less sensitive way?

This is, in fact, how aviation has become so safe. Investigators with the unhappy job of examining large-scale airplane crashes have developed a systematic, clinical, and rational approach for doing so. They learned that if the people they were questioning perceived a desire on their part to blame, the information they provided was less reliable. (This, of course, is obvious to any parent of a four-year-old.)

John Allspaw, CTO of Etsy, has pioneered the application of modern safety and incident investigation practices in digital contexts and notably has been an evangelist for the work of human factors expert and psychologist Sidney Dekker. Dekker summarizes attitudes towards human error as falling into either the Old or New Views. He summarizes the old view as the Bad Apple theory:

  • Complex systems would be fine, were it not for the erratic behavior of some unreliable people (Bad Apples) in it.

  • Human errors cause accidents: humans are the dominant contributor to more than two thirds of them.

  • Failures come as unpleasant surprises. They are unexpected and do not belong in the system. Failures are introduced to the system only through the inherent unreliability of people.

Dekker contrasts this with the New View:

  • Human error is not a cause of failure. Human error is the effect, or symptom, of deeper trouble.

  • Human error is not random. It is systematically connected to features of people’s tools, tasks, and operating environment.

  • Human error is not the conclusion of an investigation. It is the starting point. [78]

Dekker’s principles are an excellent starting point for developing a culture that supports blameless investigations into incidents. We will talk more systematically of culture in Chapter 7.

Finally, once a post-mortem or problem analysis has been conducted, what is to be done? If work is required to fix the situation (and when is it not?), this work will compete with other priorities in the organization. Product teams typically like to develop new features, not solve operational issues that may call for reworking existing features. Yet serving both forms of work is essential from a holistic, design thinking point of view.

In terms of queuing, operational demand is too often subject to the equivalent of queue starvation — which as Wikipedia notes is usually the result of “naive scheduling algorithms.” If we always and only work on what we believe to be the “highest priority” problems, operational issues may never get attention. One result of this is the concept of technical debt, which we discuss in Part IV.
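The starvation effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real scheduler: the function names (`strict_priority`, `with_aging`), the numeric priorities, and the aging rate are all invented for the example. One operational fix waits in the backlog while a new, higher-priority feature request arrives on every tick, and the team completes one item per tick.

```python
def strict_priority(arrivals_per_tick, ticks):
    """Always work the highest-priority item; one item is completed per tick.

    Lower number = higher priority. Returns the names of completed items.
    """
    backlog = [(5, "ops-fix")]  # one operational item waiting from the start
    done = []
    for t in range(ticks):
        # New high-priority feature requests arrive every tick.
        for _ in range(arrivals_per_tick):
            backlog.append((1, f"feature-{t}"))
        backlog.sort(key=lambda item: item[0])  # stable sort by priority
        done.append(backlog.pop(0)[1])
    return done


def with_aging(arrivals_per_tick, ticks, aging=1):
    """Same queue, but every waiting item grows more urgent each tick."""
    backlog = [[5, "ops-fix"]]
    done = []
    for t in range(ticks):
        for _ in range(arrivals_per_tick):
            backlog.append([1, f"feature-{t}"])
        backlog.sort(key=lambda item: item[0])
        done.append(backlog.pop(0)[1])
        # Aging: anything still waiting moves up in priority.
        for item in backlog:
            item[0] -= aging
    return done


starved = strict_priority(arrivals_per_tick=1, ticks=50)
aged = with_aging(arrivals_per_tick=1, ticks=50)
```

Under strict prioritization the operational item is never completed, however long the simulation runs. With aging (one common remedy for starvation, chosen here purely as an example) it eventually reaches the front of the queue.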

6.5. Configuration management and operations

6.5.1. State and configuration

In computer science and automata theory, the state of a digital logic circuit or computer program is a technical term for all the stored information, at a given instant in time, to which the circuit or program has access. The output of a digital circuit or computer program at any time is completely determined by its current inputs and its state.
— Wikipedia

In all of IT (whether “infrastructure” or “applications”) there is a particular concern with managing state. IT systems are remarkably fragile. One incorrect bit of information — a “0” instead of a “1” — can completely alter a system’s behavior, to the detriment of business operations depending on it.

Therefore, any development of IT — starting with the initial definition of the computing platform — depends on the robust management of state.

The following are examples of state:

  • The name of a particular server

  • The network address of that server

  • The software installed on that server, in terms of the exact version and bits that comprise it.

State also has more transient connotations:

  • The current processes listed in the process table

  • The memory allocated to each process

  • The current users logged into the system

Finally, we saw in the previous section some server/application/business mappings. These are also a form of state.

It is therefore not possible to make blanket statements like “we need to manage state.” Computing devices go through myriad state changes with every cycle of their internal clock. (Analog and quantum computing are out of scope for this book.)

The primary question in managing state is “what matters”? What aspects of the system need to persist, in a reliable and reproducible manner? Policy-aware tools are used extensively to ensure that the system maintains its configuration, and that new functionality is constructed (to the greatest degree possible) using consistent configurations throughout the digital pipeline.
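One way to picture what such a tool does is as a comparison between declared (desired) state and observed state, restricted to the aspects we have decided matter. The sketch below is hypothetical: the keys (`hostname`, `ip_address`, `nginx_version`), the `PERSISTENT_KEYS` set, and the `find_drift` function are invented for illustration and do not correspond to any particular product.

```python
# Aspects of state we have decided "matter" and must persist.
PERSISTENT_KEYS = {"hostname", "ip_address", "nginx_version"}


def find_drift(desired, observed):
    """Return {key: (desired_value, observed_value)} for managed keys that differ."""
    drift = {}
    for key in PERSISTENT_KEYS:
        if desired.get(key) != observed.get(key):
            drift[key] = (desired.get(key), observed.get(key))
    return drift


desired = {
    "hostname": "web01",
    "ip_address": "10.0.0.5",
    "nginx_version": "1.24.0",
}
observed = {
    "hostname": "web01",
    "ip_address": "10.0.0.5",
    "nginx_version": "1.22.1",  # a different version was installed by hand
    "logged_in_users": 3,       # transient state: deliberately not managed
    "process_count": 212,       # transient state: deliberately not managed
}

drift = find_drift(desired, observed)
```

Here only the software version is reported as drift. The transient values change constantly and are ignored by design, which is the “what matters?” decision made explicit.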

6.5.2. Environments and the fierce god of “Production”

“Don’t mess with that server! It’s … Production!!!”
— unknown

“Production” is a term that new IT recruits rapidly learn has forbidding connotations. To be “in production” means that the broader enterprise value stream is directly dependent on that asset. Breakage or mishandling will result in questions and concerns from powerful forces. Be Very Scared.

How do things get to be “in production”? What do we mean by that?

First, let’s get back to our fundamental principle that there is an IT system delivering some “moment of truth” to someone. This system can be of any scale, but as above we are able to conceive of it having a “state.” When we want to change the behavior of this system, we are cautious. We reproduce the system at varying levels of fidelity (building “lower” environments with infrastructure as code techniques) and experiment with potential state changes. This is called development. When we start to gain confidence in our experiments, we increase the fidelity and also start to communicate more widely that we are contemplating a change to the state of the system. We may increase the fidelity along a set of traditional names (see Example environment pipeline):

  • Development

  • Testing

  • Quality assurance

  • Integration

  • Performance Testing

The final state, where value is realized, is “production.” Moving functionality in smaller and smaller batches, with increasing degrees of automation, is called continuous delivery.

Figure 121. Example environment pipeline

The fundamental idea that one sequentially moves (“promotes”) new system functionality through a series of states to gain confidence before finally changing the state of the production system is historically well established. You will see many variations, especially at scale, on the environments listed above. However, the production state is notoriously difficult to reproduce fully, especially in highly distributed environments. While infrastructure as code has simplified the problem, lower environments simply can’t match production completely in all its complexity, especially where interfaces to other systems or large, expensive pools of capacity are involved. Therefore there is always risk in changing the state of the production system. Mitigating strategies include:

  • Extensive automated test harnesses that can quickly determine if system behavior has been unfavorably altered.

  • Ensuring that changes to the production system can be easily and automatically reversed. For example, code may be deployed but not enabled until a “feature toggle” is set. This allows quick shutdown of that code if issues are seen.

  • Increasing the fidelity of lower environments with strategies such as service virtualization to make them behave more like production.

  • Hardening services against their own failure in production, or the failure of services on which they depend.

  • Reducing the size (and therefore complexity and risk) of changes to production (a key DevOps/continuous delivery strategy). Variations here include:

    • Small functional changes (“one line of code”)

    • Small operational changes (deploying a change to just one node out of 100, and watching it, before deploying to the other 99 nodes)

  • Using policy-aware infrastructure management tools.
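The feature toggle strategy mentioned in the list above can be sketched as follows. This is a minimal illustration with assumed names: the `FEATURE_TOGGLES` dictionary, the `new_checkout` flag, and the checkout functions are hypothetical, and a real system would usually read toggles from a configuration service rather than an in-memory dictionary.

```python
# Hypothetical in-memory toggle store (an assumption for this sketch).
FEATURE_TOGGLES = {"new_checkout": False}


def legacy_checkout(cart):
    """Existing, trusted code path."""
    return sum(cart)


def new_checkout(cart):
    """New code path: deployed, but dormant until the toggle is set."""
    return round(sum(cart) * 0.9, 2)  # e.g., applies a 10% promotion


def checkout(cart):
    """Choose the code path at run time based on the toggle."""
    if FEATURE_TOGGLES.get("new_checkout", False):
        return new_checkout(cart)
    return legacy_checkout(cart)


total_before = checkout([10, 20])        # toggle off: legacy behavior
FEATURE_TOGGLES["new_checkout"] = True   # enable the feature, no redeploy
total_enabled = checkout([10, 20])
FEATURE_TOGGLES["new_checkout"] = False  # issues seen: switch it off again
total_after = checkout([10, 20])
```

The point of the design is that enabling and reversing the change are both configuration operations, not deployments, so the reversal is fast and low-risk.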

Another important development in environmental approaches is A/B testing or canary deployments. In this approach, the “production” environment is segregated into two or more discrete states, with different features or behaviors exposed to users in order to assess their reactions. Netflix uses this as a key tool for product discovery, testing the user reaction to different user interface techniques for example. Canary deployments are when a change is deployed to a small fraction of the user base, as a pilot.
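A canary deployment needs some rule for deciding which users see the change. One simple possibility, sketched here under our own assumptions, is to hash a stable user identifier into one of 100 buckets and expose the change to the lowest few. The function name `in_canary` and the `user-N` identifiers are hypothetical.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically place roughly `percent`% of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


users = [f"user-{n}" for n in range(10_000)]
canary_users = [u for u in users if in_canary(u, 5)]
```

Because the bucket is derived from the identifier rather than chosen at random on each request, a given user stays consistently inside or outside the pilot group, and raising `percent` only ever adds users to it.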

Do we need environments at all?

Perhaps the concept of “environment” should be eliminated, as it can reinforce functional silos and waterfall thinking, and potentially the waste of fixed assets. Performance environments (that can emulate production at scale) are particularly in question.

Instead, in a dynamic infrastructure environment (private or public), you simply define the kind of test you want to perform and provision that capacity on demand.

6.5.3. “Development is production”

It used to be that the concept of “testing in production” was frowned upon. Now, with these mitigating strategies, and the recognition that complex systems cannot ever be fully reproduced, there is more tolerance for the idea. But with older systems that may lack automated testing, incremental deployment, or easy rollback, it is strongly recommended to retain existing promotion strategies, as these are battle-tested and known to reduce risk. Often, their cycle time can be decreased.

On the flip side, development systems must never be treated casually.

  • The development pipeline itself represents a significant operational commitment.

  • The failure of a source code repository, if not backed up, could wipe out a company (see [179]).

  • The failure of a build server or package repository could be almost as bad.

  • In the digital economy, dozens or hundreds of developers out of work represents a severe operational and financial setback, even if the “production” systems continue to function.

It is, therefore, important to treat “development” platforms with the same care as production systems. This requires nuanced approaches: with infrastructure as code, particular VMs or containers may represent experiments, expected to fail often and be quickly rebuilt. No need for burdensome change processes when virtual machine base images and containers are being set up and torn down hundreds of times each day! However, the platforms supporting the instantiation and teardown of those VMs are production platforms, supporting the business of new systems development.

6.6. Designing for operations and scale

Building a scalable system does not happen by accident.
— Limoncelli/Chalup/Hogan

Unfortunately, because of misinformation and hype, many people believe that the cloud provides instant high availability and unlimited scalability for applications.
— Abbott and Fisher
The Art of Scalability

Designing complex systems that can scale effectively and be operated efficiently is a challenging topic. Many insights have been developed by the large-scale public-facing Internet sites, such as Google, Facebook, Netflix, and others. The recommended reading at the end of this chapter provides many references.

A reasonable person might question why systems design questions are appearing here in the chapter on operations. We have discussed certain essential factors for system scalability previously: cloud, infrastructure as code, version control, and continuous delivery. These are all necessary but not sufficient for scaling digital systems. Once a system starts to encounter real load, further attention must go to how it runs, as opposed to what it does. It’s not easy to know when to focus on scalability. If product discovery is not on target, the system will never get the level of use that requires scalability. Insisting that the digital product has a state-of-the-art, scalable design might be wasteful if the team is still searching for a Minimum Viable Product (in Lean Startup terms). Of course, if you are doing systems engineering and building a “cog,” not growing a “flower,” you may need to be thinking about scalability earlier.

So, what often happens is that the system goes through various prototypes until something with market value is found, and at that point, as use starts to scale up, the team scrambles for a more robust approach. Chapter 2 warned of this, and mentioned the Latency Numbers; now would be a good time to review those.

There are dozens of books and articles discussing many aspects of how to scale systems. In this section, we will discuss two important principles (the CAP principle and the AKF scaling cube). If you are interested in this topic in depth, check out the references in this chapter.

6.6.1. The CAP principle

CAP triangle
Figure 122. CAP principle

Scaling digital systems used to imply acquiring faster and more powerful hardware and software. If a 4-core server with 8 gigabytes of RAM isn’t enough, get a 32-core server with 256 gigabytes of RAM (and upgrade your database software accordingly, for millions of dollars more). This kind of scaling is termed “vertical” scaling. However, web-scale companies such as Facebook and Google determined that this would not work indefinitely. Vertical scaling to unlimited capacity is not physically (or financially) possible. Instead, these companies began to experiment aggressively with using large numbers of inexpensive commodity computers.

The advantage of vertical scaling is that all your data can reside on one server, with fast and reliable access. As soon as you start to split your data across servers, you run into the practical implications of the CAP principle (see CAP principle).

CAP stands for:

  • Consistency

  • Availability

  • Partition-tolerance

and the CAP principle (or theorem) states that it is not possible to build a distributed system that guarantees all three [101]. What does this mean? First, let’s define our terms:

Consistency means that all the servers (or “nodes”) in the system see the same data at the same time. If an update is being processed, no node will see it before any other. This is often termed a transactional guarantee, and it is the sort of processing relational databases excel at.

For example, if you change your flight, and your seat opens up, a consistent reservation application will show the free seat simultaneously to anyone who inquires, even if the reservation information is replicated across two or more geographically distant nodes. If the seat is reserved, no node will show it available, even if it takes some time for the information to replicate across the nodes. The system will simply not show anyone any data until it can show everyone the correct data.

Availability means what it implies: that the system is available to provide data on request. If we have many nodes with the same data on them, this can improve availability, since if one is down, the user can still reach others.

Partition-tolerance is the ability of the distributed system to handle communications outages. If we have two nodes, both expected to have the same data, and the network stops communicating between them, they will not be able to send updates to each other. In that case, there are two choices: either stop providing services to all users of the system (failure of availability) or accept that the data may not be the same across the nodes (failure of consistency).
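The two choices available during a partition can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a real replication protocol: the `Replica` and `Cluster` classes, the `"CP"` and `"AP"` mode labels, and the `seat_12A` key are all invented for the example.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas that normally copy every write to each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False  # True when the replicas cannot communicate

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Favor consistency: refuse service rather than let data diverge.
                raise RuntimeError("unavailable: cannot reach peer replica")
            # Favor availability: accept the write locally; replicas now differ.
            replica.data[key] = value
            return
        # Healthy network: the write reaches both replicas.
        self.a.data[key] = value
        self.b.data[key] = value


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(cluster.a, "seat_12A", "free")
    cluster.partitioned = True  # the network between the replicas fails

ap.write(ap.a, "seat_12A", "reserved")  # succeeds, but replica b is now stale

try:
    cp.write(cp.a, "seat_12A", "reserved")
    cp_write_succeeded = True
except RuntimeError:
    cp_write_succeeded = False  # failure of availability
```

The availability-favoring cluster keeps answering but its replicas now disagree about the seat. The consistency-favoring cluster keeps its replicas identical by refusing the write.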

In the earlier years of computing, the preference was for strong consistency, and vendors such as Oracle profited greatly by building database software that could guarantee it when properly configured. Such systems could be consistent and available, but could not tolerate network outages — if the network was down, the system, or at least a portion of it, would also be down.

Companies such as Google and Facebook took the alternative approach. They said, “We will accept inconsistency in the data so that our systems are always available.” Clearly, for a social media site such as Facebook, a posting does not need to be everywhere at once before it can be shown at all. To verify this, simply post to a social media site using your computer. Do you see the post on your phone, or your friend’s, as soon as you submit it on your computer? No, although it is fast, you can see some delay. This shows that the site is not strictly consistent; a strictly consistent system would always show the same data across all the accessing devices.

The challenge with accepting inconsistency is how to do so. Eventually, the system needs to become consistent, and if conflicting updates are made they need to be resolved. Scalable systems in general favor availability and partition-tolerance as principles, and therefore must take explicit steps to restore consistency when it fails. The approach taken to partitioning the system into replicas is critical to managing eventual consistency, which brings us to the AKF scaling cube.
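As an illustration of an explicit step to restore consistency, here is one simple (and lossy) reconciliation rule, often called “last writer wins.” It is only one of several possible strategies, and the text does not claim that any particular company uses it. The function name, the timestamp values, and the sample data are assumptions for the sketch.

```python
def merge_last_writer_wins(replica_a, replica_b):
    """Merge two replicas, each mapping key -> (timestamp, value).

    For every key, keep the entry that carries the later timestamp.
    """
    merged = dict(replica_a)
    for key, (timestamp, value) in replica_b.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


# Two nodes that accepted different updates while they could not communicate.
node_a = {"status": (100, "At the airport"), "photo": (90, "cat.jpg")}
node_b = {"status": (105, "Boarding now"), "bio": (80, "Hello")}

converged = merge_last_writer_wins(node_a, node_b)
```

Both nodes can apply the same merge and arrive at the same result, which is what “eventually consistent” requires. The cost is that the older conflicting update is silently discarded, and the rule depends on the nodes having comparable clocks.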

For further discussion, see [169], section 1.5.

6.6.2. The AKF scaling cube

AKF cube
Figure 123. AKF scaling cube

Another powerful tool for thinking about scaling systems is the AKF Scaling Cube (see AKF scaling cube [53]). AKF stands for Abbott, Keeven, and Fisher, authors of The Art of Scalability [2]. The AKF cube is a visual representation of the three basic options for scaling a system:

  • Replicate the complete system (x-axis)

  • Split the system functionally into smaller layers or components (y-axis)

  • Split the system’s data (z-axis)

POS terminals
Figure 124. Point of sale terminals — horizontal scale

A complete system replica is similar to the Point of Sale (POS) terminals in a retailer (see Point of sale terminals — horizontal scale [54]). Each is a self-contained system with all the data it needs to handle typical transactions. POS terminals do not depend on each other; therefore you can keep increasing the capacity of your store’s checkout lines by simply adding more of them.
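In software, the equivalent of adding checkout lanes is spreading requests across identical replicas. The following minimal sketch uses simple rotation (round robin). The terminal names and the `assign` function are hypothetical, and round robin is just one of many possible balancing rules.

```python
from itertools import cycle

terminals = ["pos-1", "pos-2", "pos-3"]  # identical, self-contained replicas
next_terminal = cycle(terminals)


def assign(customer):
    """Send each arriving customer to the next terminal in rotation."""
    return (customer, next(next_terminal))


assignments = [assign(f"customer-{n}") for n in range(6)]
```

Because every replica can handle any request, adding capacity only means adding another entry to the `terminals` list.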

Functional splitting is when you separate out different features or components. To continue the retail analogy, this is like a department store; you view and buy electronics, or clothes, in those specific departments. The store “scales” by adding departments, which are self-contained in general; however, in order to get a complete outfit, you may need to visit several departments. In terms of systems, separating web and database servers is commonly seen — this is a component separation. E-commerce sites often separate “show” (product search and display) from “buy” (shopping cart and online checkout); this is a feature separation. Complex distributed systems may have large numbers of features and components, which are all orchestrated together into one common web or smartphone app experience.
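A feature separation such as “show” versus “buy” is often visible as a routing rule in front of the services. The sketch below assumes hypothetical URL prefixes and service names; it illustrates the idea and is not a description of any specific site.

```python
# Hypothetical mapping from URL prefix to the service that owns the feature.
ROUTES = {
    "/search": "show-service",
    "/product": "show-service",
    "/cart": "buy-service",
    "/checkout": "buy-service",
}


def route(path):
    """Return the name of the service responsible for a request path."""
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return service
    raise LookupError(f"no service owns {path}")


browse_target = route("/search?q=shoes")
purchase_target = route("/checkout/confirm")
```

Each service can now be scaled, deployed, and operated separately, at the price of needing something (here, the routing table) to orchestrate them into one experience.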

conference registrations
Figure 125. Partitioning by data range at a conference

Data splitting (sometimes termed “sharding”) divides the data itself across nodes; once data is split this way, the partition concerns from the CAP discussion above come into play. Have you ever checked into a large event, and the first thing you see is check-in stations divided by alphabet range (see Partitioning by data range at a conference [55])? For example:

  • A-H register here

  • I-Q register here

  • R-Z register here

This is a good example of splitting by data. In terms of digital systems, we might split data by region; customers in Minnesota might go to the Kansas City data center, while customers in New Jersey might go to a North Carolina data center. Obviously, the system needs to handle situations where people travel or move.
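The conference check-in rule translates directly into a range-based shard lookup. The sketch below is a minimal illustration; the station names, the `SHARDS` table, and the `shard_for` function are assumptions made for the example.

```python
# Each entry: (first letter in range, last letter in range, shard name).
SHARDS = [
    ("A", "H", "station-1"),
    ("I", "Q", "station-2"),
    ("R", "Z", "station-3"),
]


def shard_for(last_name):
    """Pick the shard whose alphabet range covers the name's first letter."""
    initial = last_name.strip()[:1].upper()
    for low, high, shard in SHARDS:
        if low <= initial <= high:
            return shard
    raise ValueError(f"no shard covers {last_name!r}")


first = shard_for("Betz")
second = shard_for("Pichler")
third = shard_for("sloss")
```

A real system would also need to decide what to do with names outside the expected ranges and how to rebalance when one range turns out to be much busier than the others.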

There are many ways to implement and combine the three axes of the AKF scaling cube to meet the CAP constraints (consistency, availability, and partition-tolerance). With further study of scalability, you will encounter discussions of:

  • Load balancing architectures and algorithms

  • Caching

  • Reverse proxies

  • Hardware redundancy

  • Designing systems for continuous availability during upgrades

and much more. For further information, see [2, 169].

6.7. Advanced topics in operations

6.7.1. Drills, game days, and Chaos Monkeys

…​ just designing a fault-tolerant architecture is not enough. We have to constantly test our ability to actually survive these “once in a blue moon” failures.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system and build automatic recovery mechanisms to deal with them.
— Izrailevsky and Tseitlin
Netflix Tech Blog

As noted above, it is difficult to fully reproduce complex production infrastructures as “lower” environments. Therefore, it is difficult to have confidence in any given change until it has been run in production.

The need to emulate “real world” conditions is well understood in the military, which relies heavily on drills and exercises to ensure peak operational readiness. Analogous practices are emerging in digital organizations, such as the concept of “Game Days”: defined periods when operational disruptions are simulated and the responses assessed. A related set of tools is the Netflix Simian Army, a collection of resiliency tools developed by the online video-streaming service Netflix. It represents a significant advancement in digital risk management, as previous control approaches were too often limited by poor scalability or human failure (e.g., forgetfulness or negligence in following manual process steps).

Chaos Monkey is one of a number of tools developed to continually “harden” the Netflix system, including:

  • Latency Monkey: introduces arbitrary network delays

  • Conformity Monkey: checks for consistency with architectural standards, and shuts down non-conforming instances

  • Doctor Monkey: checks for longer-term evidence of instance degradation

  • Janitor Monkey: checks for and destroys unused running capacity

  • Security Monkey: an extension of Conformity Monkey, checks for correct security configuration

  • 10-18 Monkey: checks internationalization

  • Finally, Chaos Gorilla simulates the outage of an entire Amazon availability zone

On the whole, the Simian Army behaves much as antibodies do in an organic system. One notable characteristic is that the monkeys as described do not generate a report (a secondary artifact) for manual followup. They simply shut down the offending resources.
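The core idea of random instance termination can be simulated in a few lines. This toy model is not Netflix’s tool and does not touch any real infrastructure: the `Service` class, the `chaos_monkey` function, and the instance naming are hypothetical, and the “automatic recovery” is simply a loop that tops the pool back up.

```python
import itertools
import random


class Service:
    """A pool of interchangeable instances with simple automatic recovery."""

    def __init__(self, target):
        self.target = target
        self._ids = itertools.count()
        self.healthy = set()
        self.recover()

    def recover(self):
        """Launch replacement instances until the target count is restored."""
        while len(self.healthy) < self.target:
            self.healthy.add(f"i-{next(self._ids)}")

    def handle_request(self):
        """Succeeds as long as at least one instance is healthy."""
        if not self.healthy:
            raise RuntimeError("outage: no healthy instances")
        return "ok"


def chaos_monkey(service, rng):
    """Terminate one randomly chosen healthy instance and return its name."""
    victim = rng.choice(sorted(service.healthy))
    service.healthy.remove(victim)
    return victim


rng = random.Random(42)  # fixed seed so the experiment is repeatable
service = Service(target=3)
responses = []
for _ in range(20):
    chaos_monkey(service, rng)
    responses.append(service.handle_request())  # still serving on 2 of 3
    service.recover()
```

The experiment passes only because the service was designed with redundancy and recovery. With `target=1`, the same loop would produce an outage on the first iteration, which is exactly the kind of weakness such drills are meant to expose.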

Such direct action may not be possible in many environments but represents an ideal to work toward. It keeps the security and risk work “front and center” within the mainstream of the digital pipeline, rather than relegating it to the bothersome “additional work” it can so easily be seen as.

A new field of Chaos Engineering is starting to emerge centered on these concepts.

6.7.2. Site reliability engineering at Google

SRE is what happens when you ask a software engineer to design an operations team.
— Benjamin Treynor Sloss
in Beyer/Site Reliability Engineering

Site reliability engineering is a new term for operations-centric work, originating from Google and other large digital organizations. It is clearly an operational discipline; the SRE team is responsible for the “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service” [28 p. 7].

Google site reliability engineers have strong technical backgrounds, frequently computer science, which is atypical for operations staff in the traditional IT industry. SREs are strongly incented to automate as much as possible, avoiding “toil” (i.e., repetitive, non-value-add tasks). In other words, as Benjamin Sloss says, “we want systems that are automatic, not just automated” [28].

Google has pioneered a number of innovative practices with its SRE team, including:

  • A 50% cap on aggregate “ops” work — the other 50% is supposed to be spent on development

  • The concept of an “error budget” as a control mechanism — teams are incented not for “zero downtime” but rather to take the risk and spend the error budget

  • “Release Engineer” as a specific job title for those focused on building and maintaining the delivery pipeline
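The arithmetic behind an error budget is straightforward: the budget is whatever the availability target leaves over. The sketch below works this out for a 30-day period. The function names and the release rule in `may_release` are illustrative assumptions, not a description of Google’s internal policy.

```python
def error_budget_minutes(slo, period_days=30):
    """Minutes of allowed unavailability for an availability target `slo`."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo, downtime_minutes, period_days=30):
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


def may_release(slo, downtime_minutes, period_days=30):
    """Assumed policy: risky releases proceed only while budget remains."""
    return budget_remaining(slo, downtime_minutes, period_days) > 0


budget = error_budget_minutes(0.999)      # about 43.2 minutes per 30 days
early_month = may_release(0.999, 10)      # 10 minutes used: budget remains
after_outage = may_release(0.999, 50)     # 50 minutes used: budget exhausted
```

A 99.9% target over 30 days leaves roughly 43 minutes of tolerated downtime. Framing it as a budget makes the trade-off explicit: unspent budget is an invitation to take more risk, and an exhausted budget is a signal to slow down and stabilize.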

6.8. Conclusion

While operations management is an integral part of “DevOps,” and of a given product’s management, it is still a distinct area of concern and practice. The focus on interrupt-driven work results in different personality types being attracted to it and different management approaches required.

Operations management has a broader industry meaning, but in the narrower IT sense, includes concerns of monitoring, state and configuration management, capacity and performance concerns, and related topics.

6.8.1. Discussion questions

  • What experiences have you had where operational execution made the difference, either good or bad (e.g., systems down, bad customer service, or good service)?

  • Have you or anyone you’ve known worked in a call center or operations center? What was it like?

  • Do you prefer interrupt-driven or focused development work?

  • Read Ops, DevOps, and PaaS (NoOps) at Netflix and Allspaw’s response. Discuss as a team.

  • What do you think of the concept of an “error budget” (like Google has) rather than just insisting that all systems should be up as much as possible?

6.8.2. Research & practice

  • Stand up two VMs, one with a web server, the other with Nagios or another open-source monitoring tool. Experiment with using Nagios to monitor various aspects of the web server.

  • Define customer application-level monitoring and test that it operates.

  • Integrate a monitoring system such as Nagios with an IT service management tool such as iTOP, so that certain events auto-create Incident tickets.

  • In a team Kanban setting, experiment with both user stories and operations requests coming into the same board or product team. What issues arise? How can you best handle the diversity of work? What if there were ten times as much, what would you do?

Part II Conclusion

In this section, we considered the basic elements necessary for a collaborative product team to achieve success while still at a manageable human scale.

Our IT-centric team needed capabilities for evolving its product, managing its work, and operating its product. In some cases, time and space shifting might drive the team to automate basic capabilities such as work management and ticketing. However, the overall assumption was that, for the most part, people are co-located and still can communicate with minimal friction.

Part II leads logically to Part III. We have a high-functioning team. But a team cannot scale indefinitely. We now have no choice but to organize as a team of teams.

Part III — Team of Teams

Large meeting
Figure 126. All hands meeting at NASA Goddard
Team of teams

Team of Teams: New Rules of Engagement for a Complex World is the name of a 2015 book by General Stanley McChrystal, describing his experiences as the commander of Joint Special Operations Command in the Iraq conflict. It describes how the U.S. military was being beaten by a foe with inferior resources, and its need to shift from a focus on mechanical efficiency to more adaptable approaches. The title is appropriate for this section, as moving from “team” to “team of teams” is one of the most challenging transitions any organization can make.


You are now a “team of teams,” at a size where face to face communication is increasingly supplemented by other forms of communication and coordination. Your teams are all good and get results but in different ways. You need some level of coordination, and not everyone is readily accessible for immediate communication; people are no longer co-located, and there may be different schedules involved.[56]

You now have multiple products. As you scale up, you now must split your products into features and components (the y-axis of the AKF scaling cube). As you move from your first product to adding more, organizational evolution is required. You try to keep your products from developing unmanageable interdependencies, but this is an ongoing challenge. Tensions between various teams are starting to emerge. You are seeing specialization in your organization increasing. You see a tendency of specialists to identify more with their field than with the needs of your customers and your business. There is an increasing desire among your stakeholders and executives for control and predictability. Resources are limited and always in contention. You are considering various frameworks for managing your organization. As we scale, however, we need to remember that our highest value is found in fast-moving, committed, multi-skilled teams. Losing sight of that value is a common problem for growing organizations. This is where it gets hard.

As you become a manager of managers, your concerns again shift. In Part II, you had to delegate product management (are they building the right thing?) and take concern for basic work management and digital operations. Now, as your organization grows, you are primarily a manager of managers, concerned with providing the conditions for your people to excel:

  • Defining how work is executed, in terms of decision rights, priorities, and conflicts

  • Setting the organizational mission and goals that provide the framework for making investments in products and projects

  • Instituting labor, financial, supply chain, and customer management processes and systems

  • Providing facilities and equipment to support digital delivery

  • Resolving issues and decisions escalated from lower levels in the organization

(influenced by [214]).

New employees are bringing in their perspectives, and the more experienced ones seem to assume that the company will use “projects” and “processes” to get work done. There is no shortage of contractors and consultants who advocate various flavors of process and project management; while some advocate older approaches and “frameworks,” others propose newer Agile and Lean perspectives. However, the ideas of process and project management are occasionally called into question by both your employees and various “thought leaders,” and it’s all very confusing.

Welcome to the coordination problem. We need to understand where these ideas came from, how they relate to each other, and how they are evolving in a digitally transforming world.

Here is an overview of Part III’s structure:

Special section: Scaling the organization and its work

Digital professionals use a number of approaches to defining and managing work at various scales. Our initial progression from the product, to work, to operations management, can be seen as one dimension. We consider a couple of other dimensions as a basis for ordering Part III.

Chapter 7: Coordination

Going from one to multiple teams is hard. No matter how you structure things, there are dependencies requiring coordination. How do you ensure that broader goals are met when teams must act jointly? Some suggest project management, while others argue that you don’t need it any more — it’s all about continuous flow through loosely-coupled product organizations. But you’ve seen that your most ambitious ideas require some kind of choreography and that products and projects need certain resources and services delivered predictably. When is work repeatable? When is it unique? Understanding the difference is essential to your organization’s success. Is variability in the work always bad? These are questions that have preoccupied management thinkers for a long time.

Chapter 8: Planning and investment

Each team also represents an investment decision. You now have a portfolio of features and/or products. You need a strategy for choosing among your options and planning (at least at a high level) in terms of costs and benefits. Some of you may be using project management to help manage your investments. Your vendor relationships continue to expand; they are another form of strategic investment, and you need to deepen your understanding of matters like cloud contracts and software licensing. Finally, what is your approach to finance and budgeting?

In terms of classic project methodology, Chapter 8 includes project initiating and planning. Execution, monitoring and control of day-to-day work are covered in Chapter 7. The seemingly backwards order is deliberate, in keeping with the emergence model.

Chapter 9: Organization and culture

You’re getting big. In order to keep growing, you have had to divide your organization. How are you formally structured? How are people grouped, and to whom do they report, with what kind of expectations? Finally, what is your approach to bringing new people into your organization? What are the unspoken assumptions that underlie your daily work (in other words, what is your culture)? Does your culture support high performance, or the opposite? How can you measure and know such a thing?

Part III, like the other parts, needs to be understood as a unified whole. In reality, growing companies struggle with the issues in all three chapters simultaneously.

Special section: Scaling the organization and its work

Avoid large projects. Start small and quickly develop a product with the minimum functionality. If you have to employ a large project, scale slowly and grow the project organically by adding one team at a time. Starting with too many people causes products to be overly complex, making future product updates time-consuming and expensive.
— Roman Pichler
Agile Product Management with Scrum

As we begin the second half of this book, consider Pichler’s advice above. We have spent Chapters 1 through 6 (the first half of the book) thinking mainly in terms of one product and its dimensions. We are scaling now because we must; we have increasingly diverse product opportunities, or one product that has become so large it must be partitioned in some manner. Or both.

The two dimensions of demand management

To provide a framework for Part III, let’s start with the two-dimensional analysis in Two dimensions of demand management.

You should spend some time reviewing the graphic, which provides a unique way of understanding the work you are now experiencing as a “team of teams” or “manager of managers” in an IT-dependent environment of increasing size and complexity. We’ve come a long way since our discussion of work management. By the time we started to formalize operations, we saw that work was tending to differentiate. Still, regardless of the label we put on a given activity, it represents some set of tasks or objectives that real people are going to take the time to perform, and expect to be compensated for. It is all demand, requiring management. Remembering this is essential to digital management.

Figure 127. Two dimensions of demand management

Let’s consider the various forms that demand may take. Understanding these demand forms will also help you develop a deeper understanding of an architecture of IT management, a topic I have explored in other works [25]. The diagram has two dimensions:

  • Planning

  • Granularity

Planning. As an organization scales, there is an increasing span in your time horizon and the scope of work you are considering and executing. From the immediate, “hand-to-mouth” days of your startup, you must now concern yourself with longer and longer timeframes: contracts, regulations, and your company’s strategy as it grows all demand this.

Granularity. The terminology you use to describe your work also becomes more diverse, reflecting in some ways the broader time horizons you are concerned with. Requests, changes, incidents, work orders, releases, stories, features, problems, major incidents, epics, refreshes, products, programs, strategies; there is a continuum of how you think about your organization’s work efforts. Mostly, the range of work seems tied to how much planning time you have, but there are exceptions: disasters take a lot of work, but you don’t get much advance warning! So the size of work is independent of the planning horizon.

The bubbles represent a “space” where one is likely to find that kind of work. The central diagonal reflects an assumption that larger amounts of work are more likely to be planned further in advance. However, this is not always true. A large, unwelcome amount of required work that shows up with no planning is probably a disaster. Desired work (in the form of aggregate transactional demand) may also spike unexpectedly. Transactional demand considered across a long timeframe is capacity management. The table Work items of varying sizes lists various examples.

Table 12. Work items of varying sizes

| Type of work | Description |
|---|---|
| Core transactional demand | This is the demand on the fully automated IT system for a given moment of truth: a banking account lookup, a streaming movie, a Human Resources record update. |
| Routine service requests and incidents | Service requests are predefined, process-driven work items, rarely requiring creative thought or analysis. Incidents span a spectrum, but some are simpler and more routine than others, especially those stemming from user misunderstanding or error. |
|  | Changes represent modifications of established IT functionality or state. They represent some definite risk to one or more IT services, which is why they are planned on a longer lead time. However, they ideally remain relatively granular, which helps reduce their risk. |
| Routine releases, stories, reports | Releases and (in the Agile world) stories represent larger increments of functionality. |
|  | A Project is a large, planned amount of work with a defined end date. It might create a Service, which also represents a commitment to a large, ongoing amount of work, perhaps comparable in scope to the Project. |
| Major incidents | Major incidents by definition are not planned. But they represent a significant amount of work to overcome. |

Some forms of work may lead to other forms of work. For example, Projects may manifest themselves as Stories, Releases, and Changes. This complicates the diagram a bit; we don’t want to “double-count” work effort. But not all Releases derive from Projects, and not all Project work (especially in complex environments) can be cleanly reduced to a set of smaller tasks.

The final point of this diagram: you only have so much capacity to execute the work it implies. If you have a disaster or a series of major incidents, this unplanned work may impact your ability to deliver user stories, changes, or even meet transactional demand. Trade-offs must be considered.
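The capacity trade-off can be illustrated with a minimal sketch. The names `WorkItem` and `allocate`, the hour figures, and the policy of serving unplanned work first are all assumptions made for illustration; the only point is that planned and unplanned work draw on the same finite capacity.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    name: str
    hours: float
    planned: bool


def allocate(capacity_hours, items):
    """Fill capacity with unplanned work first (an assumed policy), then planned work.

    Returns (scheduled, deferred) lists of WorkItem.
    """
    # sorted() is stable; False (unplanned) sorts before True (planned).
    ordered = sorted(items, key=lambda item: item.planned)
    scheduled, deferred = [], []
    remaining = capacity_hours
    for item in ordered:
        if item.hours <= remaining:
            scheduled.append(item)
            remaining -= item.hours
        else:
            deferred.append(item)
    return scheduled, deferred


# Hypothetical week: 40 hours of capacity, one unplanned major incident.
week = [
    WorkItem("user story A", 30, planned=True),
    WorkItem("change B", 10, planned=True),
    WorkItem("major incident", 25, planned=False),
]
scheduled, deferred = allocate(40, week)
```

In this hypothetical week, the incident displaces the user story, while the smaller change still fits in the remaining capacity.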

Adding a third dimension with Cynefin

Figure 128. Cynefin thinking framework

This third dimension of variability is challenging to understand and touches on our earlier discussion of systems thinking. A helpful framework to understand it is the Cynefin framework, by Dave Snowden and Cynthia Kurtz [163] (see Cynefin thinking framework [57]). Cynefin proposes that there are five major domains useful in understanding situations:

  • Simple/Obvious

  • Complicated

  • Complex

  • Chaotic

  • Disorder

The simple or obvious domain is straightforward, repeatable, and cause and effect are known. The concept of “best practice” applies. The mode of action is to sense, categorize, and respond.

Figure 129. Variability as Cynefin domains

The complicated domain requires analysis and expertise; there may be several right or at least serviceable answers. Rational thought is possible, and cause and effect relationships may be more challenging to understand, but still are applicable. Mode of action is to sense, analyze, and respond.

The complex domain is that of systems thinking. Cause and effect are apparent only in hindsight. Interdependencies complicate action. Reinforcing loops can quickly accelerate or conversely, counterbalancing loops kick in and prevent desired changes from happening. Both make linear assumptions hazardous. Mode of action is to probe, sense, and respond (“probe” being to make a small change). Much of modern product development and DevOps thinking is optimized for this domain because simple and linear approaches have so frequently failed.

In the chaotic domain, cause and effect are not apparent even in hindsight. The situation is completely unpredictable, and action is essential — better to act in any direction rather than being paralyzed. The mode of action is to act, sense, and respond.

Finally, disorder is considered to be the domain you’re in when you have not figured out which of the other four applies.
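The domains and their modes of action, as described above, can be summarized in a small lookup structure. This is only a sketch for reference; the names `Domain`, `MODE_OF_ACTION`, and `mode_of_action` are illustrative and not part of the Cynefin framework itself.

```python
from enum import Enum


class Domain(Enum):
    OBVIOUS = "simple/obvious"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# Mode of action per domain, taken from the descriptions in the text.
MODE_OF_ACTION = {
    Domain.OBVIOUS: ("sense", "categorize", "respond"),
    Domain.COMPLICATED: ("sense", "analyze", "respond"),
    Domain.COMPLEX: ("probe", "sense", "respond"),
    Domain.CHAOTIC: ("act", "sense", "respond"),
}


def mode_of_action(domain):
    """Return the steps for a domain, or None for disorder,
    where it is not yet known which of the other four applies."""
    return MODE_OF_ACTION.get(domain)
```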

Figure 130. Part II: increasing certainty (credit to Cantor)

The two-dimensional model above (Two dimensions of demand management) does not, however, describe how uncertain or unpredictable the work is. The predictability of the work is an independent dimension. You might have two projects, both taking the same effort. One of them you were able to predict easily, while the other was not predictable; more precisely, your expected time, effort, and cost were a long way off from what you wound up spending (usually in an unfavorable direction).

Part II (Chapters 4-6, which we just finished) can be viewed as a logical progression from the uncertainty of developing a novel product, to the day-to-day work of building its features, to its predictable operation. The “predictability curve” illustrated in Part II: increasing certainty (credit to Cantor) [58] increases as the digital product stabilizes and moves to a fully operational state.

This question of predictability, of the degree to which actuals track estimates and can be known in advance, will be an ongoing theme throughout Part III. As we scale up, our organization takes on more and more work of all kinds, from highly uncertain to very predictable. Understanding the differences in this “portfolio” of work is essential to managing it correctly. There has always been an element of risk; as a startup, your success was not guaranteed! You now find that you are managing different classes of risk simultaneously, and “one size fits all” approaches do not work.

You might have a program to upgrade the memory on 80,000 identical POS terminals across 2,000 retail stores. It’s going to take a lot of work; you’ll be “rolling trucks” in all 50 states! But you are sure that you can estimate this work with a high degree of accuracy; it has high predictability. In terms of the Cynefin framework, it’s an obvious (or maybe complicated) problem. On the other hand, creating a completely new POS system for your stores is an unpredictable effort. Your original estimate for this large program might be off by orders of magnitude. Its predictability is low. It’s a complex problem.

Or perhaps you are writing reports using a well understood database and reporting tool. This work is likely to be more predictable, even if complicated in the Cynefin sense, than developing the first few stories on a completely new architecture. This is true even if the estimated size of the work is the same for both the reports and the new stories. As a dimension, variability is independent of the size of the work (although the two may be correlated).
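One simple way to make “the degree to which actuals track estimates” concrete is a relative estimate error. The sketch below assumes this particular measure and uses invented hour figures; it is not a metric prescribed by the text.

```python
def estimate_error(estimated, actual):
    """Relative deviation of actual from estimate; 0.0 means the estimate held."""
    return (actual - estimated) / estimated


# Hypothetical figures: same estimated size, very different predictability.
reports = estimate_error(estimated=400, actual=440)            # well-understood tooling
new_architecture = estimate_error(estimated=400, actual=1200)  # novel architecture
```

Both efforts have the same estimated size, yet the second overruns far more, which is the sense in which variability is a dimension separate from size.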

One of the most challenging open questions, as Agile and DevOps continue to increase their influence, concerns organizations with larger portfolios of older, less risky systems: for example, systems developed by external vendors but still run in-house. Not everything is available or suitable to be run under a Software as a Service model. The staffing ratios and work models required for such portfolios may not necessarily support the ideal of high-performance, cross-functional teams. Nor will these systems benefit from high-performance Lean Product Development techniques; such approaches might be overkill. The industry is just beginning to think about these issues.

The Betz organizational scaling cube

Figure 131. Betz organization scaling cube

When we combine the three dimensions:

  • Size of work;

  • Time horizon; and

  • Predictability

we get the Betz organizational scaling cube (see Betz organization scaling cube [59]), which shows visually the three dimensions we’ll consider throughout Part III. Together, the three dimensions define a space for understanding work, resources, and planning as we scale the organization.

The z-axis of variability can be seen as a progression along the first four Cynefin domains (see sidebar). At the origin at lower left, we have predictable, small-grained work occurring in short “planning” horizons (e.g., automated transactions running on computers). As we scale out to larger domains of work, longer timeframes, and greater variability in planning, we encounter the problems of growth, coordination, strategy, and the fundamental uncertainties of operating in a chaotic, competitive world.

Demand, supply, and execution

execution (n.) 14c., “a carrying out, a putting into effect; enforcement; performance (of a law, statute, etc.), the carrying out (of a plan, etc.),” from Anglo-French execucioun (late 13c.), Old French execucion “a carrying out” (of an order, etc.), from Latin executionem (nominative executio) “an accomplishing,” noun of action from past participle stem of exequi/exsequi “to follow out” (see execute).
— Online Etymology Dictionary

In order to understand the concept of execution, we need to think about supply versus demand. Think about the kinds of demand described above. Each form of demand implies some kind of supply to meet it. For example, the demand that an automated transaction is to be executed requires the supply of appropriate computing capacity at the necessary place and time. The demand that a new story is to be supported as part of a software product feature requires the supply of a software development team’s time and attention. And a major product or project requires the supply perhaps of many teams as well as other resources (hardware and software assets, for example).

In the Betz organization scaling cube, work and execution converge to the origin at front lower left. An alternate view that helps us describe the chapter structure is the convergence point at the top of a pyramid (see Demand-supply-execute model). This rotated approach is compatible with the dual-axis value chain.

Bottom to top, this diagram tells a story of demand and supply as they progress through increasingly refined understandings to the specific execution of work and delivery of value. We have markets and regulations, which define and constrain the potential demand for the digital product. Markets are met with capital funding, human resources, strategies and product offerings, which lead to programs of work, projects, and platform decisions. These, in turn, lead to identifying user stories, writing software, configuring platforms, and executing changes, service requests, and work tasks.

That finer and finer grained demand stream converges with a finer and finer grained supply stream. Large blocks of capital are translated into strategic technology choices and vendor relationships, organizational structures and investments in skilled people. More detailed budgets and planning culminate ultimately in the availability of people, hardware, and software for given assignments, e.g., an empty slot on a Kanban board. The journey can start anywhere, with a large block of traditionally managed programmatic capital or a small round of seed funding translated directly into a two-pizza team with maximum autonomy, which then grows and leads to larger investments.

Figure 132. Demand-supply-execute model

Ultimately the deployed IT service system is available for fulfilling transactional service demand, which can be measured in terms of quality, availability, and performance. Execution, in this model, is defined as the irrevocable combination of demand with supply. The gap between the legs of the V is filled with the “Fog of Forecasting.” With the lower-level, larger grained abstractions it is more difficult to understand demand and supply, especially when product development (e.g., novel software engineering) is involved. (Understanding the opportunities of large grained demand and matching those with significant supply is a strategy.) As demand and supply converge to the point of execution, a finer and finer grained awareness is created of the impending work and whether it is likely to be successful, that is, whether demand will effectively and efficiently be paired with supply.

Notice how the fog lifts as you get closer to actual execution. The closer we get to the point of execution, the better understanding we have of the team and individual level assignments across all queues and Kanban slots or their equivalent (e.g., assigned and accepted work orders). Ultimately the demand represents the usage of the automated digital system’s capacity. Notice that in terms of the Betz cube z-axis, we still can have high variability at the point of execution, if we are considering a system executing many forms of work. In other words, surprises can happen at any time.
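The idea of execution as the pairing of demand with supply can be sketched with two queues. Everything here is an illustrative assumption: the function name `execute`, the first-come-first-served pairing, and the sample backlog and slot names.

```python
from collections import deque


def execute(demand, supply_slots):
    """Pair queued demand with available supply slots, first come first served.

    Returns the (demand, slot) pairs that were executed; unmatched demand
    stays in its queue, waiting for supply.
    """
    executed = []
    while demand and supply_slots:
        executed.append((demand.popleft(), supply_slots.popleft()))
    return executed


# Hypothetical backlog (demand) and open Kanban slots (supply).
backlog = deque(["story-1", "story-2", "change-7"])
slots = deque(["dev-team-a", "dev-team-b"])
done = execute(backlog, slots)
```

With three items of demand and two open slots, two items are executed and one remains queued, which is the finest-grained view of demand exceeding supply.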

Part III chapter structure

The chapter structure of Part III can be visualized as in Part III Chapter structure.

Figure 133. Part III Chapter structure

In thinking about how organizations develop as they scale, it is helpful to consider various timeframes:

Ongoing execution is the actual day-to-day work, however conceived. At this point in our journey, the work includes a significant challenge of coordination (which we define and analyze in Chapter 7). It can include defined process activities, project deliverables, the flow of new product functionality, or ongoing improvement and governance. Ongoing execution is the “moment of truth” where estimation becomes actualized and supply meets demand, and at this stage of our journey it is increasing in complexity.

Investment decisions are required to charter programs, products, features, and components, even in those companies that may be moving away from a traditional project cycle. Investments are usually understood in terms of budget planning, which traditionally has driven the project funding cycle. They represent some statement of intent for a larger scope of work to be performed and/or sustained, based on the organizational platform, which should be able to support multiple investments.

Organizational changes may take years, and require considerable effort and thought if they are to succeed. One does not change organizational structures lightly or (hopefully) frequently. Employee tenure is in general even longer.

Culture takes longest and is most difficult to change; it easily outlasts both organizational forms and even individual employees coming and going. Both culture and organization have self-reinforcing feedback loops which add complexity to any deliberate attempts to transform them.

The delivery models

In Chapter 4 we introduced the "3 Ps":

  • Product management

  • Project management

  • Process management

It is important that you review them. Sometimes the concept of “program” is also used (see below). We will call these delivery models: they are organizing paradigms for getting work done. They may depend on each other, but each has a clear industry identity and an associated body of knowledge:

  • Product management has the Product Development and Management Association (PDMA) and authors like Steve Blank and Marty Cagan.

  • Program management has the AXELOS Managing Successful Programmes guidance.

  • Project management has the Project Management Body of Knowledge and the PRINCE2 guidance from AXELOS.

  • Process management has the BPMN and BPEL standards and authors like Geary Rummler, Roger Burlton, and Paul Harmon.

Product versus program management

Program management is a term seen in government efforts and military contracting to describe major efforts of uncertain duration and (sometimes) uncertain outcome. Product management is also uncertain in duration and outcome, and the industry does not clearly distinguish between the two. Some companies use concepts of both product and program management; others use one or the other. Stanley Portny describes the term as follows:

Program: This term can describe two different situations. First, a program can be a set of goals that gives rise to specific projects, but, unlike a project, a program can never be completely accomplished. For example, a health-awareness program can never completely achieve its goal (the public will never be totally aware of all health issues as a result of a health-awareness program), but one or more projects may accomplish specific results related to the program’s goal (such as a workshop on minimizing the risk of heart disease). Second, a program sometimes refers to a group of specified projects that achieve a common goal [214].

Where both terms are used, program management may be more about delivery and execution (shading into project management’s domain), while product management is more about vision and outcome.

We order the delivery models by their variability. What does that mean? Products and programs have the highest variability. Their outcome may differ considerably from the initial vision that drove them. Projects, in theory, should be reasonably plannable: their schedule and cost are managed in terms of “plan versus actual,” and differences, ideally, should be well controlled and understandable. Finally, process management strives to minimize variation, and in its most rigorous form uses statistical control to do so. If we matrix the delivery models with the timeframes we get Timeframes and delivery models.

Figure 134. Timeframes and delivery models

The relationships between the timeframes and delivery models are complex:

Investments are made in products first, which may or may not need projects and/or processes. Rigorously planned projects or detailed, repeatable processes are not, in fact, optimal for product discovery — a mistake the IT profession has fallen into over and over again. Products are best thought of in terms of discovery and empirical hypothesis-testing. If the hypothesis fails, the investment should be cancelled. So, the “product” concept is both shorter and longer lived than the average project, which is typically understood on an annual cycle.

Project management also may take place without processes, as it may be based on one-time “deliverables” that are not repeatedly produced.

Finally, to support a process requires portfolio investment and organizational structure, but no project may ever be involved. Whether a product is implied by the existence of a process is an interesting question we will think about.

Clearly, we must think carefully about the relationships between these dimensions. That, in a nutshell, is the purpose of Part III.

Instructor’s note

We are inverting the usual plan-execute order on purpose, starting with execution and expanding from there. This inversion challenges the too-common assumption of “plan, then execute” (alternatively seen as “plan-build-run”). We discuss longer-horizon planning after we discuss execution because we must keep execution alive at all costs and cannot afford to shut it down while we go off and make plans for our new larger scale.

The demand-supply-execute model’s origins and thought process can be seen in a series of 4 blog posts starting with

7. Coordination

7.1. Introduction

coordination (n.) co-ordination, c. 1600, “orderly combination,” from French coordination (14c.) or directly from Late Latin coordinationem (nominative coordinatio), noun of action from past participle stem of Latin coordinare “to set in order, arrange,” from com- “together” (see com-) + ordinatio “arrangement,” from ordo “row, rank, series, arrangement” (see order (n.)). Meaning “action of setting in order” is from the 1640s; that of “harmonious adjustment or action,” especially of muscles and bodily movements, is from 1855.
— Online Etymology Dictionary
Agile software development methods were particularly designed to deal with change and uncertainty, yet they de-emphasize traditional coordination mechanisms such as forward planning, extensive documentation, specific coordination roles, contracts, and strict adherence to a pre-specified process [259].
— Diane E. Strode et al.
Coordination in co-located agile software development projects

Growth is presenting us with many challenges. But we can’t stop too long and think about how to handle it. We have to continue executing, as we scale up. The problem is like changing the tires on a moving car. It’s not easy.

We’ve been executing our objectives since our first day in the garage. As noted above, execution is whenever we meet demand with supply. An idea for a new feature, to deliver some digital value, is demand. The time we spend implementing the feature is supply. The two combined is execution. Sometimes it goes well; sometimes it doesn’t. Maintaining a tight feedback loop to continually assess our execution is essential.

As we grow into multiple teams and multiple products, we have more complex execution problems, requiring coordination. The fundamental problem is the “D-word:” dependency. Dependencies are why we coordinate (work with no dependencies can scale nicely along the AKF x-axis). But when we have dependencies (and there are various kinds) we need a wider range of techniques. Our one Kanban board is not sufficient to the task.

We need to consider the delivery models, as well (the “3 Ps": product, project, process, and now we’ve added program management). Decades of industry practice mean that people will tend to think in terms of these models and unless we are clear in our discussions about the nature of our work we can easily get pulled into non-value-adding arguments. To help our understanding, we’ll take a deeper look at process management, continuous improvement, and their challenges.

Instructor’s note on learning progression

The structure of Part III may be counter-intuitive. Usually, we think in terms of “plan, then execute.” However, this can lead to waterfall and deterministic assumptions. Starting the discussion with execution reflects the fact that a scaling company does not have time to “stop and plan.” Rather, planning emerges on top of the ongoing execution of the firm, in the interest of controlling and directing that execution across broader timeframes and larger scopes of work.

7.1.1. Chapter overview

In this chapter, we will cover:

  • Defining coordination

    • Coordination & dependencies

    • Concepts and techniques

    • Coordination effectiveness

  • Coordination, execution, and the delivery models

    • Product management and coordination

    • Project management as coordination

    • Process management as coordination

  • A deeper examination of process management

  • Process control and continuous improvement

There is a discussion of business process modeling fundamentals in the appendix.

7.1.2. Chapter learning objectives

  • Identify and describe dependencies, coordination, and their relationship

  • Describe the relationship of delivery models to coordination

  • Describe process management and its strengths and weaknesses as a coordination mechanism

  • Identify the problems of process proliferation with respect to execution and demand

  • Identify key individuals and themes in the history of continuous improvement

  • Describe the applicability of statistical process control to different kinds of processes

7.2. Defining coordination

7.2.1. Example: Scaling one product

Good team structure can go a long way toward reducing dependencies but will not eliminate them.
— Mike Cohn
Succeeding with Agile

What’s typically underestimated is the complexity and indivisibility of many large-scale coordination tasks.
— Gary Hamel
Preface to The Open Organization: Igniting Passion and Performance

Figure 135. Multiple feature teams, one product

We’ve defined execution as the point at which supply and demand are combined, and of course, we’ve been executing since the start of our journey. Now, however, we are executing in a more complex environment; we have started to scale along the AKF scaling cube y-axis, and we have multiple teams working on one product, multiple products, or both. Execution becomes more than just “pull another story off the Kanban board.” As multiple teams are formed (see Multiple feature teams, one product), dependencies arise, and we need coordination. The term “architecture” is likely emerging through these discussions. (We will discuss organizational structure directly in Chapter 9, and architecture in Chapter 12.)

As noted in the discussion of Amazon’s product strategy, some needs for coordination may be mitigated through the design of the product itself. This is why APIs and microservices are popular architecture styles. If the features and components have well defined protocols for their interaction and clear contracts for matters like performance, development on each team can move forward with some autonomy.
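A well-defined interaction contract between teams can be sketched as an explicit interface. The `InventoryService` protocol, its `units_in_stock` method, and the in-memory implementation below are hypothetical names invented for this example; they only show how a consuming team can code against a contract rather than against another team’s internals.

```python
from typing import Protocol


class InventoryService(Protocol):
    """Hypothetical contract published by an inventory team."""

    def units_in_stock(self, sku: str) -> int:
        ...


class InMemoryInventory:
    """One possible implementation; consumers never depend on it directly."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def units_in_stock(self, sku: str) -> int:
        return self._stock.get(sku, 0)


def can_order(inventory: InventoryService, sku: str, quantity: int) -> bool:
    """Ordering-team logic that relies only on the contract."""
    return inventory.units_in_stock(sku) >= quantity
```

As long as the contract holds, the inventory team can replace its implementation without coordinating each change with the ordering team.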

But at scale, complexity is inevitable. What happens when a given business objective requires a coordinated effort across multiple teams? For example, an e-commerce site might find itself overwhelmed by business success. Upgrading the site to accommodate the new demand might require distinct development work to be performed by multiple teams (see Coordinated initiative across timeframes).

As the quote from Gary Hamel above indicates, a central point of coordination and accountability is advisable. Otherwise, the objective is at risk. (It becomes “someone else’s problem.”) We will return to the investment and organizational aspects of multi-team and multi-product scaling in Chapters 8 and 9. For now, we will focus on dependencies and operational coordination.

Figure 136. Coordinated initiative across timeframes

7.2.2. A deeper look at dependencies

“[C]oordination can be seen as the process of managing dependencies among activities.”

What is a “dependency”? We need to think carefully about this. According to the definition above (from [178]), without dependencies, we do not need coordination. (We’ll look at other definitions of coordination in the next two chapters.) Diane Strode and her associates [259] have described a comprehensive framework for thinking about dependencies and coordination, including a dependency taxonomy, an inventory of coordination strategies, and an examination of coordination effectiveness criteria.

To understand dependencies, Strode et al. [260] propose the framework shown in Dependency taxonomy (from Strode) [60].

Table 13. Dependency taxonomy (from Strode)

| Type | Dependency | Description |
|---|---|---|
| Knowledge. A knowledge dependency occurs when a form of information is required for progress. | Requirement | Domain knowledge or a requirement is not known and must be located or identified. |
|  | Expertise | Technical or task information is known only by a particular person or group. |
|  | Task allocation | Who is doing what, and when, is not known. |
|  | Historical | Knowledge about past decisions is needed. |
| Task. A task dependency occurs when a task must be completed before another task can proceed. | Activity | An activity cannot proceed until another activity is complete. |
|  | Business process | An existing business process causes activities to be carried out in a certain order. |
| Resource. A resource dependency occurs when an object is required for progress. | Entity | A resource (person, place or thing) is not available. |
|  | Technical | A technical aspect of development affects progress, such as when one software component must interact with another software component. |

We can see examples of these dependencies throughout digital products. In the next section, we will talk about coordination techniques for managing dependencies.
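A task dependency, where one activity cannot proceed until another is complete, can be represented as a directed graph and ordered mechanically. The sketch below uses Python’s standard `graphlib` module (Python 3.9 or later); the activity names are hypothetical, loosely based on the e-commerce upgrade example.

```python
from graphlib import TopologicalSorter

# Hypothetical activities; each maps to the set of activities it must wait for.
depends_on = {
    "load test": {"upgrade database", "upgrade web tier"},
    "upgrade web tier": {"provision servers"},
    "upgrade database": {"provision servers"},
    "provision servers": set(),
}

# static_order() yields every activity after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
```

The two upgrade activities have no dependency on each other, so separate teams could perform them in parallel; only the first and last steps are fixed by the graph.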

7.2.3. Organizational tools and techniques

Where leveraging yellow stickies or index cards makes sense in conjunction with practices like big visible charts and co-location, such formats become ridiculous for a large constituency of challenging projects. When faced with these challenges, rather than proclaim that Agile won’t work or doesn’t scale, the preferable approach is to understand and acknowledge the nature of collaboration, the nature of distributed workflow, and the complexity of modern product development.
— Mark Kennaley
SDLC 3.0

Our previous discussion of work management was a simple, idealized flow of uniform demand (new product functionality, issues, etc.). Tasks, in general, did not have dependencies, or dependencies were handled through ad hoc coordination within the team. We also assumed that resources (people) were available to perform the tasks; resource contention, while it certainly may have come up, was again handled through ad hoc means. However, as we scale, simple Kanban and visual Andon are no longer sufficient, given the nature of the coordination we now require. We need a more diverse and comprehensive set of techniques.

The discussion of particular techniques is always hazardous. People will tend to latch on to a promising approach without full understanding. As noted by Craig Larman, the risk is one of cargo cult thinking in your process adoption [168 p. 44]. In Chapter 9 we will discuss the Mike Rother book Toyota Kata. Toyota does not implement any procedural change without fully understanding the “target operating condition”: the nature of the work and the desired changes to it.

As we scale up, we see that dependencies and resource management have become defining concerns. However, we retain our Lean Product Development concerns for fast feedback and adaptability, as well as a critical approach to the idea that complex initiatives can be precisely defined and simply executed through open-loop approaches. In this section, we will discuss some of the organizational responses (techniques and tools) that have emerged as proven responses to these emergent issues.

The table Coordination taxonomy (from Strode) uses the concept of artifact, which we introduced in Chapter 5. For our purposes here, an artifact is a representation of some idea, activity, status, task, request, or system. Artifacts can represent or describe other artifacts. Artifacts are frequently used as the basis of communication.

Strode et al. also provide a useful framework for understanding coordination mechanisms, excerpted and summarized into Coordination taxonomy (from Strode) [61].

Sidebar: Cargo cult thinking

Processes and practices are always at risk of being used without full understanding. This is sometimes called cargo cult thinking. What is a cargo cult?

During World War II, South Pacific native peoples had been exposed abruptly to modern technological society with the Japanese and U.S. occupations of their islands. Occupying forces would often provide food, tobacco, and luxuries to the natives to ease relations. After the war, various tribes were observed creating simulated airports and airplanes, and engaging in various rituals that superficially looked like air traffic signaling and other operations associated with a military air base.

On further investigation, it became clear that the natives were seeking more “cargo” and had developed a magical understanding of how goods would be delivered. By imitating the form of what they had seen, they hoped to recreate it.

In 1974, the noted physicist Richard Feynman gave a speech at Caltech in which he coined the immortal phrase “cargo cult science” [90]. His intent was to caution against activities which appear to follow the external form of science, but lack the essential understanding at its core. Similar analogies are seen in business and IT management, as organizations adopt tools and techniques because they have seen others do so, without having fundamental clarity about the problems they are trying to solve and how a given technique might specifically help.

As with many stories of this kind, there are questions about the accuracy of the original anthropological accounts and Western interpretations and mythmaking around what was seen. However, there is no question that “cargo cult thinking” is a useful cautionary metaphor.

Table 14. Coordination taxonomy (from Strode)
Strategy | Component | Definition
Structure | Proximity | Physical closeness of individual team members.
 | Availability | Team members are continually present and able to respond to requests for assistance or information.
 | Substitutability | Team members are able to perform the work of another to maintain time schedules.
Synchronization | Synchronization activity | Activities performed by all team members simultaneously that promote a common understanding of the task, process, and/or expertise of other team members.
 | Synchronization artifact | An artifact generated during synchronization activities.
Boundary spanning | Boundary spanning activity | Activities (team or individual) performed to elicit assistance or information from some unit or organization external to the project.
 | Boundary spanning artifact | An artifact produced to enable coordination beyond the team and project boundaries.
 | Coordinator role | A role taken by a project team member specifically to support interaction with people who are not part of the project team but who provide resources or information to the project.

The following sections expand the three strategies (structure, synchronization, boundary spanning) with examples.


Structure

Don Reinertsen proposes “The Principle of Colocation,” which asserts that “Colocation improves almost all aspects of communication” [221 p. 230]. In order to scale this beyond one team, one logically needs what Mike Cohn calls “The Big Room” [67 p. 346].

In terms of communications, this has significant organizational advantages. Communications are as simple as walking over to another person’s desk or just shouting out over the room. It is also easy to synchronize the entire room, through calling for everyone’s attention. However, there are limits to scaling the “Big Room” approach:

  • Contention for key individuals’ attention

  • “All hands” calls for attention that actually interests only a subset of the room

  • Increasing ambient noise in the room

  • Distracting individuals from intellectually demanding work requiring concentration, driving multi-tasking and context-switching, and ultimately interfering with their personal sense of flow — a destructive outcome. (See [74] for more on flow as a valuable psychological state).

The tension between team coordination and individual focus will likely continue. It is an ongoing topic in facilities design.


Synchronization

If the team cannot work all the time in one room, perhaps they can at least be gathered periodically. There is a broad spectrum of synchronization approaches:

  • Ad hoc chats (in person or virtual)

  • Daily standups (e.g., from Scrum)

  • Weekly status meetings

  • Coordination meetings (e.g., Scrum of Scrums, see below)

  • Release kickoffs

  • Quarterly “all-hands” meetings

  • Cross-organizational advisory and review boards

  • Open Space inspired “unmeetings” and “unconferences”

All of them are essentially similar in approach and assumption: build a shared understanding of the work, objectives, or mission among smaller or larger sections of the organization, through limited-time face to face interaction, often on a defined time interval.

Cadenced approaches. When a synchronization activity occurs on a timed interval, this can be called a cadence. Sometimes, cadences are layered; for example, a daily standup, a weekly review, and a monthly Scrum of Scrums. Reinertsen calls this harmonic cadencing [221 pp. 190-191]. Harmonic cadencing (monthly, quarterly, and annual financial reporting) has been used in financial management for a long time.
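
As a rough illustration of layered cadences, the sketch below checks whether a set of meeting intervals nest cleanly, that is, whether each longer interval is a whole multiple of the next shorter one. This is one simple reading of “harmonic,” offered as an assumption rather than as Reinertsen's formal definition. The intervals are hypothetical and are expressed in working days, assuming a 5-day week and a 20-day month.

```python
def is_harmonic(intervals_in_days):
    """Return True if each longer interval is a whole multiple of the next shorter one."""
    ordered = sorted(intervals_in_days)
    return all(longer % shorter == 0 for shorter, longer in zip(ordered, ordered[1:]))


def events_on_day(day, cadences):
    """List the cadenced events that fall on a given working day (day 1 is the first day)."""
    return [name for name, interval in cadences.items() if day % interval == 0]


# Hypothetical cadences, in working days.
cadences = {"daily standup": 1, "weekly review": 5, "monthly Scrum of Scrums": 20}

print(is_harmonic(cadences.values()))  # True: 5 is a multiple of 1, and 20 of 5
print(events_on_day(5, cadences))      # the standup and the weekly review coincide
print(events_on_day(20, cadences))     # all three events coincide
print(is_harmonic([1, 5, 12]))         # False: 12 is not a multiple of 5
```

When the intervals nest, the longer-cycle meetings always land on a day that already has the shorter-cycle ones, which keeps the overall calendar predictable.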

Boundary spanning
The philosophy is that you push the power of decision making out to the periphery and away from the center. You give people the room to adopt, based on their experiences and expertise. All you ask is that they talk to one another and take responsibility. That is what works.
— Atul Gawande
The Checklist Manifesto

Examples of boundary-spanning liaison and coordination structures include:

  • Shared team members

  • Integration teams

  • Coordination roles

  • Communities of practice

  • Scrum of scrums

  • Submittal schedules

  • API standards

  • RACI/ECI decision rights

Shared team members are suggested when two teams have a persistent interface requiring focus and ownership. When a product has multiple interfaces that emerge as a problem requiring focus, an integration team may be called for. Coordination roles can include project and program managers, release train conductors, and the like. Communities of practice will be introduced in Chapter 9 when we discuss the Spotify model. Considered here, they may also play a coordination role as well as a practice development/maturity role.

Finally, the idea of a Scrum of Scrums is essentially a representative or delegated model, in which each Scrum team sends one individual to a periodic coordination meeting where matters of cross-team concern can be discussed and decisions made [67], Chapter 17.

Cohn cautions: “A scrum of scrums meeting will feel nothing like a daily scrum despite the similarities in names. The daily scrum is a synchronization meeting: individual team members come together to communicate about their work and synchronize their efforts. The scrum of scrums, on the other hand, is a problem-solving meeting and will not have the same quick, get-in-get-out tone of a daily scrum [67 p. 342].”

Another technique mentioned in The Checklist Manifesto [105] is the submittal schedule. Some work, while complex, can be planned to a high degree of detail (i.e., the “checklists” of the title). However, emergent complexity requires a different approach, because no checklist can anticipate all eventualities. To handle emergent complexity, the coordination focus must shift to structuring the right communications. In examining modern construction industry techniques, Gawande noted the concept of the “submittal schedule,” which “didn’t specify construction tasks; it specified communication tasks” (p. 65, emphasis supplied). With the submittal schedule, the project manager tracks that the right people are talking to each other to resolve problems, a key change in focus from activity-centric approaches.

We have previously discussed APIs in terms of Amazon's product strategy. They are also important as a product scales into multiple components and features; API standards can be seen as a boundary-spanning mechanism.

The above discussion is by no means exhaustive. A wealth of additional techniques relevant for digital professionals is to be found in [168, 67]. New techniques are continually emerging from the front lines of the digital profession; the interested student should consider attending industry conferences such as those offered by the Agile Alliance.

In general, the above approaches imply synchronized meetings and face to face interactions. When the boundary-spanning approach is based on artifacts (often a requirement for larger, decentralized enterprises), we move into the realms of process and project management. Approaches based on routing artifacts into queues often receive criticism for introducing too much latency into the product development process. When artifacts such as work orders and tickets are routed for action by independent teams, prioritization may be arbitrary (not based on business value, e.g., cost of delay). Sometimes the work must flow through multiple queues in an uncoordinated way. Such approaches can add dangerous latency to high-value processes, as we warned in Chapter 5. We will look in more detail at process management in a future section.
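
The effect of arbitrary prioritization can be shown with a small worked example. The sketch below compares working a queue in arrival order with ordering it by cost of delay divided by duration, a heuristic sometimes called weighted shortest job first. The heuristic is brought in here for illustration; the text above refers only to cost of delay in general. The tickets, their costs, and their durations are invented, and a single team is assumed to work one ticket at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    name: str
    cost_of_delay_per_day: float  # hypothetical business cost for each day of waiting
    duration_days: float          # hypothetical effort needed to complete the ticket


def total_delay_cost(queue):
    """Total cost incurred if one team works the queue strictly in order."""
    elapsed = 0.0
    cost = 0.0
    for ticket in queue:
        elapsed += ticket.duration_days
        cost += ticket.cost_of_delay_per_day * elapsed
    return cost


def by_cost_of_delay(queue):
    """Order tickets by cost of delay divided by duration, highest first."""
    return sorted(queue, key=lambda t: t.cost_of_delay_per_day / t.duration_days, reverse=True)


arrival_order = [
    Ticket("A", cost_of_delay_per_day=1, duration_days=5),
    Ticket("B", cost_of_delay_per_day=10, duration_days=1),
    Ticket("C", cost_of_delay_per_day=3, duration_days=2),
]

print(total_delay_cost(arrival_order))                    # 89.0
print(total_delay_cost(by_cost_of_delay(arrival_order)))  # 27.0
```

In this invented case the same three tickets cost more than three times as much when handled first-come, first-served, which is the kind of hidden latency cost the queue-based approaches are criticized for.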

The reality of microservices and loose coupling: the case of Chubby at Google

The Agile manifesto idea that architecture can “emerge” without explicit coordination (e.g., across a set of smaller-grained services, or “microservices”) is attractive. Coordinating designs across products, services, features and/or components is expensive, and ideally, products should be able to evolve independently.

Mike Burrows of Google provides a detailed description of the Chubby lock service [46], which is a prototypical example of a broadly-available internal service usable by a wide variety of other products.

The purpose of a lock service is to “allow its clients to synchronize their activities and to agree on basic information about their environment.” Chubby was built from the start with objectives of reliability, availability to a “moderately large set of clients,” and ease of understanding. Burrows notes that even with such a cohesive and well-designed internal service, they still encounter coordination problems requiring human intervention. Such problems include:

  • Use (“abuse”) in unintended ways by clients

  • Invalid assumptions by clients regarding Chubby’s availability

Because of this, the Chubby team (at least at the time the case study was written) instituted a review process when new clients wished to start using the lock manager. In terms of this chapter’s topic, this means that someone on the product team needed to coordinate the discussions with the Chubby team and ensure that any concerns were resolved. This might conceivably have involved multiple iterations and reviews of designs describing intended use.

In short, even the most sophisticated microservice environments may have a dependency on human coordination across the teams.

7.2.4. Coordination effectiveness

Diane Strode and her colleagues propose that coordination effectiveness can be understood as the following taxonomy:

  • Implicit

    • Know why (shared goal)

    • Know what is going on and when

    • Know what to do and when

    • Know who is doing what

    • Know who knows what

  • Explicit

    • Right place

    • Right thing

    • Right time

Coordinated execution means that teams have a solid common ground of what they are doing and why, who is doing it, when to do it, and where to go for information. They also have the material outcomes of the right people being in the right place doing the right thing at the right time. These coordination objectives must be achieved with a minimum of waste, and with a speed supporting an OODA loop tighter than the competition’s. Indeed, this is a tall order!

7.3. Coordination, execution, and the delivery models

If we take the strategies proposed by Strode et al. and think of them as three orthogonal dimensions, we can derive another useful three-dimensional figure (Cube derived from Strode):

Strode Cube
Figure 137. Cube derived from Strode
  • At the origin point, we have practices like face to face meetings at various scales.

  • Geographically distant, immediate coordination is achieved with messaging and other forms of telecommunications.

  • Co-located but asynchronous coordination is achieved through shared artifacts like Kanban boards.

  • Distant and asynchronous coordination again requires some kind of telecommunications.

The Z-axis is particularly challenging, as it represents scaling from a single to multiple and increasingly organizationally distant teams. Where a single team may be able to maintain a good sense of common ground even when geographically distant, or working asynchronously, adding the third dimension of organizational boundaries is where things get hard. Larger-scale coordination strategies include:

  • Operational digital processes (Chapter 6)

    • Change management

    • Incident management

    • Request management

    • Problem management

    • Release management

  • Specified decision rights

  • Projects and project managers (Chapter 8)

  • Shared services and expertise (Chapter 8)

  • Organization structures (Chapter 9)

  • Cultural norms (Chapter 9)

  • Architecture standards (Chapters 11 and 12)

All of these coping mechanisms risk compromising to some degree the effectiveness of co-located, cross-functional teams. Remember that the high-performing product team is likely the highest-value resource known to the modern organization. Protecting the value of this resource is critical as the organization scales up. The challenge is that models for coordinating and sustaining complex digital services are not well understood. IT organizations have tended to fall back on older supply-chain thinking, with waterfall-derived ideas that work can be sequenced and routed between teams of specialists. (More on this to come in Chapter 9).

We recommend you review the definitions of the “3 P’s”: product, project, and process management.

7.3.1. Product management release trains

Where project and process management are explicitly coordination-oriented, product management is broader and focused on outcomes. As noted previously, it might use either project or process management to achieve its outcomes, or it might not.

Release management was introduced in Part I, and has remained a key concept we’ll return to now. Release management is a common coordination mechanism in product management, even in environments that don’t otherwise emphasize processes or projects. At scale, the concept of a “release train” is seen. The Scaled Agile Framework considers it the “primary value delivery construct” [235].

The train is a cadenced synchronization strategy. It “departs the station” on a reliable schedule. As with Scrum, date and quality are fixed, while the scope is variable. The Scaled Agile Framework (SAFe) emphasizes that “being on the train” in general is a full-time responsibility, so the train is also a temporary organizational or programmatic concept. The release train “engineer” or similar role is an example of the coordinator role seen in the Strode coordination tools and techniques matrix.

The release train is a useful concept for coordinating large, multi-team efforts, and is applicable in environments that have not fully adopted Agile approaches. As author Johanna Rothman notes, “You can de-scope features from a particular train, but you can never allow a train to be late. That helps the project team focus on delivering value and helps your customer become accustomed to taking the product on a more frequent basis” [230].
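
The “fixed date, variable scope” rule can be sketched in a few lines. In the hypothetical example below, the departure date is an input that the function never changes; features that are not ready are simply deferred to the next train. The `Feature` record, the `ready` flag, and the feature names are assumptions made for illustration and are not drawn from SAFe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Feature:
    name: str
    ready: bool  # hypothetical flag: built, tested, and meeting the quality bar


def depart(train_date, candidates):
    """Ship whatever is ready on the fixed date; everything else waits for the next train."""
    shipped = [f.name for f in candidates if f.ready]
    deferred = [f.name for f in candidates if not f.ready]
    return {"date": train_date, "shipped": shipped, "deferred": deferred}


# Illustrative features only.
candidates = [
    Feature("saved carts", ready=True),
    Feature("gift cards", ready=False),
    Feature("faster search", ready=True),
]

release = depart(date(2024, 3, 1), candidates)
print(release["shipped"])   # ['saved carts', 'faster search']
print(release["deferred"])  # ['gift cards']
```

The design choice worth noticing is what is absent: there is no code path that moves the date.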

7.3.2. Project management as coordination

We’ll talk about project management as an investment strategy in the next chapter. In this chapter, we look at it as a coordination strategy. If you are completely unfamiliar with project management, see the brief introduction in the appendix and the PMBOK summary.

Project management adds concerns of task ordering and resource management, for efforts typically executed on a one-time basis. Project management groups together a number of helpful coordination tools, which is why it is widely used. These tools include:

  • Sequencing tasks

  • Managing task dependencies

  • Managing resource dependencies of tasks

  • Managing overall risk of interrelated tasks

  • Planning complex activity flows to complete at a given time
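
The first two tools in the list above, sequencing tasks and managing task dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The tasks and durations below are hypothetical, and the sketch assumes unlimited resources, so it does not model resource dependencies or risk.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (duration in days, set of tasks that must finish first).
tasks = {
    "order hardware": (3, set()),
    "configure image": (2, set()),
    "install hardware": (1, {"order hardware"}),
    "deploy image": (1, {"install hardware", "configure image"}),
    "train staff": (2, {"deploy image"}),
}


def schedule(tasks):
    """Return (order, earliest_finish) honoring every task dependency."""
    graph = {name: deps for name, (_, deps) in tasks.items()}
    # static_order() yields each task only after all of its predecessors.
    order = list(TopologicalSorter(graph).static_order())
    earliest_finish = {}
    for name in order:
        duration, deps = tasks[name]
        start = max((earliest_finish[d] for d in deps), default=0)
        earliest_finish[name] = start + duration
    return order, earliest_finish


order, finish = schedule(tasks)
print(order)
print(max(finish.values()))  # 7: the shortest possible overall duration in days
```

If the dependencies contain a loop, `TopologicalSorter` raises `graphlib.CycleError`, which is a useful early warning that the plan as stated cannot be executed.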

However, project management also has a number of issues:

  • Projects are by definition temporary, while products may last as long as there is market demand.

  • Project management methodology, with its emphasis on predictability, scope management, and change control often conflicts with the product management objective of discovering information (see the discussion of Lean Product Development).

(But not all large management activities involve the creation of new information! Consider the previous example of upgrading the RAM in 80,000 POS terminals in 2000 stores).

The project paradigm has a benefit in its explicit limitation of time and money, and the sense of urgency this creates. In general, scope, execution, limited resources, deadlines, and dependencies exist throughout the digital business. A product manager with no understanding of these issues, or tools to deal with them, will likely fail. Product managers should, therefore, be familiar with the basic concepts of project management. However, the way in which project management is implemented, the degree of formality, will vary according to need.

A project manager may still be required, to facilitate discussions, record decisions, and keep the team on track to its stated direction and commitments. Regardless of whether the team considers itself “Agile,” people are sometimes bad at taking notes or being consistent in their usage of tools such as Kanban boards and standups.

It is also useful to have a third party who is knowledgeable about the product and its development, yet has some emotional distance from its success. This can be a difficult balance to strike, but the existence of the role of Scrum coach is indicative of its importance.

We will take another look at project management, as an investment management approach, in Chapter 8.

7.3.3. Decision rights

There are two ways of communicating boundaries inside the organization. Unfortunately, the most common approach is trial and error. Some people refer to this as discovering the location of the invisible electric fences. There is a better approach for setting boundaries. It consists of clarifying expectations for the entire team regarding their authority to make decisions. This is done by making a list of specific decisions and discussing them with the functional managers and the team. A two-hour meeting early in the program can have enormous impact on clarifying intended operating practices to the team [220 pp. 106-108].
— Don Reinertsen
Managing the Design Factory

Approvals are a particular form of activity dependency. Since approvals tend to flow upwards in the organizational hierarchy to busy individuals, they can be a significant source of delay. As Reinertsen points out [220 p. 108], discovering “invisible electric fences” by trial and error is slow and also reduces human initiative. One boundary spanning coordination artifact an organization can produce in response is a statement of decision rights, for example, a RACI analysis. RACI stands for:

  • Responsible

  • Accountable (sometimes understood as Approves)

  • Consulted

  • Informed

A RACI analysis is often used when accountability must be defined for complex activities. It is used in process management, and also is seen in project management and general organizational structure.

Table 15. RACI analysis
Team member Product owner Chief product owner

Change interface affecting two modules




Change interface affecting more than two modules




Hire new team member




Some agile authors [62] call for an “ECI” matrix, with the “E” standing for empowered, defined as both Accountable and Responsible.
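
A decision rights matrix is easy to represent and check as data. The sketch below uses the decisions and roles from the RACI table above, but the letters assigned to each role are invented for illustration. The validation rule, exactly one Accountable and at least one Responsible per decision, is a commonly used convention and is assumed here rather than taken from the text.

```python
# Hypothetical RACI assignments; the letters are invented for this sketch.
raci = {
    "Change interface affecting two modules": {
        "Team member": "R", "Product owner": "A", "Chief product owner": "I",
    },
    "Change interface affecting more than two modules": {
        "Team member": "R", "Product owner": "C", "Chief product owner": "A",
    },
    "Hire new team member": {
        "Team member": "C", "Product owner": "R", "Chief product owner": "A",
    },
}


def check_raci(matrix):
    """Return a list of problems found in the matrix (empty if none)."""
    problems = []
    for decision, roles in matrix.items():
        letters = list(roles.values())
        if letters.count("A") != 1:
            problems.append(f"{decision}: expected exactly one 'A', found {letters.count('A')}")
        if "R" not in letters:
            problems.append(f"{decision}: no one is Responsible")
    return problems


def who_decides(matrix, decision):
    """Look up the Accountable role for a decision."""
    return next(role for role, letter in matrix[decision].items() if letter == "A")


print(check_raci(raci))                           # []
print(who_decides(raci, "Hire new team member"))  # Chief product owner
```

Writing the matrix down in this form makes the “invisible electric fences” visible: anyone can look up who decides before starting the work.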

7.3.4. Process management as coordination

We discussed the emergence of process management in Chapter 5, and in Chapter 6 the basic digital processes of Change, Incident, Problem, and Request management. You should also review the process modeling overview in the appendix.

As we saw in the Strode dependency taxonomy, waiting on a business process is a form of dependency. But business processes are more than just dependency sources and obstacles; they themselves are a form of coordination. In Strode’s terms, they are a boundary spanning activity. It is ironic that a coordination tool itself might be seen as a dependency and blockage to work; this shows at least the risk of assuming that all problems can or should be solved by tightly-specified business processes!

Like project management, process management is concerned with ordering, but less so with the resource load (more on this to come), and more with repeatability and ongoing improvement. The concept of process is often contrasted with that of function or organization. Process management’s goal is to drive repeatable results across organizational boundaries. As we know from our discussion of product management, developing new products is not a particularly repeatable process. The Agile movement arose as a reaction to mis-applied process concepts of “repeatability” in developing software. These concerns remain. However, this book covers more than development. We are interested in the spectrum of digital operations and effort that spans from the unique to the highly repeatable. There is an interesting middle ground, of processes that are at least semi-repeatable. Examples often found in the large digital organization include:

  • Assessing, approving, and completing changes

  • End user equipment provisioning

  • Resolving incidents and answering user inquiries

  • Troubleshooting problems

We will discuss a variety of such processes, and the pros and cons of formalizing them, in the Chapter 9 section on industry frameworks. In Chapter 10, we will discuss IT governance in depth. The concept of “control” is critical to IT governance, and processes often play an important role in terms of control.

Just as the traditional IT project is under pressure, there are similar challenges for the traditional IT process. DevOps and continuous delivery are eroding the need for formal change management. Consumerization is challenging traditional internal IT provisioning practices. And self-service help desks are eliminating some traditional support activities. Nevertheless, any rumors of an “end to process” are probably greatly exaggerated. Measurability remains a concern; the Lean philosophy underpinning much Agile thought emphasizes measurement. There will likely always be complex combinations of automated, semi-automated, and manual activity in digital organizations. Some of this activity will be repeatable enough that the “process” construct will be applied to it.

7.3.5. Projects and processes

Project management and process management interact in two primary ways, as Process and project illustrates:

  • Projects often are used to create and deploy processes. A large system implementation (e.g., of an Enterprise Resource Planning module such as Human Resource Management) will often be responsible for process implementation including training.

  • As environments mature, product and/or project teams require process support.

process and project
Figure 138. Process and project

As Richardson notes in Project Management Theory and Practice, “there are many organizational processes that are needed to optimally support a successful project.” [222] For example, the project may require predictable contractor hiring, or infrastructure provisioning, or security reviews. The same is true for product teams that may not be using a “project” concept to manage their work. To the extent these are managed as repeatable, optimized processes, the risk is reduced. The trouble is when the processes require prioritization of resources to perform them. This can lead to long delays, as the teams performing the process may not have the information to make an informed prioritization decision. Many IT professionals will agree that the relationship between application and infrastructure teams has been contentious for decades because of just this issue. One response has been increasing automation of infrastructure service provisioning (private and external cloud).

7.4. Why process management?

Is a firm a collection of activities or a set of resources and capabilities? Clearly, a firm is both. But activities are what firms do, and they define the resources and capabilities that are relevant.
— Michael Porter
Competitive Advantage
Individuals and interactions over processes and tools.
— Manifesto for Agile Software Development

A common dictionary definition of process is “a systematic series of actions directed to some end . . . a continuous action, operation, or series of changes taking place in a definite manner.” We saw the concept of “process” start to emerge in Chapter 5, as work became more specialized and repeatable and our card walls got more complicated.

We’ve discussed work management, which is an important precursor of process management. Work management is less formalized; a wide variety of activities are handled through flexible Kanban-style boards or “card walls” based on the simplest “process” of:

  • To do

  • Doing

  • Done

However, the simple card wall/Kanban board can easily become much more complex, with the addition of swimlanes and additional columns, including holding areas for blocked work. As we discussed in Chapter 5, when tasks become more routine, repeatable, and specialized, formal process management starts to emerge. Process management starts with the fundamental capability for coordinated work management, and refines it much further.
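
The step from a simple card wall toward a defined process can be sketched as a small set of allowed moves between columns. In the hypothetical board below, a “Blocked” holding area has been added to the basic three columns, and any move the workflow does not allow is rejected. The `Board` class, the transition rules, and the card name are assumptions made for illustration.

```python
# Allowed moves between columns; "Blocked" is a hypothetical holding area.
TRANSITIONS = {
    "To do": {"Doing"},
    "Doing": {"Done", "Blocked"},
    "Blocked": {"Doing"},
    "Done": set(),
}


class Board:
    def __init__(self):
        self.cards = {}  # card name -> current column

    def add(self, card):
        """New cards always start in the first column."""
        self.cards[card] = "To do"

    def move(self, card, column):
        """Move a card, rejecting any move the workflow does not allow."""
        current = self.cards[card]
        if column not in TRANSITIONS[current]:
            raise ValueError(f"cannot move {card!r} from {current} to {column}")
        self.cards[card] = column


board = Board()
board.add("fix login bug")
board.move("fix login bug", "Doing")
board.move("fix login bug", "Blocked")  # e.g., waiting on another team
board.move("fix login bug", "Doing")
board.move("fix login bug", "Done")
print(board.cards)  # {'fix login bug': 'Done'}
```

Each added column or rule makes the board a little more like a formal process definition, which is the progression this section describes.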

Process, in terms of “business process,” has been a topic of study and field for professional development for many years. Pioneering Business Process Management (BPM) authors such as Michael Hammer [115], and Geary Rummler [234] have had an immense impact on business operations, with concepts such as Hammer’s Business Process Re-engineering (BPR). BPR initiatives are intended to identify waste and streamline processes, eliminating steps that no longer add value. BPR efforts often require new or enhanced digital systems.

In the Lean world, value stream mapping represents a way of understanding the end to end flow of value, typically in a manufacturing or supply chain operational context [229]. Toyota considers a clear process vision, or “target condition,” to be the most fundamental objective for improving operations [228], Chapters 5 and 6. Designing processes, improving them, and using them to improve overall performance is an ongoing activity in most, if not all organizations.

In your company, work has been specializing. A simple card-based Kanban approach is no longer sufficient. You are finding that some work is repetitive, and you need to remember to do certain things in certain orders. For example, a new Human Resources (HR) manager was hired and decided that a sticky note of “hire someone new for us” was not sufficient. As she pointed out to the team, hiring employees was a regulated activity, with legal liability, requiring confidentiality, that needed to follow a defined sequence of tasks:

  • Establishing the need and purpose for the position

  • Soliciting candidates

  • Filtering the candidates

  • Selecting a final choice

  • Registering that new employee in the payroll system

  • Getting the new employee set up with benefits providers (insurance, retirement, etc.).

  • Getting the new employee working space, equipment, and access to systems

  • Training the new employee in organizational policies, as well as any position-specific orientation

The sales, marketing, and finance teams have similarly been developing ordered lists of tasks that are consistently executed as end-to-end sequences. And even in the core digital development and operations teams, they are finding that some tasks are repetitive and need documentation so they are performed consistently.

Your entire digital product pipeline may be called a “process.” From initial feature idea through production, you seek a consistent means for identifying and implementing valuable functionality. Sometimes this work requires intensive, iterative collaboration and is unpredictable (e.g., developing a user interface); sometimes, the work is more repeatable (e.g., packaging and releasing tested functionality).

You’re hiring more specialized people with specialized backgrounds. Many of them enter your organization and immediately ask process questions:

  • What’s your Security process?

  • What’s your Architecture process?

  • What’s your Portfolio process?

You’ve not had these conversations before. What do they mean by these different “processes?” They seem to have some expectation based on their previous employment, and if you say “we don’t have one” they tend to be confused. You’re becoming concerned that your company may be running some risk, although you also are wary that “process” might mean slowing things down, and you can’t afford that.

However, some team members are cautious of the word “process.” The term “process police” comes up in an unhappy way.

“Are we going to have auditors tracking whether we filled out all our forms correctly?” one asks.

“We used to get these ‘process consultants’ in at my old company, and they would leave piles of three-ring binders on the shelf that no one ever looked at,” another person says.

“I can’t write code according to someone’s recipe!” a third says with some passion, and to general agreement from the developers in the room.

The irony is that digital products are based on process automation. The idea that certain tasks can be done repeatably and at scale through digitization is fundamental to all use of computers. The digital service is fundamentally an automated process, one that can be intricate and complicated. That’s what computers do. But, process management also spans human activities, and that’s where things get more complex.

Processes are how we ensure consistency, repeatability, and quality. You get expected treatment at banks, coffee shops, and dentists because they follow processes. IT systems often enable processes – a mortgage origination system is based on IT software that enforces a consistent process. IT management itself follows certain processes, for all the same reasons as above.

However, processes can cause problems. Like project management, the practice of process management is under scrutiny in the new Lean and Agile-influenced digital world. Processes imply queues, and in digital and other product development-based organizations, this means invisible work in process. For every employee you hire who expects you to have processes, another will have had bad process experiences at previous employers. Nevertheless, process remains an important tool in your toolkit for organization design.

Process is a broad concept used throughout business operations. The coverage here is primarily about process as applied to the digital organization. There is a bit of a recursive paradox here; in part, we are talking about the process by which business processes are analyzed and sometimes automated. By definition, this overall “process” (you could call it a meta-process) cannot be made too prescriptive or predictable.

The concept of “process” is important and will persist through digital transformation. We need a robust set of tools to discuss it. This chapter will break the problem into a lifecycle of:

  • Process conception

  • Process content

  • Process execution

  • Process improvement

Although we don’t preface these topics with “Agile” or “Lean,” bringing these together with related perspectives is the intent of this chapter.

7.4.1. Process conception

“Many companies have at least one dysfunctional area. This may be the “furniture police” who won’t let programmers rearrange furniture to facilitate pair programming. Or it may be a purchasing group that takes six weeks to process a standard software order. In any event, these types of insanity get in the way of successful projects.”
— Mike Cohn
Succeeding with Agile: Software Development Using Scrum

Processes can generate various emotional reactions:

“Dysfunctional! Insanity!” (as above)

“Follow the process!”

“What bureaucracy!”

“Don’t create the 'Process Police'!”

“I am an IT Service Management professional. I believe in the ITIL framework!”

“I don’t write code on an assembly line!”

Such reactions are commonplace in social media (and even well-regarded professional books), but we need a more objective and rational approach to understand the pros and cons of processes. We have seen a number of neutral concepts towards this end from authors such as Don Reinertsen and Diane Strode:

A process is a technique, a tool, and no technique should be implemented without a thorough understanding of the organizational context. Nor should any technique be implemented without rigorous, disciplined follow-up as to its real effects, both direct and indirect. Many of the issues with process come from a cultural failure to seek an understanding of the organization needs in objective terms such as these. We’ll think about this cultural failure more in the Chapter 9 discussion of Toyota Kata.

A skeptical and self-critical, “go and see” approach is, therefore, essential. Too often, processes are instituted in reaction to the last crisis, imposed top down, and rarely evaluated for effectiveness. Allowing affected parties to lead a process re-design is a core Lean principle (kaizen). On the other hand, uncoordinated local control of processes can also have destructive effects as discussed below.

7.4.2. Process execution

Since our initial discussions in Chapter 5 on Work Management, we find ourselves returning full circle. Despite the various ways in which work is conceived, funded, and formulated, at the end “it’s all just work.” The digital organization must retain a concern for the “human resources” (that is, people) who find themselves at the mercy of:

The Lean movement manages through minimizing waste and over-processing. This means both removing unnecessary steps from processes and eliminating unnecessary processes completely when required. Correspondingly, the processes that remain should have high levels of visibility. They should be taken with the utmost seriousness, and their status should be central to most people’s awareness. (This is the purpose of Andon.)

From workflow tools to collaboration and digital exhaust. One reason process tends to generate friction and be unpopular is the poor usability of workflow tools. Older tools tend to present myriads of data fields to the user and expect a high degree of training. Each state change in the process is supposed to be logged and tracked by having someone sign in to the tool and update status manually.

By contrast, modern workflow approaches take full advantage of mobile platforms and integration with technology like chat rooms and ChatOps. Mobile development imposes higher standards for user experience (UX) design, which makes tracking workflow somewhat easier. Integrated software pipelines that integrate application lifecycle management and/or project management with source control and build management are increasingly gaining favor. For example:

  1. A user logs a new feature request in the Application Lifecycle Management (ALM) tool

  2. When the request is assigned to a developer, the tool automatically creates a feature branch in the source control system for the developer to work on

  3. The developer writes tests and associated code and merges changes back to the central repository once tests are passed successfully

  4. The system automatically runs build tests

  5. The ALM tool is automatically updated accordingly with completion if all tests pass

See also the previous discussion of ChatOps, which similarly combines communication and execution in a low-friction manner, while providing rich digital exhaust as an audit trail.

In general, the idea is that we can understand digital processes not through painful manual status updates, but rather through their digital exhaust — the data byproducts of people performing the value-add day-to-day work, at speed and with the flow instead of constant delays for approvals and status updates.
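To make the idea of digital exhaust concrete, here is a minimal sketch in Python. It assumes a hypothetical set of event records of the kind an ALM tool, source control system, and build server might emit on their own; the field names, event names, and timestamps are all invented for illustration and do not correspond to any particular product. The point is only that status and elapsed time can be derived from such records, with no one signing in to update a ticket.

```python
from datetime import datetime

# Hypothetical event records emitted automatically by the tools in the pipeline.
# All identifiers, event names, and timestamps are invented for illustration.
EVENTS = [
    {"item": "FEAT-1", "source": "alm", "event": "request_logged", "at": "2024-03-01T09:00"},
    {"item": "FEAT-1", "source": "alm", "event": "assigned", "at": "2024-03-01T11:00"},
    {"item": "FEAT-1", "source": "scm", "event": "branch_created", "at": "2024-03-01T11:00"},
    {"item": "FEAT-1", "source": "scm", "event": "merged", "at": "2024-03-04T15:00"},
    {"item": "FEAT-1", "source": "build", "event": "tests_passed", "at": "2024-03-04T15:30"},
]

# Assumed mapping from each recorded event to the workflow status it implies.
STATUS_FOR_EVENT = {
    "request_logged": "new",
    "assigned": "in progress",
    "branch_created": "in progress",
    "merged": "in test",
    "tests_passed": "done",
}


def summarize(events, item):
    """Derive current status and elapsed hours for one work item from its events."""
    # ISO 8601 timestamps in a uniform format sort correctly as plain strings.
    mine = sorted((e for e in events if e["item"] == item), key=lambda e: e["at"])
    first = datetime.fromisoformat(mine[0]["at"])
    last = datetime.fromisoformat(mine[-1]["at"])
    return {
        "status": STATUS_FOR_EVENT[mine[-1]["event"]],
        "elapsed_hours": (last - first).total_seconds() / 3600,
    }


summary = summarize(EVENTS, "FEAT-1")
print(summary)
```

In a real pipeline the events would arrive from tool integrations rather than a hard-coded list, but the reporting logic would have the same shape: the audit trail is a byproduct of the work, not an extra task.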

7.4.3. Measuring process

One of the most important reasons for repeatable processes is so that they can be measured and understood. Repeatable processes are measured in terms of:

  • Speed

  • Effort

  • Quality

  • Variation

  • Outcomes

These are the most general categories; each must be defined much more specifically depending on the process. Operations (often in the form of business processes) generate data, and data can be aggregated and reported on. Such reporting serves as a form of feedback for management and even governance. Examples of metrics might include:

  • Quarterly sales as a dollar amount

  • Percentage of time a service or system is available

  • Number of successful releases or pushes of code (new functionality)

Measurement is an essential aspect of process management but must be carefully designed. Measuring process can have unforeseen results. Process participants will behave according to how the process is measured. If a help desk operator is measured and rated on how many calls they process an hour, the quality of those interactions may suffer. It is critical that any process “key performance indicator” be understood in terms of the highest possible business objectives. Is the objective truly to process as many calls as possible? Or is it to satisfy the customer so they need not turn to other channels to get their answers?
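The help desk example can be sketched in a few lines of Python. The call records below are invented, and "first-contact resolution" is used here as one assumed stand-in for the customer-satisfaction objective; your own process would need its own definition. The sketch shows how two operators can rank in opposite order depending on which measure is chosen.

```python
# Hypothetical call records: (handling time in minutes, resolved on first contact).
# All figures are invented for illustration.
OPERATOR_A = [(4, False), (5, True), (3, False), (4, False)]
OPERATOR_B = [(10, True), (12, True), (8, True), (10, False)]


def calls_per_hour(calls):
    """Throughput: how many calls of average length fit into one hour."""
    average_minutes = sum(minutes for minutes, _ in calls) / len(calls)
    return 60 / average_minutes


def first_contact_resolution(calls):
    """Share of calls where the customer did not need to come back."""
    return sum(1 for _, resolved in calls if resolved) / len(calls)


for name, calls in (("A", OPERATOR_A), ("B", OPERATOR_B)):
    print(name, calls_per_hour(calls), first_contact_resolution(calls))
```

Operator A looks better on calls per hour and Operator B looks better on resolution. Which one is "performing" depends entirely on which number the business objective actually calls for.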

A variety of terms and practices exist in process metrics and measurement, such as:

  • The Balanced Scorecard

  • The concept of a metrics hierarchy

  • Leading versus lagging indicators

The Balanced Scorecard is a commonly-seen approach for measuring and managing organizations. First proposed by Kaplan and Norton [148] in the Harvard Business Review, the Balanced Scorecard groups metrics into the following subject areas:

  • Financial

  • Customer

  • Internal business processes

  • Learning and growth

Metrics can be seen as “lower” versus “higher” level. For example, the metrics from a particular product might be aggregated into a hierarchy with the metrics from all products, to provide an overall metric of product success. Some metrics are perceived to be of particular importance for business processes, and thus may be termed Key Performance Indicators. Metrics can indicate past performance (lagging), or predict future performance (leading).

7.4.4. Process improvement

There tended to be no big picture waiting to be revealed . . . there was only process kaizen . . . focused on isolated individual steps. . . . We coined the term “kamikaze kaizen” . . . to describe the likely result: lots of commotion, many isolated victories . . . [and] loss of the war when no sustainable benefits reached the customer or the bottom line.
— Womack and Jones

Once processes are measured, the natural desire is to use the measurements to improve them. We discussed Business Process Re-engineering above. In Lean, there are the concepts of kaizen and kaikaku. Kaizen is an incremental process change; kaikaku is a more radical, abrupt shift in the overall process.

There are many ways that process improvement can go wrong.

  • Not basing process improvement in an empirical understanding of the situation

  • Process improvement activities that do not involve those affected

  • Not treating process activities as demand in and of themselves

  • Uncoordinated improvement activities, far from the bottom line

The solutions are to be found largely within Lean theory.

  • Understand the facts of the process; do not pretend to understand based on remote reports. “Go and see,” in other words.

  • Respect people, and understand that the best understanding of the situation is held by those closest to it.

  • Make time and resources available for improvement activities. For example, assign them a Problem ticket and ensure there are resources specifically tasked with working it, who are given relief from other duties.

  • Periodically review improvement activities as part of the overall portfolio. You are placing “bets” on them just as with new features. Do they merit your investment?

In the next section, we’ll look at some of the history and theory behind continuous improvement.

7.4.5. The disadvantages of process

Netflix CEO Reed Hastings, in an influential public presentation, “Netflix Culture: Freedom and Responsibility,” presents a skeptical attitude towards process. In his view, process emerges as a result of an organization’s talent pool becoming diluted with growth, while at the same time its operations become more complex.

Hastings observes that companies that become overly process-focused can reap great rewards as long as their market stays stable. However, when markets change, they also can be fatally slow to adapt.

Netflix’s strategy is to focus on hiring the highest-performance employees and keeping processes minimal. They admit that their industry (minimally-regulated, creative, non-life-critical) is well suited to this approach [120].

The pitfall of process “silos”
One organization enthusiastically embraced process improvement, with good reason: customers, suppliers, and employees found the company’s processes slow, inconsistent, and error prone. Unfortunately, they were so enthusiastic that each team defined the work of their small group or department as a complete process. Of course, each of these was, in fact, the contribution of a specialized functional group to some larger, but unidentified, processes. Each of these “processes” was “improved” independently, and you can guess what happened.

Within the boundaries of each process, improvements were implemented that made work more efficient from the perspective of the performer. However, these mini-processes were efficient largely because they had front-end constraints that made work easier for the performer but imposed a burden on the customer or the preceding process. The attendant delay and effort meant that the true business processes behaved even more poorly than they had before. This is a common outcome when processes are defined too “small.” Moral: Don’t confuse subprocesses or activities with business processes.
— Alex Sharp
Workflow Modeling

The above quote (from [246]) well illustrates the dangers of combining local optimization and process management. Many current authors speak highly of self-organizing teams, but self-organizing teams may seek to optimize locally. Process management was originally intended to overcome this problem, but modeling techniques can be applied at various levels, including within specific departments. This is where enterprise Business Architecture can assist, by identifying these longer, end to end flows of value and highlighting the handoff areas, so that the process benefits the larger objective.

Process proliferation

Another pitfall we cover here is that of process proliferation. Process is a powerful tool. Ultimately it is how value is delivered. However, too many processes can have negative results on an organization. One thing often overlooked in process management and process frameworks is any attention to the resource impacts of the process. This is a primary difference between project and process management; in process management (both theory and frameworks), resource availability is in general assumed.

More advanced forms of process modeling and simulation (see “discrete event simulation”) can provide insight into the resource demands for processes. However, such techniques 1) require specialized tooling and 2) are not part of the typical business process management practitioner’s skillset.

Many enterprise environments have multiple cross-functional processes such as:

  • Service requests

  • Compliance certifications

  • Asset validations

  • Provisioning requests

  • Capacity assessments

  • Change approvals

  • Training obligations

  • Performance assessments

  • Audit responses

  • Expense reporting

  • Travel approvals

These processes sometimes seem to be implemented on the assumption that enterprises can always accommodate another process. The result can be a dramatic overburden for digital staff in complex environments. A frequently-discussed responsibility of Scrum masters and product owners is to “run interference” and keep such enterprise processes from eroding team cohesion and effectiveness. It is, therefore, advisable to at least keep an inventory of processes that may impose demand on staff, and understand both the aggregate demand as well as the degree of multi-tasking and context-switching that may result (as discussed in Chapter 5). Thorough automation of all processes to the maximum extent possible can also drive value, to reduce latency, load, and multi-tasking.
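A process inventory of this kind need not be elaborate. The following Python sketch assumes a hypothetical inventory in which each enterprise process is recorded with an estimated effort per occurrence and a monthly frequency per person; every number, and the 160-hour working month, is an assumption made up for illustration. It totals the aggregate demand and counts the interruptions, which is a rough proxy for context-switching.

```python
# Hypothetical inventory: process name -> (hours per occurrence,
# occurrences per person per month). All figures are invented.
PROCESS_INVENTORY = {
    "Expense reporting": (1.0, 2),
    "Change approvals": (0.5, 8),
    "Training obligations": (2.0, 1),
    "Audit responses": (3.0, 1),
    "Asset validations": (0.5, 2),
}

WORKING_HOURS_PER_MONTH = 160  # assumed full-time month


def monthly_demand(inventory):
    """Return (total hours, number of separate interruptions) per person per month."""
    hours = sum(effort * count for effort, count in inventory.values())
    interruptions = sum(count for _, count in inventory.values())
    return hours, interruptions


hours, interruptions = monthly_demand(PROCESS_INVENTORY)
share = hours / WORKING_HOURS_PER_MONTH
print(f"{hours} hours ({share:.1%} of capacity) across {interruptions} interruptions")
```

The hours figure alone understates the burden. Fourteen separate interruptions in a month impose switching costs that a simple sum of effort does not capture, so both numbers are worth tracking.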

Author’s view

Rather than a simplistic view of "process bad" or "process good," it is better to view process as simply a coordination approach. It’s a powerful one with important disadvantages. It should be understood in terms of coordination contexts such as time and space shifting and predictability.

It’s also important to consider lighter weight variations on process, such as case management, checklists, and the submittal schedule.

7.5. Process control and continuous improvement

a process can still be controlled even if it can’t be defined.
— Martin Fowler
preface to Agile Software Development with Scrum

This is more advanced material, but critical to understanding the mathematical basis of Agile methods.

Process management, like project management, is a discipline unto itself and one of the most powerful tools in your toolbox. You start to realize there is a process by which process itself is managed — the process of continuous improvement. You remain concerned that work continues to flow well, that you don’t take on too much work in process, and that people are not overloaded and multi-tasking.

In this chapter section, we take a deeper look at the concept of process and how processes are managed and controlled. In particular, we will explore the concept of continuous (or continual) improvement and its rich history and complex relationship to Agile.

You are now at a stage in your company’s evolution, or your career, where an understanding of continuous improvement is helpful. Without this, you will increasingly find you don’t understand the language and motivations of leaders in your organization, especially those with business degrees or background.

There is a debate over whether to use the term “continuous” or “continual” improvement. We will use “continuous” here as it is the more commonly seen. Advocates of “continual” argue it is the more grammatically correct.

The scope of the word “process” is immense. Examples include:

  • The end to end flow of chemicals through a refinery

  • The set of activities across a manufacturing assembly line, resulting in a product for sale

  • The steps expected of a customer service representative in handling an inquiry

  • The steps followed in troubleshooting a software-based system

  • The general steps followed in creating and executing a project

  • The overall flow of work in software development, from idea to operation

This breadth of usage requires us to be specific in any discussion of the word “process.” In particular, we need to be careful in understanding the concepts of efficiency, variation, and effectiveness. These concepts lie at the heart of understanding process control and improvement and how to correctly apply it in the digital economy.

Companies institute processes because it has been long understood that repetitive activities can be optimized when they are better understood, and if they are optimized, they are more likely to be economical and even profitable. We have emphasized throughout this book that the process by which complex systems are created is not repetitive. Such creation is a process of product development, not production. And yet, the entire digital organization covers a broad spectrum of process possibilities, from the repetitive to the unique. You need to be able to identify what kind of process you are dealing with, and to choose the right techniques to manage it. (For example, the employee provisioning process flow that is shown in the appendix is simple and prescriptive. Measuring its efficiency and variability would be possible, and perhaps useful).

There are many aspects of the movement known as “continuous improvement” that we won’t cover in this brief section. Some of them (systems thinking, culture, and others) are covered elsewhere in this book. This book is based in part on Lean and Agile premises, and continuous improvement is one of the major influences on Lean and Agile, so in some ways, we come full circle. Here, we are focusing on continuous improvement in the context of processes and process improvement. We’ll therefore scope this to a few concerns: efficiency, variation, effectiveness, and process control.

7.5.1. History of continuous improvement

Figure 139. Gilbreth “scientific management” organization

History is important. You may think your career is far removed from the early days of the Industrial Revolution, but the influence of early management thinkers such as Frederick Taylor defines our world to a degree you probably don’t yet realize. You need to be able to recognize when his ideas are being applied, especially if they are being applied inappropriately (as can easily happen in modern digital organizations).

The history of continuous improvement is intertwined with the history of 20th century business itself. Before the Industrial Revolution, goods and services were produced primarily by local farmers, artisans, and merchants. Techniques were jealously guarded, not shared. A given blacksmith might have two or three workers, who might all forge a pan or a sword in a different way. The term “productivity” itself was unknown.

Then the Industrial Revolution happened.

As steam and electric power increased the productivity of industry, requiring greater sums of capital to fund, a search for improvements began. Blacksmith shops (and other craft producers such as grain millers and weavers) began to consolidate into larger organizations, and technology became more complex and dangerous. It started to become clear that allowing each worker to perform the work as they preferred was not feasible.

Enter the scientific method. Thinkers such as Frederick Taylor and Frank and Lillian Gilbreth (of Cheaper by the Dozen fame) started applying careful techniques of measurement and comparison, in search of the “one best way” to dig ditches, forge implements, or assemble vehicles. Organizations became much more specialized and hierarchical, as shown in the accompanying early organization chart by Gilbreth (see Gilbreth “scientific management” organization [63]). An entire profession of industrial engineering was established, along with the formal study of business management itself.

7.5.2. Frederick Taylor and efficiency

Figure 140. An industrial engineer observing a worker

Frederick Taylor (1856-1915) was a mechanical engineer and one of the first industrial engineers. In 1911, he wrote Principles of Scientific Management. One of Taylor’s primary contributions to management thinking was a systematic approach to efficiency. To understand this, let’s consider some fundamentals.

Human beings engage in repetitive activities. These activities consume inputs and produce outputs. It is often possible to compare the outputs against the inputs, numerically, and understand how “productive” the process is. For example, suppose you have two factories producing identical kitchen utensils (such as pizza cutters). If one factory can produce 50,000 pizza cutters for $2,000, while the other requires $5,000, the first factory is more productive.

Assume for a moment that the workers are all earning the same across each factory, and that both factories get the same prices on raw materials. There is possibly a “process” problem. The first factory is more efficient than the second; it can produce more, given the same set of inputs. Why?
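The comparison can be worked through numerically. This short Python sketch uses only the figures given above (50,000 pizza cutters, $2,000 versus $5,000); the function names are introduced here for illustration.

```python
UNITS_PRODUCED = 50_000          # pizza cutters, from the example above
COST_FIRST_FACTORY = 2_000       # dollars
COST_SECOND_FACTORY = 5_000      # dollars


def units_per_dollar(units, cost):
    """Productivity expressed as output per unit of input."""
    return units / cost


def cost_per_unit(units, cost):
    """The same comparison expressed as input per unit of output."""
    return cost / units


first = units_per_dollar(UNITS_PRODUCED, COST_FIRST_FACTORY)
second = units_per_dollar(UNITS_PRODUCED, COST_SECOND_FACTORY)
print(first, second, first / second)
print(cost_per_unit(UNITS_PRODUCED, COST_FIRST_FACTORY),
      cost_per_unit(UNITS_PRODUCED, COST_SECOND_FACTORY))
```

The first factory turns each dollar into 25 pizza cutters against 10 for the second, a ratio of 2.5 to 1. That ratio is the size of the gap an industrial engineer would set out to explain.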

There are many possible reasons. Perhaps the second factory is poorly laid out, and the work in progress must be moved too many times in order for workers to perform their tasks. Perhaps the workers are using tools that require more manual steps. Understanding the differences between the two factories, and recommending the “best way,” is what Taylor pioneered, and what industrial engineers do to this day.

As Peter Drucker, one of the most influential management thinkers, says of Frederick Taylor:

The application of knowledge to work explosively increased productivity. For hundreds of years, there had been no increase in the ability of workers to turn out goods or to move goods. But within a few years after Taylor began to apply knowledge to work, productivity began to rise at a rate of 3.5 to 4 % compound a year—which means doubling every eighteen years or so. Since Taylor began, productivity has increased some fiftyfold in all advanced countries. On this unprecedented expansion rest all the increases in both standard of living and quality of life in the developed countries. [84 pp. 37-38].
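The compounding arithmetic behind Drucker’s figures is easy to check for yourself. The sketch below assumes a constant compound annual rate, which is a simplification; real productivity growth is not constant. The function names are introduced here for illustration.

```python
import math


def doubling_time(annual_rate):
    """Years for a quantity to double at a constant compound annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


def years_to_multiply(factor, annual_rate):
    """Years for a quantity to grow by the given factor at a constant rate."""
    return math.log(factor) / math.log(1 + annual_rate)


for rate in (0.035, 0.04):
    print(f"{rate:.1%}: doubles in {doubling_time(rate):.1f} years, "
          f"fiftyfold in {years_to_multiply(50, rate):.0f} years")
```

At 4% the doubling time comes out a little under eighteen years, and at 3.5% a little over twenty, so “eighteen years or so” corresponds to the upper end of the quoted range. A fiftyfold increase at these rates takes on the order of a century.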

The history of industrial engineering is often controversial, however. Hard-won skills were analyzed and stripped from traditional craftspeople by industrial engineers with clipboards (see An industrial engineer observing a worker [64]), who now would determine the “one best way.” Workers were increasingly treated as disposable. Work was reduced to its smallest components of a repeatable movement, to be performed on the assembly line, hour after hour, day after day until the industrial engineers developed a new assembly line. Taylor was known for his contempt for the workers, and his methods were used to increase work burdens sometimes to inhuman levels. Finally, some kinds of work simply can’t be broken into constituent tasks. High-performing, collaborative, problem-solving teams do not use Taylorist principles, in general. Eventually, the term “Taylorism” was coined, and today is often used more as a criticism than a compliment.

7.5.3. W.E. Deming and variation

The quest for efficiency leads to the long-standing management interest in variability and variation. What do we mean by this?

If you expect a process to take 5 days, what do you make of occurrences when it takes 7 days? 4 days? If you expect a manufacturing process to yield 98% usable product, what do you do when it falls to 97%? 92%? In highly repeatable manufacturing processes, statistical techniques can be applied. Analyzing such “variation” has been a part of management for decades, and is an important part of disciplines such as Six Sigma. This is why Six Sigma is of such interest to manufacturing firms.

W. Edwards Deming (1900-1993) is noted for (among many other things) his understanding of variation and organizational responses to it. Understanding variation is one of the major parts of his “System of Profound Knowledge.” He emphasizes the need to distinguish special causes from common causes of variation; special causes are those requiring management attention.

Deming, in particular, was an advocate of the control chart, a technique developed by Walter Shewhart, to understand whether a process was within statistical control (see Process control chart [65]).

Figure 141. Process control chart

However, using techniques of this nature makes certain critical assumptions about the nature of the process. Understanding variation and when to manage it requires care. These techniques were defined to understand physical processes that in general follow normal distributions.
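For a repeatable process where those assumptions do hold, the basic control chart calculation looks like the following Python sketch. It is deliberately simplified: it sets limits at the mean plus or minus three sample standard deviations of a baseline period, whereas Shewhart charts in practice usually estimate the spread from subgroup or moving ranges. The cycle-time figures are invented.

```python
from statistics import mean, stdev

# Hypothetical cycle times (in days) from a stable baseline period.
BASELINE = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
# Hypothetical later observations to be checked against the baseline.
NEW_POINTS = [5.0, 5.3, 6.4]


def control_limits(samples, sigmas=3):
    """Return (lower limit, center line, upper limit) from baseline samples."""
    center = mean(samples)
    spread = stdev(samples)  # sample standard deviation; a simplifying assumption
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(points, lower, upper):
    """Points outside the limits are candidates for special-cause investigation."""
    return [p for p in points if p < lower or p > upper]


lower, center, upper = control_limits(BASELINE)
flagged = out_of_control(NEW_POINTS, lower, upper)
print(f"limits {lower:.2f} to {upper:.2f}, flagged: {flagged}")
```

Points inside the limits are treated as common-cause variation and left alone; only the flagged point would call for management attention. The whole scheme depends on the baseline being a fair picture of a stable, roughly normal process.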

For example, let’s say you are working at a large manufacturer, in their IT organization, and you see the metric of “variance from project plan.” The idea is that your actual project time, scope and resources should be the same, or close to, what you planned. In practice, this tends to become a discussion about time, as resources and scope are often fixed.

The assumption is that, for your project tasks, you should be able to estimate to a meaningful degree of accuracy. Your estimates are equally likely to be too low, or too high. Furthermore, it should be somehow possible to improve the accuracy of your estimates. Your annual review depends on this, in fact.

The problem is that neither of these is true. Despite heroic efforts, you cannot improve your estimation. In process control jargon, there are too many causes of variation for “best practices” to emerge. Project tasks remain unpredictable, and the variability does not follow a normal distribution. Very few tasks get finished earlier than you estimated, and there is a long tail to the right, of tasks that take 2x, 3x or 10x longer than estimated.
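The shape of that distribution can be illustrated with a few made-up numbers. In the Python sketch below, each value is the ratio of actual to estimated duration for a hypothetical project task; the data are invented to mirror the pattern described above, not drawn from any real project.

```python
from statistics import mean, median

# Hypothetical ratios of actual duration to estimated duration (1.0 = on time).
RATIOS = [0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 3.0, 10.0]

early_share = sum(1 for r in RATIOS if r < 1.0) / len(RATIOS)
late_share = sum(1 for r in RATIOS if r > 1.0) / len(RATIOS)

print(f"median ratio: {median(RATIOS):.2f}")
print(f"mean ratio:   {mean(RATIOS):.2f}")
print(f"finished early: {early_share:.0%}, finished late: {late_share:.0%}")
```

In a symmetric distribution the mean and median would roughly coincide and early and late finishes would balance. Here the mean is pulled far above the median by a handful of large overruns, which is the signature of a long right tail.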

Learning some statistics is essential if you want to progress in your career. This section assumes you are comfortable with the concept of a “distribution” and in particular what the “normal distribution” is.

In general, applying statistical process control to variable, creative product development processes is inappropriate. For software development, Stephen Kan states: “Many assumptions that underlie control charts are not being met in software data. Perhaps the most critical one is that data variation is from homogeneous sources of variation.” That is, the causes of variation are knowable and can be addressed. This is in general not true of development work. [146]

Deming (along with Juran) is also known for “continuous improvement” as a cycle, e.g., “Plan/Do/Check/Act” or “Define/Measure/Analyze/Improve/Control.” Such cycles are akin to the scientific method, as they essentially engage in the ongoing development and testing of hypotheses, and the implementation of validated learning. We touch on similar cycles in our discussions of Lean Startup, OODA, and Toyota Kata.

7.5.4. Lean product development and cost of delay

the purpose of controlling the process must be to influence economic outcomes. There is no other reason to be interested in process control.
— Don Reinertsen
Managing the Design Factory

Discussions of efficiency usually focus on productivity that is predicated on a certain set of inputs. Time can be one of those inputs. Everything else being equal, a company that can produce the pizza cutters more quickly is also viewed as more efficient. Customers may pay a premium for early delivery, and may penalize late delivery; such charges typically would be some percentage (say plus or minus 20%) of the final price of the finished goods.

However, the question of time becomes a game-changer in the “process” of new product development. As we have discussed previously, starting with a series of influential articles in the early 1980s, Don Reinertsen developed the idea of cost of delay for product development [220].

Where the cost of a delayed product shipment might be some percentage, the cost of delay for a delayed product could be much more substantial. For example, if a new product launch misses a key trade show where competitors will be presenting similar innovations, the cost to the company might be millions of dollars of lost revenue or more — many times the product development investment.
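A rough numerical contrast makes the difference in scale visible. In the Python sketch below, the 20% late-delivery penalty comes from the text; every other figure (order value, development investment, weekly lost revenue, weeks of delay) is invented for illustration.

```python
# Production case: a late shipment is penalized by a percentage of the price.
ORDER_PRICE = 100_000        # hypothetical order value in dollars
LATE_PENALTY_RATE = 0.20     # the "say plus or minus 20%" mentioned in the text

# Product development case: all figures below are invented for illustration.
DEVELOPMENT_INVESTMENT = 500_000
LOST_REVENUE_PER_WEEK = 250_000
WEEKS_LATE = 8


def late_shipment_penalty(price, rate):
    """Cost of delivering finished goods late: bounded by the order price."""
    return price * rate


def cost_of_delay(lost_per_week, weeks):
    """Cost of launching a product late: revenue forgone while it is not on the market."""
    return lost_per_week * weeks


penalty = late_shipment_penalty(ORDER_PRICE, LATE_PENALTY_RATE)
delay_cost = cost_of_delay(LOST_REVENUE_PER_WEEK, WEEKS_LATE)
print(f"late shipment penalty: ${penalty:,.0f}")
print(f"cost of delay: ${delay_cost:,.0f} "
      f"({delay_cost / DEVELOPMENT_INVESTMENT:.0f}x the development investment)")
```

The late shipment penalty is capped at a fraction of one order. The cost of delay has no such ceiling, and under these assumptions it is several times the entire development budget.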

This is not a question of “efficiency,” of comparing inputs to outputs and looking for a few percentage points of improvement. It is more a matter of effectiveness: the company’s ability to execute on complex knowledge work.

7.5.5. Scrum and empirical process control

process theory experts . . . were amazed and appalled that my industry, systems development, was trying to do its work using a completely inappropriate process control model.
— Ken Schwaber
Agile Software Development with Scrum

Ken Schwaber, inventor of the Scrum methodology (along with Jeff Sutherland), like many other software engineers in the 1990s, experienced discomfort with the Deming-inspired process control approach promoted by major software contractors at the time. Mainstream software development processes sought to make software development predictable and repeatable in the sense of a defined process.

As Schwaber discusses [241 pp. 24-25], defined processes are completely understood, which is not the case with creative processes. Highly automated industrial processes run predictably, with consistent results. By contrast, complex processes that are not understood require an empirical model.

Empirical process control, in the Scrum sense, relies on frequent inspection and adaptation. After exposure to DuPont process theory experts who clarified the difference between defined and empirical process control, Schwaber went on to develop the influential Scrum methodology. As he notes:

During my visit to DuPont . . . I realized why [software development] was in such trouble and had such a poor reputation. We were wasting our time trying to control our work by thinking we had an assembly line when the only proper control was frequent and first-hand inspection, followed by immediate adjustments. [241 p. 25].

There’s little question that the idea of statistical process control for digital product development is thoroughly discredited. However, this is not only a textbook on digital product development (as a form of R&D). It covers all of traditional IT management, in its new guise of the digitally transformed organization. Development is only part of digital management.

7.6. Conclusion

In conclusion, we have considered:

  • Defining dependencies and coordination

  • Principles and techniques of coordination

  • Coordination and delivery models

  • Process management

  • Background and history of continuous improvement

Coordination is a hard problem, and will only get more difficult as you scale up. However, it is not magic; as a problem, it can be defined and analyzed. Process management is one response to the coordination problem (although its value extends beyond coordination). Like any powerful tool, it has its dangers if misused. Be wary of claims of statistical process control in creative activities, and avoid burdensome process tracking and compliance approaches.

7.6.1. Discussion questions

  • Review The Secret to Amazons Success Internal APIs. Discuss in terms of solving coordination problems. Where does an API-based approach reach its limits as a coordination strategy?

  • Have you experienced a problem where an improved process would have helped? What about a problem where process seemed to be the cause of it?

  • What processes do you experience daily? Weekly?

  • Is some measurement of the process part of your experience? Think broadly.

  • Sometimes, organizations try to treat a complex process as defined, instead of managing it empirically. Sometimes, people react by calling this “Taylorism.” Why? Google and discuss.

7.6.2. Research & practice

  • Review the Strode dependency taxonomy and document examples. Present to your class for discussion.

  • Research BPMN notation and use it to document a process familiar to you.

  • Compare and contrast business process modeling with Lean value stream mapping.

  • Metrics are associated with most processes. Research well-known industry processes that have standard metrics for comparison. For example, consider air travel and how airlines are compared regarding their flight operations processes.

  • Research process simulation and prepare a report. Optionally, compare process simulation with systems dynamics modeling.

  • We know that completely defined processes can be placed under statistical control, and that creative processes cannot be. What about processes falling between these two extremes, such as help desk call management? What does the research say about statistical control of such processes?

7.6.3. Further reading

Coordination is a general and abstract topic, and the research from [259, 155] is a good place to start. Practical discussions of coordination are found in [221] (especially Chapters 7-8) and also [220, 255]. From a software practitioner’s perspective see [67]. The history of Scrum is relevant, in particular, [241]. Critiques of statistical process control for software development are found in [30, 218].

The definitive BPM reference is [234]. Good practical applications are found in [44, 116, 246]. Lean theory is also essential, e.g., [286, 228].







8. Investment and Planning

8.1. Introduction

dollar in pieces
Figure 142. How to invest in your organization?

Your decision to break your organization into multiple teams is an investment decision. You are going to devote some of your resources to one team, and some to another team. Furthermore, there will be additional spending still managed at a corporate level. If the results meet your expectations, you will then likely proceed with further investments managed by the same or similar organization structure. How do you decide on and manage these separate streams of investment? What is your approach for structuring them? How do you ensure that they are returning your desired results?

People are competitive. Your multiple teams will start to contend for investment. This is unavoidable. They will want their activities adequately supported, and their understanding of “adequate” may differ from yours, and will also vary from team to team. They will be watching that the other teams don’t get “more than their share” and are using their share effectively. You start to see that the teams need to be constantly reminded of the big picture, in order to keep their discussions and occasional disagreements constructive.

You now have a dedicated, full-time CFO and you are increasingly subject to standard accounting and budgeting rules. But annual budgeting seems to be opposed to how you have chosen to run your company to date. What alternatives are there? You begin to see that your approach to financial management affects every aspect of your company, including your product team effectiveness.

You also begin to see your vendor relationships (e.g., your cloud providers) as a form of investment. As your use of their products deepens, it becomes more difficult to switch from them, and so you spend more time evaluating before committing. You establish a more formalized approach. Open source changes the software vendor relationship to some degree, but it’s still a portfolio of commitments and relationships requiring management.

Project management is often seen as necessary for financial planning, especially regarding the efforts most critical to the business. It is seen as essential because of the desire to coordinate, manage, and plan. Having a vision isn’t worth much without some ability to forecast how long it will take and what it will cost, and to monitor progress against the forecast in an ongoing way. Project management is often defined as the execution of a given scope of work within constraints of time and budget. But questions arise. We are already executing work, without this concept of “project.” We have discussed Scrum in Chapter 4, Kanban in Chapter 5, and various organizational levels and delivery models in the introduction to Part III. Here we will examine this idea of “scope” in more detail. How would we know it in advance, so that the “constraints of time and budget” are reasonable?

As we have seen in our discussions of product management, when implementing truly new products (including digital products), estimating time and budget is challenging because the necessary information is not available. In fact, creating information (which, per Lean Product Development, requires tight feedback loops) is the actual work of the “project.” Therefore, in the new Agile world, there is some uncertainty as to the role of and even need for traditional project management. This chapter will examine some of the reasons for project management’s persistence and how it is adapting to the new product-centric reality.

In the project management literature and tradition, much focus is given to the execution aspect of project management: its ability to manage complex, interdependent work across resource limitations. We discussed project management and execution in Chapter 7. In Chapter 8, we are interested in the structural role of project management as a way of managing investments. Project management may be institutionalized through establishing an organizational function known as the Project Management Office (PMO), which may use a concept of project portfolio as a means of constraining and managing the organization’s multiple priorities. What is the relationship of the traditional PMO to the new, product-centric, digital world? [66]

8.1.1. Chapter 8 outline

In this chapter, we will cover:

  • IT financial management

    • Historical IT financial practices

    • Next generation IT finance

  • IT sourcing and contract management

    • Basic concerns

    • Outsourcing and cloudsourcing

    • Software licensing

    • The role of industry analysts

    • Software development and contracts

  • Structuring the investment

    • Features versus components

    • Epics and new products

  • Larger-scale planning and estimating

    • Why plan?

    • Planning larger efforts

  • Why project management?

    • A traditional IT project

    • How is a project different from simple “work management”?

    • The “iron triangle”

    • Project practices

    • The future of project management

  • Topics

    • Critical chain

    • The Agile project frameworks

  • Conclusion

8.1.2. Chapter 8 learning objectives

  • Describe traditional and next generation IT finance concerns

  • Describe various topics and issues in digital and IT sourcing

  • Identify and describe techniques for structuring digital investment portfolios

  • Describe basic project management practices

  • Critically evaluate the role and limitations of project management as a delivery vehicle

8.2. IT financial management

Computers initially were used to automate manual operations, and the benefits were relatively easy to forecast. As organizations use computers more for strategic purposes and management information enhancements, the benefits become harder to forecast. In a growing number of companies, the individuals most qualified to forecast the most significant costs and benefits are product managers, financial specialists, marketing specialists, and not the I/S technical specialists . . .
— Terence Quinlan
IT Financial Management Association

Financial health is an essential dimension of business health. And digital technology has been one of the fastest-growing budget items in modern enterprises. Its relationship to enterprise financial management has been a concern since the first computers were acquired for business purposes.

Financial management is a broad, complex, and evolving topic and its relationship to IT and digital management is even more so. This brief section can only cover a few basics. However, it is important for you to have an understanding of the intersection of Agile and Lean IT with finance, as your organization’s financial management approach can determine the effectiveness of your digital strategy. See the cited sources and Further Reading at the end of this chapter.

The objectives of IT finance include:

  • Providing a framework for tracking and accounting for digital income and expenses

  • Supporting financial analysis of digital strategies (business models and operating models, including sourcing)

  • Supporting the digital and IT-related aspects of the corporate budgetary and reporting processes, including internal and external policy compliance

  • Supporting (where appropriate) internal cost recovery from business units consuming digital services

  • Supporting accurate and transparent comparison of IT financial performance to peers and market offerings (benchmarking)

A company scaling up today would often make different decisions from a company that scaled up 40 years ago. This is especially apparent in the matter of how to finance digital/IT systems development and operations. The intent of this section is to explore both the traditional approaches to IT financial management and the emerging Agile/Lean responses.

This section has the following outline:

  • Historical IT financial practices

    • Annual budgeting and project funding

    • Cost accounting and chargeback

  • Next generation IT finance

    • Lean Accounting & Beyond Budgeting

    • Lean Product Development

    • Internal “venture” funding

    • Value stream orientation

    • Internal market economics

    • Service brokerage

8.2.1. Historical IT financial practices

Historically, IT financial management has been defined by two major practices:

  • An annual budgeting cycle, in which project funding is decided

  • Cost accounting, sometimes with associated internal transfers (chargebacks) as a primary tool for understanding and controlling IT expenses

Both of these practices are being challenged by Agile and Lean IT thinking.

Annual budgeting and project funding
IT organizations typically adhere to annual budgeting and planning cycles—which can involve painful rebalancing exercises across an entire portfolio of technology initiatives, as well as a sizable amount of rework and waste. This approach is anathema to companies that are seeking to deploy agile at scale. Some businesses in our research base are taking a different approach. Overall budgeting is still done yearly, but road maps and plans are revisited quarterly or monthly, and projects are reprioritized continually. [68]
— Comella-Dorda et al
An Operating Model for Company-Wide Agile Development

In the common practice of the annual budget cycle, companies once a year embark on a detailed planning effort to forecast revenues and how they will be spent. Much emphasis is placed on the accuracy of such forecasts, despite its near-impossibility. (If estimating one large software project is challenging, how much more challenging to estimate an entire enterprise’s income and expenditures?)

The annual budget has two primary components: capital expenditures and operating expenditures, commonly called CAPEX and OPEX. The rules about what must go where are fairly rigid and determined by accounting standards with some leeway for the organization’s preferences.

Software development can be capitalized, as it is an investment from which the company hopes to benefit in the future. Operations is typically expensed. Capitalized activities may be accounted for over multiple years (therefore becoming a reasonable candidate for financing and multi-year amortization). Expensed activities must be accounted for in-year.
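The timing difference between capitalized and expensed spending can be illustrated with a minimal sketch. The dollar figures, the three-year life, and the straight-line method are all assumptions chosen for illustration; actual treatment depends on the applicable accounting standards and the organization's policy.

```python
def yearly_charges(amount, years):
    """Spread `amount` (whole dollars) evenly over `years` (straight-line).

    Any remainder from integer division is added to the final year so the
    charges always sum to the original amount.
    """
    base = amount // years
    charges = [base] * years
    charges[-1] += amount - base * years
    return charges

# Hypothetical spending incurred in year 1
development = 900_000   # capitalized software development
operations = 300_000    # expensed operations

capitalized = yearly_charges(development, 3)   # recognized over three years
expensed = yearly_charges(operations, 1)       # recognized entirely in-year

print("Capitalized, per year:", capitalized)
print("Expensed, in-year:", expensed)
```

In this sketch the same year-1 cash outlay of $1,200,000 shows up as only $600,000 of year-1 charges, which is one reason the CAPEX/OPEX classification matters so much in budget negotiations.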

One can only “go to the well” once a year. As such, extensive planning and negotiation traditionally take place around the IT project portfolio. Business units and their IT partners debate the priorities for the capital budget, assess resources, and finalize a list of investments. Project managers are identified and tasked with marshaling the needed resources for execution.

This annual cycle receives much criticism in the Agile and Lean communities. From a Lean perspective, projects can be equated to large “batches” of work. Using annual projects as a basis for investment can result in misguided attempts to plan large batches of complex work in great detail so that resource requirements can be known well in advance. The history of the Agile movement is in many ways a challenge and correction of this thinking, as we have discussed throughout this book.

The execution model for digital/IT acquisition adds further complexity. Traditionally, project management has been the major funding vehicle for capital investments, distinct from the operational expense. But with the rise of cloud computing and product-centric management, companies are moving away from traditional capital projects. New products are created with greater agility, in response to market need, and without the large capital investments of the past in physical hardware.

This does not mean that traditional accounting distinctions between CAPEX and OPEX go away. Even with expensed cloud infrastructure services, software development may still be capitalized, as may software licenses.

Cost accounting and chargeback

The term “cost accounting” is not the same as just “accounting for costs,” which is always a concern for any organization. Cost accounting is defined as “the techniques for determining the costs of products, processes, projects, etc. in order to report the correct amounts on the financial statements, and assisting management in making decisions and in the planning and control of an organization . . . For example, cost accounting is used to compute the unit cost of a manufacturer’s products in order to report the cost of inventory on its balance sheet and the cost of goods sold on its income statement. This is achieved with techniques such as the allocation of manufacturing overhead costs and through the use of process costing, operations costing, and job-order costing systems.” [4]

Information technology is often consumed as a “shared service” which requires internal financial transfers. What does this mean?

Here is a traditional example. An IT group purchases a mainframe computer for $1,000,000. This mainframe is made available to various departments who are charged for it. Because the mainframe is a shared resource, it can run multiple workloads simultaneously. For the year, we see the following usage:

  • 30% Accounting

  • 40% Sales operations

  • 30% Supply chain

In the simplest direct allocation model, the departments would pay for the portion of the mainframe that they used. But things always are more complex than that.

  • What if the mainframe has excess capacity? Who pays for that?

  • What if Sales Operations stops using the mainframe? Do Accounting and Supply Chain have to make up the loss? What if Accounting decides to stop using it because of the price increase? In public utilities, this is known as a death spiral and the problem was noted as early as 1974 by Richard Nolan [198 p. 179].

  • The mainframe requires power, floor space, and cooling. How are these incorporated into the departmental charges?

  • Ultimately, the Accounting organization (and perhaps Supply Chain) are back-office cost centers as well. Does it make sense for them to be allocated revenues from company income, only to have those revenues then re-directed to the IT department?
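The direct allocation model, and the “death spiral” that follows when one consumer leaves, can be sketched in a few lines. The $1,000,000 cost and the usage percentages come from the example above; the function name and the assumption that the IT group still recovers the full cost from the remaining departments are illustrative.

```python
MAINFRAME_COST = 1_000_000  # annual cost to recover, from the example above

def allocate(cost, usage):
    """Split `cost` across departments in proportion to their usage shares.

    `usage` maps a department name to an integer usage share (e.g., percent).
    Floor division is used; it is exact for the figures in this example.
    """
    total = sum(usage.values())
    return {dept: cost * share // total for dept, share in usage.items()}

usage = {"Accounting": 30, "Sales operations": 40, "Supply chain": 30}
charges = allocate(MAINFRAME_COST, usage)

# Sales operations leaves; the full cost is still recovered from the rest.
remaining = {d: s for d, s in usage.items() if d != "Sales operations"}
charges_after_exit = allocate(MAINFRAME_COST, remaining)

for dept in remaining:
    print(dept, charges[dept], "->", charges_after_exit[dept])
```

Under these assumptions, Accounting and Supply chain each see their charge rise from $300,000 to $500,000 with no change in their own usage, which is exactly the kind of price increase that might prompt the next department to leave.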

Historically, cost accounting has been the basis for much IT financial management (see e.g., ITIL Service Strategy, [265], p.202; [217]). Such approaches traditionally seek full absorption of unit costs; that is, each “unit” of inventory ideally represents the total cost of all its inputs: materials, labor, and overhead such as machinery and buildings.

In IT/digital service organizations, there are three basic sources of cost: “cells, atoms, and bits.” That is:

  • People (i.e. their time)

  • Hardware

  • Software

However, these are “direct” costs — costs that for example a product or project manager can see in their full amount.

Another class of cost is “indirect.” The IT service might be charged $300 per square foot for data center space. This provides access to rack space, power, cooling, security, and so forth. This charge represents the bills the Facilities organization receives from power companies, mechanicals vendors, security services, and so forth — not to mention the mortgage!

Finally, the service may depend on other services. Perhaps instead of a dedicated database server, the service subscribes to a database service that gives them a high-performance relational database, but where they do not pay directly for the hardware, software, and support services the database is based on. Just to make things more complicated, the services might be external (public cloud) or internal (private cloud or other offerings).

Those are the major classes of cost. But how do we understand the “unit of inventory” in an IT services context? A variety of concepts can be used, depending on the service in question:

  • Transactions

  • Users

  • Network ports

  • Storage (e.g., gigabytes of disk)

In internal IT organizations (see "Defining consumer, customer, and sponsor") this cost accounting is then used to transfer funds from the budgets of consuming organizations to the IT budget. Sometimes this is done via simple allocations (Marketing pays 30%, Sales pays 25%, etc.), and sometimes this is done through more sophisticated approaches, such as defining unit costs for services.

For example, the fully absorbed unit cost for a customer account transaction might be determined to be $0.25; this ideally represents the total cost of the hardware, software, and staff divided by the expected transactions. Establishing the models for such costs, and then tracking them, can be a complex undertaking, requiring correspondingly complex systems.
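As a rough illustration of how such a figure might be derived, the sketch below divides total annual cost by expected volume. The individual cost figures and the transaction volume are hypothetical, chosen only so that the result matches the $0.25 in the example.

```python
def unit_cost(costs, expected_units):
    """Fully absorbed unit cost: all cost inputs divided by expected volume."""
    return sum(costs.values()) / expected_units

# Hypothetical annual costs for the customer account service
annual_costs = {"hardware": 400_000, "software": 250_000, "staff": 600_000}
expected_transactions = 5_000_000  # hypothetical forecast volume

cost_per_transaction = unit_cost(annual_costs, expected_transactions)
print(f"${cost_per_transaction:.2f} per transaction")
```

The arithmetic is trivial; the complexity the text refers to lies in gathering reliable inputs, attributing shared and indirect costs to the service, and forecasting the denominator.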

IT managers have known for years that overly detailed cost accounting approaches can result in consuming large fractions of IT resources. As AT&T financial manager John McAdam noted:

“Utilizing an excessive amount of resources to capture data takes away resources that could be used more productively to meet other customer needs. Internal processing for IT is typically 30-40% of the workload. Excessive data capturing only adds to this overhead cost.” [182]

There is also the problem that unit costing of this nature creates false expectations. Establishing an internal service pricing scheme implies that if the utilization of the service declines, so should the related billings. But if

  1. the hardware, software, and staff costs are already sunk, or relatively inflexible and

  2. the IT organization is seeking to recover costs fully

the per-transaction cost will simply have to increase if the number of transactions goes down. James R. Huntzinger discusses the problem of excess capacity distorting unit costs, and states “it is an absolutely necessary element of accurate representation of the facts of production that some provisions be made for keeping the cost of wasted time and resources separate from normal costs” [131]. Approaches for doing this will be discussed below.
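The effect can be shown with a small sketch. All figures are hypothetical, and the second function is only one possible reading of Huntzinger's advice (bill at the rate set for planned volume and report unused capacity as its own line), not a method prescribed by the sources cited here.

```python
FIXED_COST = 1_250_000        # hypothetical sunk annual cost
PLANNED_VOLUME = 5_000_000    # hypothetical volume the rate was set for

def full_recovery_rate(fixed_cost, actual_volume):
    """Rate needed to recover all fixed cost from whatever volume remains."""
    return fixed_cost / actual_volume

def split_excess_capacity(fixed_cost, planned_volume, actual_volume):
    """Bill at the planned rate and report unused capacity separately."""
    planned_rate = fixed_cost / planned_volume
    billed = planned_rate * actual_volume
    unused = fixed_cost - billed
    return planned_rate, billed, unused

actual = 4_000_000  # utilization falls by 20%

rate_full = full_recovery_rate(FIXED_COST, actual)
rate, billed, unused = split_excess_capacity(FIXED_COST, PLANNED_VOLUME, actual)

print(f"Full recovery: ${rate_full:.4f} per transaction")
print(f"Planned rate:  ${rate:.4f}, billed ${billed:,.0f}, unused capacity ${unused:,.0f}")
```

With full recovery, consumers who reduce their usage by 20% see the unit price rise from $0.25 to $0.3125. In the alternative sketch the price stays stable and the $250,000 of unused capacity becomes visible as a separate management concern.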

8.2.2. Next generation IT finance

What accounting should do is produce an unadulterated mirror of the business — an uncompromisable truth on which everyone can rely. …​ Only an informed team, after all, is truly capable of making intelligent decisions.
— Orest Fiume and Jean Cunningham
as quoted by James Huntzinger

Criticisms of traditional approaches to IT finance have increased as digital transformation accelerates and companies turn to Agile operating models. Rami Sirkia and Maarit Laanti (in a paper used as the basis for the Scaled Agile Framework's financial module) describe the following shortcomings:

  • Long planning horizons, detailed cost estimates that must frequently be updated

  • Emphasis on planning accuracy and variance analysis

  • Context-free concern over budget overruns (even if a product is succeeding in the market, variances are viewed unfavorably)

  • Bureaucratic re-approval processes for project changes

  • Inflexible and slow resource re-allocation [253]

What do critics of cost accounting, allocated chargebacks, and large batch project funding suggest as alternatives to the historical approaches? There are some limitations evident in many discussions of Lean Accounting, notably an emphasis on manufactured goods. However, a variety of themes and approaches have emerged relevant to IT services, that we will discuss below:

  • Beyond Budgeting

  • Internal “venture” funding

  • Value stream orientation

  • Lean Accounting

  • Lean Product Development

  • Internal market economics

  • Service brokerage

Beyond budgeting
Setting a numerical target and controlling performance against it is the foundation stone of the budget contract in today’s organization. But, as a concept, it is fundamentally flawed. It is like telling a motor racing driver to achieve an exact time for each lap . . . it cannot predict and control extraneous factors, any one of which could render the target totally meaningless. Nor does it help to build the capability to respond quickly to new situations. But, above all, it doesn’t teach people how to win.
— Jeremy Hope and Robin Fraser
Beyond Budgeting Questions and Answers

Beyond Budgeting is the name of a 2003 book by Jeremy Hope and Robin Fraser. It is written in part as an outcome of meetings and discussions between a number of large, mostly European firms dissatisfied with traditional budgeting approaches. Beyond Budgeting’s goals are described as:

releasing capable people from the chains of the top-down performance contract and enabling them to use the knowledge resources of the organization to satisfy customers profitably and consistently beat the competition

In particular, Beyond Budgeting critiques the concept of the “budget contract.” A simple “budget” is merely a “financial view of the future . . . [a] 'most likely outcome' given known information at the time . . . ” A “budget contract” by comparison is intended to “delegate the accountability for achieving agreed outcomes to divisional, functional, and departmental managers.” It includes concerns and mechanisms such as

  • Targets

  • Rewards

  • Plans

  • Resources

  • Coordination

  • Reporting

and is intended to “commit a subordinate or team to achieving an agreed outcome.”

Beyond Budgeting identifies various fallacies with this approach, including:

  • The idea that fixed financial targets maximize profit potential

  • Financial incentives build motivation and commitment (see discussion on Motivation)

  • Annual plans direct actions that maximize market opportunities

  • Central resource allocation optimizes efficiency

  • Central coordination brings coherence

  • Financial reports provide relevant information for strategic decision-making

Beyond the poor outcomes that these assumptions generate, as much as 20% to 30% of senior executives' time is spent on the annual budget contract. Overall, the Beyond Budgeting view is that the budget contract is

a relic from an earlier age. It is expensive, absorbs far too much time, adds little value, and should be replaced by a more appropriate performance management model. [124 p. 4], emphasis added.

Readers of this textbook should at this point notice that many of the Beyond Budgeting concerns reflect an Agile/Lean perspective. The fallacies of efficiency and central coordination have been discussed throughout this book. However, if an organization’s financial authorities remain committed to these as operating principles, the digital transformation will be difficult at best.

Beyond Budgeting proposes a number of principles for understanding and enabling organizational financial performance. These include:

  • Event-driven over annual planning

  • On-demand resources over centrally allocated resources

  • Relative targets (“beating the competition”) over fixed annual targets

  • Team-based rewards over individual rewards

  • Multi-faceted, multi-level, forward-looking analysis over analyzing historical variances

Internal “venture” funding
A handful of companies are even exploring a venture-capital-style budgeting model. Initial funding is provided for minimally viable products (MVPs), which can be released quickly, refined according to customer feedback, and relaunched in the marketplace . . . And subsequent funding is based on how those MVPs perform in the market. Under this model, companies can reduce the risk that a project will fail, since MVPs are continually monitored and development tasks reprioritized . . . [68]
— Comella-Dorda et al
An Operating Model for Company-Wide Agile Development

As we have discussed previously, product and project management are distinct. Product management, in particular, has more concern for overall business outcomes. If we take this to a logical conclusion, the product portfolio becomes a form of investment portfolio, managed not in terms of schedule and resources, but rather in terms of business results.

This implies the need for some form of internal venture funding model, to cover the initial investment in a minimum viable product. If and when this internal investment bears fruit, it may become the basis for a value stream organization, which can then serve as a vehicle for direct costs and an internal services market (see below). McKinsey reports the following case:

A large European insurance provider restructured its budgeting processes so that each product domain is assigned a share of the annual budget, to be utilized by chief product owners. (Part of the budget is also reserved for requisite maintenance costs). Budget responsibilities have been divided into three categories: a development council consisting of business and IT managers meets monthly to make go/no-go decisions on initiatives. Chief product owners are charged with the tactical allocation of funds—making quick decisions in the case of a new business opportunity, for instance—and they meet continually to rebalance allocations.

Meanwhile, product owners are responsible for ensuring execution of software development tasks within 40-hour work windows and for managing maintenance tasks and backlogs; these, too, are reviewed on a rolling basis. As a result of this shift in approach, the company has increased its budgeting flexibility and significantly improved market response times. [68]

With a rolling backlog and stable funding that decouples annual allocation from ongoing execution, the venture-funded product paradigm is likely to continue growing. A product management mindset activates a variety of useful practices, as we will discuss in the next section.

Options as a portfolio strategy

In governing for effectiveness and innovation, one technique is that of options. Related to the idea of options is parallel development. In investing terms, purchasing an option gives one the right, but not the obligation, to purchase a stock (or other value) for a given price at a given time. Options are an important tool for investors to develop strategies to compensate for market uncertainty.

What does this have to do with developing digital products?

Product development is so uncertain that sometimes it makes sense to try several approaches at once. This, in fact, was how the program to develop the first atomic bomb was successfully managed. Parallel development is analogous to an options strategy. Small, sustained investments in different development “options” can optimize development payoff in certain situations (see [221], Chapter 4). When taken to a logical conclusion, such an options strategy starts to resemble the portfolio management approaches of venture capitalists. Venture capitalists invest in a broad variety of opportunities, knowing that most, in fact, will not pan out. See discussion of internal venture funding as a business model.
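The intuition behind parallel development can be made concrete with a simplified model. It assumes that each attempt succeeds independently with the same probability, that one success is enough to capture the full payoff, and that every attempt costs the same; the probability, cost, and payoff figures are hypothetical.

```python
def chance_any_succeeds(p_single, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_single) ** attempts

def expected_net_payoff(p_single, attempts, cost_each, payoff):
    """Expected payoff minus the total spent on all attempts."""
    return chance_any_succeeds(p_single, attempts) * payoff - attempts * cost_each

P_SUCCESS = 0.5        # hypothetical chance a single approach works
COST_EACH = 100_000    # hypothetical cost of pursuing one approach
PAYOFF = 1_000_000     # hypothetical value if any approach works

results = {
    n: expected_net_payoff(P_SUCCESS, n, COST_EACH, PAYOFF) for n in range(1, 5)
}
best = max(results, key=results.get)

for n, value in results.items():
    print(n, chance_any_succeeds(P_SUCCESS, n), value)
print("Best number of parallel attempts under these assumptions:", best)
```

Under these assumptions the expected net payoff rises from one to three parallel attempts and then falls, which matches the text's qualifier that options optimize payoff only “in certain situations.” Real attempts are rarely independent or equally likely to succeed, so the model is a thinking aid rather than a pricing tool.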

It is arguable that the venture-funded model has created different attitudes and expectations towards governance in West Coast “unicorn” culture. However, it would be dangerous to assume that this model is universally applicable. A firm is more than a collection of independent sub-organizations; this is an important line of thought in management theory, starting with Coase’s “The Nature of the Firm” [63].

The idea that “Real Options” were a basis for Agile practices was proposed by Chris Matts [181]. Investment banker turned Agile coach Franz Ivancsich strongly cautions against taking options theory too far, noting that to price them correctly you have to determine the complete range of potential values for the underlying asset [145].

Lean product development
Because we never show it on our balance sheet, we do not think of [Design in Process] as an asset to be managed, and we do not manage it.
— Don Reinertsen
Managing the Design Factory

The Lean Product Development thought of Don Reinertsen was discussed extensively in Chapter 5. His emphasis on employing an economic framework for decision-making is relevant to this discussion as well. In particular, his concept of cost of delay is poorly addressed in much IT financial planning, with its concerns for full utilization, efficiency, and variance analysis. Other Lean accounting thinkers share this concern, e.g.:

the cost-management system in a lean environment must be more reflective of the physical operation. It must not be confined to monetary measures but must also include nonfinancial measures, such as quality and throughput times. [131]

Another useful Reinertsen concept is that of design in process or DIP. This is an explicit analog to the well-known Lean concept of work in process (WIP). Reinertsen makes the following points [220 p. 13]:

  • DIP is larger and more expensive to hold than WIP

  • It has much lower turn rates

  • It has much higher holding costs (e.g., due to obsolescence)

  • DIP’s liabilities are ignored due to weaknesses in accounting standards

These concerns give powerful economic incentives for managing throughput and flow, continuously re-prioritizing for the maximum economic benefit and driving towards the Lean ideal of single-piece flow.

Lean accounting
It was not enough to chase out the cost accountants from the plants; the problem was to chase cost accounting from my people’s minds.
— Taiichi Ohno

There are several often-cited motivations for cost accounting [131 p. 13]:

  • Inventory valuation (not applicable for intangible services)

  • Informing pricing strategy

  • Management of production costs

IT service costing has long presented additional challenges to traditional cost accounting. As IT Financial Management Association president Terry Quinlan notes, “Many factors have contributed to the difficulty of planning EDP expenditures at both application and overall levels. A major factor is the difficulty of measuring fixed and variable cost.” [217], p. 6.

This raises the broader question: should traditional cost accounting be the primary accounting technique used at all? Cost accounting in Lean and Agile organizations is often controversial. Lean pioneer Taiichi Ohno of Toyota thought it was a flawed approach. Huntzinger [131] identifies a variety of issues:

  • Complexity

  • Un-maintainability

  • Supplies information “too late” (i.e., does not support fast feedback)

  • “Overhead” allocations result in distortions

Shingo Prize winner Steve Bell observes:

It is usually impossible to tie . . . abstract cost allocations and the resulting variances back to the originating activities and the value they may or may not produce; thus, they can’t help you improve. But they can waste your time and distract you from the activities that produce the desired outcomes . . . [20 p. 110].

The trend in Lean accounting has been to simplify. A guiding ideal is seen in the Wikipedia article on Lean Accounting:

The “ideal” for a manufacturing company is to have only two types of transactions within the production processes; the receipt of raw materials and the shipment of finished product.

Concepts such as value stream orientation, internal market economics, and service brokering all can contribute towards achieving this ideal.

Value stream orientation
Collecting costs into traditional financial accounting categories, like labor, material, overhead, selling, distribution, and administrative, will conceal the underlying cost structure of products . . . The alternative to traditional methods . . . is the creation of an environment that moves indirect costs and allocation into direct costs.[131]
— James R. Huntzinger
Lean Cost Management

As discussed above, Lean thinking discourages the use of any concept of overhead, sometimes disparaged as “peanut butter” costing. Rather, direct assignment of all costs to the maximum degree is encouraged, in the interest of accounting simplicity.

We discussed a venture-funded product model above, as an alternative to project-centric approaches. Once a product has proven its viability and becomes an operational concern, it becomes the primary vehicle for those concerns previously handled through cost accounting and chargeback. The term for a product that has reached this stage is “value stream.” As Huntzinger notes, “Lean principles demand that companies align their operations around products and not processes.” [131 p. 19].

By combining a value stream orientation with organizational principles such as frugality, internal market economics, and decentralized decision-making (see e.g., [124 p. 12]), both Lean and Beyond Budgeting argue that more customer-satisfying and profitable results will ensue. The fact that the product, in this case, is digital (not manufactured), and the value stream centers around product development (not production) does not change the argument.

Internal market economics
value stream and product line managers, like so much in the lean world, are “fractal.”
— Womack and Jones
Lean Thinking
Coordinate cross-company interactions through “market-like” forces.
— Jeremy Hope and Robin Fraser
Beyond Budgeting Questions and Answers

IT has long been viewed as a “business within a business.” In the internal market model, services consume other services ad infinitum [185]. Sometimes the relationship is hierarchical (an application team consuming infrastructure services) and sometimes it is peer to peer (an application team consuming another’s services, or a network support team consuming email services, which in turn require network services).

The increasing number of sourcing options, including various cloud options, makes it more and more important that internal digital services be comparable to external markets. This, in turn, puts constraints on traditional IT cost recovery approaches, which often result in charges with no seeming relationship to reasonable market prices.

There are several reasons for this. One commonly cited reason is that internal IT costs include support services, and therefore cannot fairly be compared to simple retail prices (e.g., for a computer as a good).

Another, more insidious reason is the rolling in of unrelated IT overhead to product prices. We have quoted James Huntzinger’s work above in various places on this topic. Dean Meyer has elaborated this topic in greater depth from an IT financial management perspective, calling for some organizational “goods” to be funded as either Ventures (similar to above discussion) or “subsidies” (for enterprise-wide benefits such as technical standardization) [185 p. 92].

As discussed above, a particularly challenging form of IT overhead is excess capacity. The saying “the first person on the bus has to buy the bus” is often used in IT shared services, but is problematic. A new, venture-funded startup cannot operate this way — expecting the first few customers to fund the investment fully! Nor can this work in an internal market, unless heavy handed political pressure is brought to bear. This is where internal venture funding is required.

Meyer presents a sophisticated framework for understanding and managing an internal market of digital services. This is not a simple undertaking; for example, correctly setting service prices can be surprisingly complex.

Service brokerage

Finally, there is the idea that digital or IT services should be aggregated and “brokered” by some party (perhaps the descendant of a traditional IT organization). In particular, this helps with capacity management, which can be a troublesome source of internal pricing distortions. This has been seen not only in IT services, but in Lean attention to manufacturing; when unused capacity is figured into product cost accounting, distortions occur [131], Chapter 7, “Church and Excess Capacity.”

Applying Meyer’s principles, excess capacity would be identified as a Subsidy or a Venture as a distinct line item.

But cloud services can assist even further. Excess capacity often results from the quantities available in the market; e.g., one purchases hardware in large-grained capital units. But because more flexibly priced, expensed, on-demand compute services are now available, it is feasible to allocate and de-allocate capacity on demand, eliminating the need to account for excess capacity.
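
The effect of large-grained capital units can be illustrated with a short calculation. The following is a minimal Python sketch, not a costing method from the cited sources; the function names, the 64-core unit size, and the monthly demand figures are all hypothetical.

```python
import math


def purchased_capacity(peak_demand: float, unit_size: float) -> float:
    """Capacity bought in whole capital units, sized to cover peak demand."""
    return math.ceil(peak_demand / unit_size) * unit_size


def excess_by_period(demand: list, unit_size: float) -> list:
    """Unused capacity in each period when hardware is bought for the peak."""
    capacity = purchased_capacity(max(demand), unit_size)
    return [capacity - d for d in demand]


def utilization(demand: list, unit_size: float) -> float:
    """Fraction of purchased capacity actually consumed across all periods."""
    capacity = purchased_capacity(max(demand), unit_size)
    return sum(demand) / (capacity * len(demand))


# Hypothetical monthly demand in cores; hardware is assumed to come in 64-core units.
monthly_demand = [30, 45, 70, 55]
unit = 64

if __name__ == "__main__":
    print(purchased_capacity(max(monthly_demand), unit))
    print(excess_by_period(monthly_demand, unit))
    print(utilization(monthly_demand, unit))
```

With these assumed figures, two 64-core units (128 cores) are needed to cover a 70-core peak, and roughly 39% of the purchased capacity is used. Under an on-demand model in which allocation tracks demand, the excess in each period would approach zero.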

8.3. IT sourcing and contract management

IT sourcing is the set of concerns related to identifying suppliers (sources) of the necessary inputs to deliver digital value. Contract management is a critical, subsidiary concern, where digital value intersects with law.

The basic classes of inputs include:

  • People (with certain skills and knowledge)

  • Hardware

  • Software

  • Services (which themselves are composed of people, hardware, and/or software)

Practically speaking, these inputs are handled by two different functions:

  • People (in particular, full time employees) are handled by a Human Resources function.

  • Hardware, software, and services are handled by a Procurement function. Other terms associated with this are Vendor Management, Contract Management, and Supplier Management. We will not attempt to clarify or rationalize these areas in this section.

We discussed hiring and managing digital specialists in the previous chapter.

8.3.1. Basic concerns

A small company may establish binding agreements with vendors relatively casually. For example, when the founder first chose a cloud platform on which to build their product, they clicked on a button that said “I accept,” at the bottom of a lengthy legal document they didn’t necessarily read closely. This “clickwrap” agreement (see Clickwrap example) is a legally binding contract, which means that the terms and conditions it specifies are enforceable by the courts.

Clickwrap license
Figure 143. Clickwrap example

A startup may be inattentive to the full implications of its contracts for various reasons:

  • The founder does not understand the importance and consequences of legally binding agreements

  • The founder understands but feels they have little to lose (for example, they have incorporated as a limited liability company, meaning the founder’s personal assets are not at risk)

  • The service is perceived to be so broadly used that an agreement with it must be safe (if 50 other startups are using a well known cloud provider and prospering, why would a startup founder spend precious time and money on overly detailed scrutiny of its agreements?)

All of these assumptions bear some risk (and many startups have been burned on such matters), but there are many other, perhaps more pressing risks for the founder and startup.

However, by the time the company has scaled to the team of teams level, contract management is almost certainly a concern of the Chief Financial Officer. The company has more resources (“deeper pockets”), and so lawsuits start to become a concern. The complexity of the company’s services may require more specialized terms and conditions. Standard “boilerplate” contracts thus are replaced by individualized agreements. Concerns over intellectual property, the ability to terminate agreements, liability for damages and other topics require increased negotiation and counterproposing contractual language. See the case study on the 9 figure true-up for a grim scenario.

At this point, the company may have hired its own legal professional; certainly, legal fees are going up, whether as services from external law firms or internal staff.

Contract and vendor management is more than just establishing the initial agreement. The ongoing relationship must be monitored for its consistency with the contract. If the vendor agrees that its service will be available 99.999% of the time, the availability of that service should be measured, and if it falls below that threshold, contractual penalties may need to be applied.
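
To make the availability example concrete, the following minimal Python sketch converts an availability target into a downtime allowance and checks a measured period against it. The function names, the 30-day measurement period, and the sample downtime figures are illustrative assumptions, not terms from any particular contract.

```python
def allowed_downtime_minutes(target_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted in the period at the given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_days: float) -> float:
    """Availability actually achieved, as a percentage of the period."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


def sla_met(downtime_minutes: float, target_pct: float, period_days: float) -> bool:
    """True if measured downtime stays within the contractual allowance."""
    return downtime_minutes <= allowed_downtime_minutes(target_pct, period_days)


if __name__ == "__main__":
    # Hypothetical measurement: 5 minutes of downtime in a 30-day period.
    print(allowed_downtime_minutes(99.999, 30))
    print(measured_availability_pct(5, 30))
    print(sla_met(5, 99.999, 30))
```

At 99.999%, a 30-day period allows only about 26 seconds of downtime, and a full year allows a little over five minutes, which is why such commitments need to be measured rather than assumed.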

In larger organizations, where a vendor might hold multiple contracts, a concept of “vendor management” emerges. Vendors may be provided “scorecards” that aggregate metrics which describe their performance and the overall impression of the customer. Perhaps key stakeholders are internally surveyed as to their impression of the vendor’s performance and whether they would be likely to recommend them again. Low scores may penalize a vendor’s chances in future RFI/RFP processes; high scores may enhance them. Finally, advising on sourcing is one of the main services an Enterprise Architecture group may provide.

Case study: Choosing a telecommunications provider

When Company X was a startup, its telecommunications needs were limited, as were its options. The founder had one choice for Internet access, the local cable company. Even when the company moved to a larger space, as a single team startup, its options were limited.

However, it is now a company of 50, and moving yet again to a new headquarters where there are a variety of options for network carriers. The company is known to be growing, and three telecommunications companies (“carriers”) have been sending sales representatives periodically to inquire if their services might be needed.

With the move to a new facility, some systematic effort must be undertaken to choose an appropriate provider. This becomes a sub-project in its own right and is part of the larger program required to complete the move effectively.

As part of this project, a formal “Request for Information” (RFI) is sent to all the potential carriers. Part of this RFI consists of a lengthy series of questions, such as:

  • What kinds of circuits are available?

  • What is their territory?

  • How much data can they handle?

  • What do they cost (at a high level)?

  • How are they secured?

  • How stable are they (how often are they “down”)?

  • Are co-location services available (can the carrier host the company’s servers in its data centers)?

  • What other services does the carrier provide?

The responses to these questions are “scored” (assigned a numeric weighting), and the two top-scoring vendors are issued a Request for Quote (RFQ). The RFQ goes into much more detail in terms of the contract the carrier is willing to offer. After extensive discussions and negotiations, Company X’s contract team awards the business to the carrier they believe will provide the greatest value.
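
The scoring step described above can be sketched as a weighted average. This is one plausible way to do it, not a prescribed method; the criteria, weights, carrier names, and 1 to 5 scores below are all hypothetical.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of criterion scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[criterion] * w for criterion, w in weights.items()) / total_weight


def shortlist(responses: dict, weights: dict, top_n: int = 2) -> list:
    """Names of the top_n vendors, ranked by weighted score (highest first)."""
    ranked = sorted(
        responses,
        key=lambda name: weighted_score(responses[name], weights),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical weights reflecting what matters most to the buyer.
weights = {"cost": 3, "stability": 4, "security": 2, "colocation": 1}

# Hypothetical RFI responses, scored 1 (poor) to 5 (excellent) per criterion.
responses = {
    "Carrier A": {"cost": 4, "stability": 3, "security": 5, "colocation": 2},
    "Carrier B": {"cost": 3, "stability": 5, "security": 4, "colocation": 4},
    "Carrier C": {"cost": 5, "stability": 2, "security": 3, "colocation": 5},
}

if __name__ == "__main__":
    for name in responses:
        print(name, weighted_score(responses[name], weights))
    print(shortlist(responses, weights))
```

With these assumed numbers, Carrier B and Carrier A would receive the RFQ. Changing the weights can change the shortlist, so the weighting itself is worth agreeing on before the responses arrive.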

The same approach is used to establish relationships with cloud vendors, software providers, and consultants. Because the approach is so consistent, it is considered a repeatable “process.” See the chapter on process management.

8.3.2. Outsourcing and cloudsourcing

… cloud computing has some fundamental characteristics that distinguish it from traditional outsourcing …​ many cloud services are merely passive providers of computing resources, utilized by users to perform their own processing.
— Millard et al.
Cloud Computing Law

The first significant vendor relationship the startup may engage with is for cloud services. The decision whether, and how much, to leverage cloud computing remains (as of 2016) a topic of much industry discussion. JP Morgenthal reports on a 2016 discussion with industry analysts that identified the following pros and cons [190]:

Table 16. Cloud sourcing pros and cons

Pro

  • Operational costs

  • Poor availability of skilled workforce for implementing internal cloud

  • Better capital management (i.e., through expensed cloud services)

  • Difficulty in providing elastic scalability

  • Agility (faster provisioning of commercial cloud instances)

  • Building and operating data centers is expensive

  • Limiting innovation (e.g., web-scale applications may require current cloud infrastructure)

  • Private clouds are poor imitation of public cloud experience

  • Poor capacity management / resource utilization

Con

  • Data Gravity (scale of data too voluminous to easily migrate the apps and data to the cloud)

  • Security (perception cloud is not as secure)

  • Emerging scalable solutions for private clouds

  • Lack of equivalent SaaS offerings for applications being run in-house

  • Significant integration requirements between in-house apps and new ones deployed to cloud

  • Lack of ability to support migration to cloud

  • Vendor licensing (see 9 figure true-up)

  • Network latency (slow response of cloud-deployed apps)

  • Poor transparency of cloud infrastructure

  • Risk of cloud platform lock-in

Cloud can reduce costs when deployed in a sophisticated way; if a company has purchased more capacity than it needs, cloud can be more economical (review the section on virtualization economics). However, ultimately as Abbott and Fisher point out,

Large, rapidly growing companies can usually realize greater margins by using wholly owned equipment than they can by operating in the cloud. This difference arises because IaaS operators, while being able to purchase and manage their equipment cost-effectively, are still looking to make a profit on their services [2 p. 474].

Minimally, cloud services need to be controlled for consumption. Cloud providers will happily allow virtual machines to run indefinitely, and charge for them. An ecosystem of cloud brokers and value-add resellers is emerging to assist organizations with optimizing their cloud dollars.

8.3.3. Software licensing

Case study: The 9-figure “true-up”

A large enterprise had a long relationship with a major software vendor, who provided a critical software product used widely for many purposes across the enterprise.

The price for this product was set based on the power of the computer running it. A license would cost less for a computer with 4 cores and 1 gigabyte of RAM than it would for a computer with 16 cores and 8 gigabytes of RAM. The largest computers required the most expensive licenses.

As described previously, the goal of virtualization is to use one powerful physical computer to consolidate more lightly-loaded computers as “virtual machines.” This can provide significant savings.

Over the course of 3 years, the enterprise described here virtualized about 5,000 formerly physical computers, each of which had been running the vendor’s software.

However, a deadly wrinkle emerged in the software vendor’s licensing terms. The formerly physical computers were, in general, smaller machines. The new virtual farms were clusters of 16 of the most powerful computers available on the market. The vendor held that EACH of the 5,000 instances of its software running on the virtual machines was liable for the FULL licensing fee applicable to the most powerful machine!

Even though each of the 5,000 virtual machines could not possibly have been using the full capacity of the virtual farm, the vendor insisted (and was upheld) that the contract did not account for that, and there was no way of knowing whether any given VM had been using the full capacity of the farm at some point.

The dispute escalated to the CEOs of each company, but the vendor held firm. The enterprise was obliged to pay a “true-up” charge of over $100 million (9 figures).

This is not an isolated instance. Major software vendors have earned billions in such charges and continue to audit aggressively for these kinds of scenarios. This is why contracts and licenses should never be taken lightly. Even startups could be vulnerable, if licensed commercial software is used in un-authorized ways in a cloud environment, for example.
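
The arithmetic behind such a true-up is simple, which is part of what makes it dangerous. The following Python sketch uses the two figures given in the case study (5,000 instances and a charge of over $100 million); the per-license fees are purely hypothetical, since the case study does not state actual prices.

```python
INSTANCES = 5_000             # virtualized instances, from the case study
TRUE_UP_FLOOR = 100_000_000   # "over $100 million", from the case study


def implied_average_charge(total_charge: float, instances: int) -> float:
    """Average charge per instance implied by a total true-up amount."""
    return total_charge / instances


def exposure(instances: int, fee_per_instance: float) -> float:
    """Total license liability if every instance owes the given fee."""
    return instances * fee_per_instance


def true_up_gap(instances: int, assumed_fee: float, asserted_fee: float) -> float:
    """Difference between what the customer assumed and what the vendor asserts."""
    return exposure(instances, asserted_fee) - exposure(instances, assumed_fee)


# Hypothetical fees: what the enterprise assumed per small machine versus
# what the vendor asserts per instance on the most powerful machine.
ASSUMED_SMALL_MACHINE_FEE = 2_000
ASSERTED_LARGEST_MACHINE_FEE = 22_000

if __name__ == "__main__":
    print(implied_average_charge(TRUE_UP_FLOOR, INSTANCES))
    print(true_up_gap(INSTANCES, ASSUMED_SMALL_MACHINE_FEE, ASSERTED_LARGEST_MACHINE_FEE))
```

A charge of at least $100 million across 5,000 instances implies an average of at least $20,000 per instance. A per-instance difference that looks modest on a single invoice becomes a nine-figure sum once multiplied across a virtualized estate.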

As software and digital services are increasingly used by firms large and small, the contractual rights of usage become more and more critical. We mentioned a “clickwrap” licensing agreement above. Software licensing, in general, is a large and detailed topic, one presenting a substantial financial risk to the unwary firm, especially when cloud and virtualization are concerned.

Software licensing is a subset of software asset management, which is itself a subset of IT asset management, discussed in more depth in the material on process management and IT lifecycles. Software asset management in many cases relies on the integrity of a digital organization’s package management; the package manager should represent a definitive inventory of all the software in use in the organization.

Free and open-source software (sometimes abbreviated FOSS) has become an increasingly prevalent and critical component of digital products. While technically “free,” the licensing of such software can present risks. In some cases, open-source products have unclear licensing that puts them at risk of legal conflicts which may impact users of the technology [174]. In other cases, the open-source license may discourage commercial or proprietary use; for example, the GNU General Public License (GPL) requirement for disclosing derivative works causes concern among enterprises [288].

8.3.4. The role of industry analysts

When a company is faced by a sourcing question of any kind, one initial reaction is to research the market alternatives. But research is time consuming, and markets are large and complex. Therefore, the role of industry or market analyst has developed.

In the financial media, one often hears from “industry analysts” who are concerned with whether a company is a good investment in the stock market. While there is some overlap, the industry analysts we are concerned with here are more focused on advising prospective customers of a given market’s offerings.

Because sourcing and contracting are an infrequent activity, especially for smaller companies, it is valuable to have such services. Because they are researching a market and talking to vendors and customers on a regular basis, analysts can be helpful to companies in the purchasing process.

However, analysts take money from the vendors they are covering as well, leading to occasional charges of conflict of interest [cite]. How does this work? There are a couple of ways.

First, the analyst firm engages in objective research of a given market segment. They do this by developing a common set of criteria for who gets included, and a detailed set of questions to assess their capabilities.

For example, an analyst firm might define a market segment of “Cloud Infrastructure as a Service” vendors. Only vendors supporting the basic NIST guidelines for Infrastructure as a Service are invited. Then, the analyst might ask, “Do you support Software Defined Networking, e.g., Network Function Virtualization” as a question. Companies that answer “yes” will be given a higher score than companies that answer “no.” The number of questions on a major research report might be as high as 300 or even higher.

Once the report is completed, and the vendors are ranked (analyst firms typically use a two-dimensional ranking, such as the Gartner Magic Quadrant or Forrester Wave), it is made available to end users for a fee. Fees for such research might range from $500 to $5000 or more, depending on how specialized the topic, how difficult the research, and the ability of prospective customers to pay.

Large companies, e.g., those in the Fortune 500, typically would purchase an “enterprise agreement,” often defined as a named “seat” for an individual, who can then access entire categories of research.

Customers may have further questions for the analyst who wrote the research. They may be entitled to some portion of analyst time as part of their license, or they may pay extra for this privilege.

Beyond selling the research to potential customers of a market, the analyst firm has a complex relationship with the vendors they are covering. In our example of a major market research report, the analyst firm’s sales team also reaches out to the vendors who were covered. The conversation goes something like this:

“Greetings. You did very well in our recent research report. Would you like to be able to give it away to prospective customers, with your success highlighted? If so, you can sponsor the report for $50,000.”

Because the analyst report is seen as having some independence, it can be an attractive marketing tool for the vendor, who will often pay (after some negotiating) for the sponsorship. In fact, vendors have so many opportunities along these lines they often find it necessary to establish a function known as “Analyst Relations” to manage all of them.

8.3.5. Software development and contracts

Customer collaboration over contract negotiation.
— Agile Manifesto
For both suppliers and buyers of Information Technology (IT) projects, one issue repeatedly arises: how to get out of the trap of fixed pricing without the disadvantages of time and materials contracts.
— Andreas Opelt et al.
Agile Contracts: Creating and Managing Successful Projects with Scrum
What do lawyers assume is the nature of software projects? First, it is common that they view it as similar to a construction project—relatively predictable—rather than the highly uncertain and variable research and development that it usually is. Second, that in the project (1) there is a long delay before something can be delivered that is well done, with (2) late and weak feedback, (3) long payment cycles, and (4) great problems for the customer if the project is stopped at any arbitrary point in time. These assumptions are invalidated in agile development.
— Arbogast et al.
Agile Contracts Primer

Software is often developed and delivered per contractual terms. Contracts are legally binding agreements, typically developed with the assistance of lawyers. As noted in [16] (p.5), “Legal professionals are trained to act, under legal duty, to advance their client’s interests and protect them against all pitfalls, seen or unseen.” The idea of “customer collaboration over contract negotiation” may strike them as the height of naïveté.

However, Agile and Lean influences have made substantial inroads in contracting approaches.

Arbogast et al. describe the general areas of contract concern:

  • Risk, exposure, and liability

  • Flexibility

  • Clarity of obligations, expectations, and deliverables

They argue that “An agile-project contract may articulate the same limitations of liability (and related terms) as a traditional-project contract, but the agile contract will better support avoiding the very problems that a lawyer is worried about.” (p. 12)

So, what is an "agile" contract?

There are two major classes of contracts:

  • Time and materials

  • Fixed price

In a time and materials contract, the contracting firm simply pays the provider until the work is done. This means that the risk of the project overrunning its schedule or budget resides primarily with the firm hiring out the work. While this can work, there is often a desire on the part of the firm to reduce this risk. If you are hiring someone because they claim they are experts and can do the work better, cheaper, and/or quicker than your own staff, it seems reasonable that they should be willing to shoulder some of the risks.

In a fixed price contract, the vendor providing the service will make a binding commitment that (for example) “we will get the software completely written in 9 months for $3 million.” Penalties may be enforced if the software is late, and it’s up to the vendor to control development costs. If the vendor does not understand the work correctly, they may lose money on the deal.

Reconciling Agile with fixed-price contracting approaches has been a challenging topic [202]. The desire for control over a contractual relationship is historically one of the major drivers of waterfall approaches. However, since requirements cannot be fully known in advance, this is problematic.

When a contract is signed based on waterfall assumptions, the project management process of change control is typically used to govern any alterations to the scope of the effort. Each change order typically implies some increase in cost to the customer. Because of this, the perceived risk mitigation of a fixed price contract may become a false premise.

This problem has been understood for some time. Scott Ambler argued in 2005 that “It’s time to abandon the idea that fixed bids reduce risk. Clients have far more control over a project with a variable, gated approach to funding in which working software is delivered on a regular basis” [12]. Andreas Opelt states, “For agile IT projects it is, therefore, necessary to find an agreement that supports the balance between a fixed budget (maximum price range) and agile development (scope not yet defined in detail) …”

How is this done? Opelt and his co-authors further argue that the essential question revolves around the project “iron triangle”:

  • Scope

  • Cost

  • Deadline

The approach they recommend is determining which of these elements is the “fixed point” and which is estimated. In traditional waterfall projects, the scope is fixed, while costs and deadline must be estimated (a problematic approach when product development is required).

In Opelt’s view, in Agile contracting, costs and deadline are fixed, while the scope is “estimated” (understood to have some inevitable variability). “… you are never exactly aware of what details will be needed at the start of a project. On the other hand, you do not always need everything that had originally been considered to be important” [202].

Their recommended approach supports the following benefits:

  • Simplified adaptation to change

  • Non-punitive changes in scope

  • Reduced knowledge decay (large “batches” of requirements degrade in value over time)

This is achieved through:

  • Defining the contract at the level of product or project vision (epics or high-level stories; see discussion of Scrum) — not detailed specification

  • Developing high-level estimation

  • Establishing agreement for sharing the risk of product development variability

This last point, which Opelt et al. term “riskshare,” is key. If the schedule or cost expands beyond the initial estimate, both the supplier and the customer pay, according to an agreed percentage split, which they recommend fall between 30% and 70%. If the supplier proportion is too low, the contract essentially becomes time and materials. If the customer proportion is too low, the contract starts to resemble traditional fixed-price.
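
A small calculation shows how the split behaves at its extremes. The following Python sketch is one simple reading of riskshare, in which the supplier absorbs an agreed fraction of any cost overrun and the customer pays the rest. The function name, the treatment of under-runs, and all dollar figures are assumptions for illustration, not the formula of Opelt et al.

```python
def riskshare(estimate: float, actual: float, supplier_share: float) -> dict:
    """Split a cost overrun between supplier and customer.

    supplier_share is the fraction of the overrun the supplier absorbs (0 to 1).
    Assumption: if the work comes in under the estimate, the customer pays actual cost.
    """
    if not 0.0 <= supplier_share <= 1.0:
        raise ValueError("supplier_share must be between 0 and 1")
    overrun = max(actual - estimate, 0.0)
    supplier_absorbs = overrun * supplier_share
    customer_pays = min(actual, estimate) + (overrun - supplier_absorbs)
    return {
        "overrun": overrun,
        "supplier_absorbs": supplier_absorbs,
        "customer_pays": customer_pays,
    }


if __name__ == "__main__":
    # Hypothetical engagement: estimated at $3.0M, actual cost $3.6M.
    for share in (0.0, 0.5, 1.0):
        print(share, riskshare(3_000_000, 3_600_000, share))
```

With a supplier share of 0, the customer pays the full $3.6M, which is time and materials in all but name. With a supplier share of 1, the customer pays only the $3.0M estimate, which is effectively fixed price. A share of 0.5 leaves each party carrying $300,000 of the overrun.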

Incremental checkpoints are also essential; for example, the supplier/customer interactions should be high bandwidth for the first few sprints, while culture and expectations are being established and the project is developing a rhythm.

Finally, the ability for either party to exit gracefully and with a minimum penalty is needed. If the initiative is testing market response (à la Lean Startup) and the product hypothesis is falsified, there is little point in continuing the work from the customer’s point of view. And if the product vision turns out to be far more work than either party estimated, the supplier should be able to walk away (or at least insist on comprehensive re-negotiation).

These ideas are a departure from traditional contract management. As Opelt asks, “How can you sign a contract from which one party can exit at any time?” Recall, however, that (if Agile principles are applied) the customer is receiving working software continuously through the engagement (e.g., after every sprint).

In conclusion, as Arbogast et al. argue, “Contracts that promote or mandate sequential lifecycle development increase project risk … an agile approach …​ reduces risk because it limits both the scope of the deliverable and extent of the payment [and] allows for inevitable change” [16 p. 13].

8.4. Structuring the investment

Directors should monitor the progress of approved IT proposals to ensure that they are achieving objectives in required timeframes using allocated resources.
— ISO/IEC 38500:2008

Now that we understand the coordination problem better, and have discussed finance and sourcing, we are prepared to make longer term commitments to a more complicated organizational structure. As we stated in the chapter introduction, one way of looking at these longer term commitments is as investments. We start them, we decide to continue them, or we decide to halt (exit) them. In fact, we could use the term “portfolio” to describe these various investments; this is not a new concept in IT management.

The first comparison of IT investments to a portfolio was in 1974, by Richard Nolan in Managing the Data Resource Function [198].

Whatever the context for your digital products (external or internal), they are intended to provide value to your organization and ultimately your end customer. Each of them in a sense is a “bet” on how to realize this value (review the Spotify DIBB model), and represents in some sense a form of product discovery. As you deepen your abilities to understand investments, you may find yourself applying business case analysis techniques in more rigorous ways, but as always retaining a Lean Startup experimental mindset is advisable.

As you strengthen a hypothesis in a given product or feature structure, you increasingly formalize it: a clear product vision supported by dedicated resources. We’ll discuss the IT portfolio concept further in Chapter 12. In your earliest stages of differentiating your portfolio, you may first think about features versus components.

8.4.1. Features versus components

Figure 144. Features versus components

As you consider your options for partitioning your product, in terms of the AKF scaling cube, a useful and widely-adopted distinction is that between “features” and “components” (see Features versus components).

Features are what your product does. They are what the customers perceive as valuable: “scope as viewed by the customer,” according to Mark Kennaley [151 p. 169]. They may be “flowers”: defined by the value they provide externally, and encouraged to evolve with some freedom. You may be investing in new features using Lean Startup, the Spotify DIBB model, or some other hypothesis-driven approach.

Components are how your product is built, such as database and Web components. In other words, they are a form of infrastructure (but infrastructure you may need to build yourself, rather than just spin up in the cloud). They are more likely to be “cogs” — more constrained and engineered to specifications. Mike Cohn defines a component team as “a team that develops software to be delivered to another team on the project rather than directly to users” [67 p. 183].

Feature teams are dedicated to a clearly defined functional scope (such as “item search” or “customer account lookup”), while component teams are defined by their technology platform (such as “database” or “rich client”). Component teams may become shared services, which need to be carefully understood and managed (more on this to come). A component’s failure may affect multiple feature teams, which makes components riskier.

It may be easy to say that features are more important than components, but this can be carried too far. Do you want each feature team choosing its own database product? This might not be the best idea; you’ll have to hire specialists for each database product chosen. Allowing feature teams to define their own technical direction can result in brittle, fragmented architectures, technical debt, and rework. Software product management needs to be a careful balance between these two perspectives. The Scaled Agile Framework suggests that components are relatively

  • more technically focused

  • more generally re-usable

than features. SAFe also recommends a ratio of roughly 20-25% component teams to 75-80% feature teams [236].

Mike Cohn suggests the following advantages for feature teams [67 pp. 183-184]:

  • They are better able to evaluate the impact of design decisions

  • They reduce hand-off waste (a coordination problem)

  • They present less schedule risk

  • They maintain focus on delivering outcomes

He also suggests [67 pp. 186-187] that component teams are justified when:

  • Their work will be used by multiple teams

  • They reduce the sharing of specialists across teams

  • The risk of multiple approaches outweighs the disadvantages of a component team

Ultimately, the distinction between “feature versus component” is similar to the distinction between “application” and “infrastructure.” Features deliver outcomes to people whose primary interests are not defined by digital or IT. Components deliver outcomes to people whose primary interests are defined by digital or IT.

8.4.2. Epics and new products

In the last chapter, we talked of one product with multiple feature and/or component teams (see One company, one product). Features and components as we are discussing them here are large enough to require separate teams (with new coordination requirements). At an even larger scale, we have new product ideas, perhaps first seen as epics in a product backlog.

Figure 145. One company, one product

Eventually, larger and more ambitious initiatives lead to a key organizational state transition: from one product to multiple products. Consider our hypothetical startup company. At first, everyone on the team is supporting one product and dedicated to its success. There is little sense of contention with “others” in the organization. This changes with the addition of a second product team with different incentives (see One company, multiple products). Concerns for fair allocation and a sense of internal competition naturally arise out of this diversification. Fairness is deeply wired into human (and animal) brains, and the creation of a new product with an associated team provokes new dynamics in the growing company.

Figure 146. One company, multiple products

Because resources are always limited, it is critical that the demands of each product be managed using objective criteria, requiring formalization. This was a different problem when you were a tight-knit startup; you were constrained, but everyone knew they were “in it together.” Now you need some ground rules to support your increasingly diverse activities. This leads to new concerns:

  • Managing scope and preventing unintended creep or drift from the product’s original charter

  • Managing contention for enterprise or shared resources

  • Execution to timeframes (e.g., the critical trade show)

  • Coordinating dependencies (e.g., achieving larger, cross-product goals)

  • Maintaining good relationships when a team’s success depends on another team’s commitment

  • Accountability for results

Structurally, we might decide to separate a portfolio backlog from the product backlog. What does this mean?

  • The portfolio backlog is the list of potential new products that the organization might invest in

  • Each product team still has its own backlog of stories (or other representations of their work)

The DEEP backlog we discussed in Chapter 5 gets split accordingly (see Portfolio versus product backlog).

Figure 147. Portfolio versus product backlog

The decision to invest in a new product should not be taken lightly. When the decision is made, the actual process is as we covered in Chapter 4: ideally, a closed-loop, iterative process of discovering a product that is valuable, usable, and feasible.

There is one crucial difference: the investment decision is formal and internal. While we started our company with an understanding of our investment context, we looked primarily to market feedback and grew incrementally from a small scale. (Perhaps there was venture funding involved, but this book doesn’t go into that).

Now, we may have a set of competing ideas that we are thinking about placing bets on. In order to make a rational decision, we need to understand the costs and benefits of the proposed initiatives. This is difficult to do precisely, but how can we rationally choose otherwise? We have to make some assumptions and estimate the likely benefits and the effort it might take to realize them.
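As a rough illustration of comparing competing bets, the following minimal sketch ranks candidate initiatives by the ratio of estimated benefit to estimated cost. The `Initiative` structure, the initiative names, and all figures are hypothetical assumptions made for this example; a real portfolio decision would also weigh risk, cost of delay, and strategic fit.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    """A candidate entry in the portfolio backlog (hypothetical structure)."""
    name: str
    estimated_benefit: float  # assumed expected benefit, in currency units
    estimated_cost: float     # assumed expected cost, in the same units

    @property
    def benefit_cost_ratio(self) -> float:
        return self.estimated_benefit / self.estimated_cost


def rank_initiatives(initiatives):
    """Return initiatives sorted from highest to lowest benefit/cost ratio."""
    return sorted(initiatives, key=lambda i: i.benefit_cost_ratio, reverse=True)


# Illustrative figures only; none of these numbers come from the text.
candidates = [
    Initiative("Mobile app", 900_000, 600_000),
    Initiative("Partner API", 400_000, 160_000),
    Initiative("Loyalty program", 250_000, 250_000),
]

for item in rank_initiatives(candidates):
    print(f"{item.name}: benefit/cost = {item.benefit_cost_ratio:.2f}")
```

The ratio is only as good as the estimates behind it, which is why the estimates themselves should be treated as hypotheses to be revisited as feedback arrives.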

8.5. Larger-scale planning and estimating

8.5.1. Why plan?

Fundamentally, we plan for two reasons:

  • To decide whether to make an investment

  • To ensure the investment’s effort progresses effectively and efficiently.

We’ve discussed investment decision making in terms of the overall business context, in terms of the product roadmap, the product backlog, and in terms of Lean Product Development and cost of delay. As we think about making larger-scale, multi-team digital investments, all of these practices come together to support our decision making process. Estimating the likely time and cost of one or more larger-scale digital product investments is not rocket science; doing so is based on the same techniques we have used at the single-team, single-product level.

Increasing scope of work and a longer time horizon tend to bring increasing uncertainty. We know that we will use fast feedback and ongoing hypothesis-driven development to control for this uncertainty. But at some point, we either make a decision to invest in a given feature or product and start the hypothesis testing cycle, or we don’t.

Once we have made this decision, there are various techniques we can use to prioritize the work so that the most significant risks and hypotheses are addressed soonest. But in any case, when large blocks of funding are at issue, there will be some expectation of monitoring and communication. In order to monitor, we have to have some kind of baseline expectation to monitor against. Longer-horizon artifacts such as the product roadmap and release plan are usually the basis for monitoring and reporting on product or initiative progress.

In planning and execution, we seek to optimize the following contradictory goals:

  • Delivering maximum value (outcomes)

  • Minimizing the waste of un-utilized resources (people, time, equipment, software)

Obviously, we want outcomes (digital value), but we want them within constraints. They have to arrive within a timeframe that makes economic sense. If we pay forty people to do work that a competitor or supplier can do with three, we have not produced a valuable outcome relative to the market. If we take twelve months to do something that someone else can do in five, again, our value is suspect. If we purchase software or hardware we don’t need (or before we need it) and as a result, our initiative’s total costs go up relative to alternatives, we again may not be creating value. Many of the techniques suggested here are familiar to formal project management. Project management has the deepest tools, and whether or not you use a formal project structure, you will find yourself facing similar thought processes as you scale.

To meet these value goals, we need to:

  • estimate so that expected benefits can be compared to expected costs, ultimately to inform the investment decision (start, continue, stop)

  • plan so that we understand dependencies (e.g., when one team must complete a task before another team can start theirs)

Projecting expected benefits is challenging. One of the most useful references for such questions is the book How to Measure Anything: Finding the Value of Intangibles in Business by Doug Hubbard [127].

Estimation sometimes causes controversy. When a team is asked for a projected delivery date, the temptation for management is to “hold them accountable” for that date and penalize them for not delivering value by then. But product discovery is inherently uncertain, and therefore such penalties can seem arbitrary. Experiments show that when animals are penalized unpredictably, they fall into a condition known as “learned helplessness,” in which they stop trying to avoid the penalties [284].

We discussed various coordination tools and techniques previously. Developing plans for understanding dependencies is one of the best known such techniques. An example of such a planning dependency would be that the database product should be chosen and configured before any schema development takes place (this might be a component team working with a feature team).
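The dependency example above can be sketched as a small graph that is sorted into a workable order. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9 or later); the task names are hypothetical, and only the database-before-schema dependency comes from the text.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical, chosen to mirror the database/schema example.
dependencies = {
    "choose database product": set(),
    "configure database": {"choose database product"},
    "develop schema": {"configure database"},
    "build item search feature": {"develop schema"},
}


def plan_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


for step, task in enumerate(plan_order(dependencies), start=1):
    print(step, task)
```

If the dependencies were circular (for example, two teams each waiting on the other), `static_order()` would raise `graphlib.CycleError`, which is a useful signal that the plan itself needs to be renegotiated.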

8.5.2. Planning larger efforts

…​ many large projects need to announce and commit to deadlines many months in advance, and many large projects do have interteam dependencies …​
— Mike Cohn
Agile Estimating and Planning

Agile and adaptive techniques can be used to plan larger, multi-team programs. Again, we have covered many fundamentals of product vision, estimation, and work management in earlier chapters. Here, we are interested in the concerns that emerge at a larger scale, which we can generally class into:

  • Accountability

  • Coordination

  • Risk management


Accountability

With larger blocks of funding comes higher visibility and inquiries as to progress. At a program level, differentiating between estimates and commitments becomes even more essential.


Coordination

Mike Cohn suggests that larger efforts specifically can benefit from the following coordination techniques [66]:

  • Estimation baseline (velocity)

  • Key details sooner

  • Lookahead planning

  • Feeding buffers

Estimating across multiple teams is difficult without a common scale, and Cohn proposes an approach for determining this in terms of team velocity. He also suggests that in larger projects, some details will require more advance planning (it is easy to see APIs as being one example), and some team members' time should be devoted to planning for the next release. Finally, where dependencies exist, buffers should be used; that is, if Team A needs something from Team B by May 1, Team B should plan on delivering it by April 15.
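The feeding buffer in the Team A and Team B example amounts to simple date arithmetic. The sketch below is illustrative only: the function name `buffered_due_date` and the year are assumptions, and the 16-day buffer is simply the gap between April 15 and May 1, not a recommended size.

```python
from datetime import date, timedelta


def buffered_due_date(needed_by: date, buffer_days: int) -> date:
    """Return the date the supplying team should plan to deliver by."""
    if buffer_days < 0:
        raise ValueError("buffer_days must not be negative")
    return needed_by - timedelta(days=buffer_days)


# Team A needs the deliverable by May 1 (the year is an arbitrary assumption).
team_a_needs_by = date(2025, 5, 1)
team_b_target = buffered_due_date(team_a_needs_by, buffer_days=16)
print(team_b_target.isoformat())  # 2025-04-15
```

How large the buffer should be depends on how uncertain the upstream work is; the point is that the buffer is planned explicitly rather than hoped for.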

Risk management

Finally, risk and contingency planning is essential. In developing any plan, Abbott and Fisher recommend the “5-95 rule”: 5% of the time on building a good plan, and 95% of the time planning for contingencies [2 p. 105]. We’ll discuss risk management in detail in Chapter 10.

8.6. Why project management?

An ongoing work effort is generally a repetitive process that follows an organization’s existing procedures. In contrast, because of the unique nature of projects, there may be uncertainties or differences in the products, services, or results that the project creates.
— Project Management Body of Knowledge
version 5

Projects of all types and sizes are now the way that organizations accomplish their work [emphasis added].
— Stanley Portny
Project Management for Dummies 4th ed.

… the project as a vehicle of IT execution has, by and large, failed to live up to its promise of predictable delivery.
— Sriram Narayan
“Scaling Agile: Problems and Solutions”

Project management responsibilities are no longer exercised by one person. They are split across the members of the Scrum team instead.
— Roman Pichler
“Agile Product Management with Scrum”

Agile is having a profound impact on the project management profession and will cause us to fundamentally rethink many of the well-established notions of what a project manager is …
— Charles G. Cobb
The Project Manager's Guide to Mastering Agile: Principles and Practices for an Adaptive Approach

In our emergence model, we always seek to make clear why we need a new concept or practice. It is not sufficient to say, “we need project management because companies of our size use it.” Many authoritative books on Agile software development assume that some form of project management is going to be used. Other authors question the need for it or at least raise various cautions.

Project management, like many other areas of IT practice, is undergoing a considerable transformation in response to the Agile transition. However, it will likely remain an important tool for value delivery at various scales.

Fundamentally, project management is a means of understanding and building a shared mental model of a given scope of work. In particular, planning the necessary tasks gives a basis for estimating the time and cost of the work as a whole, and therefore understanding its value. Even though industry practices are changing, value remains a critical concern for the digital professional.

As the above quotes indicate, there are diverse opinions on the role and importance of traditional project management in the enterprise. Clearly, it is under pressure from the Agile movement. Project management professionals are advised not to deny or diminish this fact. One of the primary criticisms of project management as a paradigm is that it promotes large “batches” of work. It is possible for a modern, IT-centric organization to make considerable progress on the basis of product management plus simple, continuous work management, without the overhead of the formalized project lifecycle suggested by PMBOK.

Cloud computing is having impacts on traditional project management as well. As we will see in the section on the decline of traditional IT, projects were often used to install vendor-delivered commodity software, such as for payroll or employee expense. Increasingly, that kind of functionality is delivered by online service providers, leaving “traditional” internal IT with considerably reduced responsibilities.

Some of the IT capability may remain in the guise of an internal “service broker,” to assist with the sourcing and procurement of online services. The remainder moves into digital product management, as the only need for internal IT is in the area of revenue-generating, market-facing strategic digital product.

So, this section will examine the following questions:

  • Given the above trends, under what circumstances does formalized project management make economic sense?

  • Assuming that formalized project management is employed, how does one continue to support objectives such as fast feedback and adaptability?

8.6.1. A traditional IT project

So, what does all this have to do with IT? As we have discussed in previous chapters, project management is one of the main tools used to deliver value across specialized skill-based teams, especially in traditional IT organizations.

A “traditional” IT project would usually start with the “sponsorship” of some executive with authority to request funding. For example, suppose that the VP of Logistics under the Chief Operating Officer (COO) believes that a new supply chain system is required. With the sponsorship of the COO, she puts in a request (possibly called a “demand request” although this varies by organization) to implement this system. The assumption is that a commercial software package will be acquired and implemented. The IT department serves as an overall coordinator for this project. In many cases, the “demand request” is registered with the enterprise Project Management Office, which may report to the CIO.

Why might the Enterprise Project Management Office report under the CIO? IT projects in many companies represent the single largest type of internally managed capital expenditure. The other major form of project, the building project, is usually outsourced to a general contractor.

The project is initiated by establishing a charter, allocating the funding, assigning a project manager, establishing communication channels to stakeholders, and a variety of other activities. One of the first major activities of the project will be to select the product to be used. The project team (perhaps with support from the architecture group) will help lead the RFI/RFQ processes by which vendors are evaluated and selected.

RFI stands for Request for Information; RFQ stands for Request for Quote.

Once the product is chosen, the project must identify the staff who will work on it, perhaps a combination of full time employees and contractors, and the systems implementation lifecycle can start.

We might call the above the systems implementation lifecycle rather than the software development lifecycle. This is because most of the hard software development was done by the third party who created the supply chain software. There may be some configuration or customization (adding new fields, screens, reports) but this is lightweight work in comparison to the software engineering required to create a system of this nature.

The system requires its own hardware (servers, storage, perhaps a dedicated switch) and specifying this in some detail is required for the purchasing process to start. The capital investment may be hundreds of thousands or millions of dollars. This, in turn, requires extensive planning and senior executive approval for the project as a whole.

It would not have been much different for a fully in-house developed application, except that more money would have gone to developers. The slow infrastructure supply chain still drove much of the behavior, and correctly “sizing” this infrastructure was a challenge particularly for in-house developed software. (The vendors of commercial software would usually have a better idea of the infrastructure required for a given load). Hence, there is much attention to up-front planning. Without requirements there is no analysis or design; without design, how do you know how many servers to buy?

Ultimately, the project comes to an end, and the results (if it is a product such as a digital service) are transitioned to a “production” state. Traditional IT implementation lifecycle presents a graphical depiction.

Figure 148. Traditional IT implementation lifecycle

There are a number of problems with this classic model, starting with the lack of responsiveness to consumer needs (see Customer responsiveness in traditional model).

Figure 149. Customer responsiveness in traditional model

This might be OK for a non-competitive function, but if the “digital service consumer” has other options, they may go elsewhere. If they are an internal user within an enterprise, they might be engaged in critical competitive activities.

The decline of the “traditional” IT project

The above scenario is in decline, and along with it a way of life for many “IT” professionals. One primary reason is cloud, and in particular SaaS. Another reason is the increasing adoption of the Lean/Agile product development approach for digital services. Traditional enterprise IT “space” presents one view of the classic model.

Figure 150. Traditional enterprise IT “space”

Notice the long triangles labeled “Producing focus” and “Consuming focus.” These represent the perspectives of (for example) a software vendor versus their customer. Traditionally, the R&D functions were most mature within the product companies. What was less well understood was that internal IT development was also a form of R&D. Because of the desire for scope management (predictability and control), the IT department performing systems development was often trapped in the worst of both worlds — having neither a good quality product nor high levels of certainty. For many years, this was accepted by the industry as the best that could be expected. However, the combination of Lean/Agile and cloud is changing this situation (see Shrinking space for traditional IT).

There is diminishing reason to run commodity software (e.g., payroll, expenses, HR) in-house. Cloud providers such as Workday, Concur, Salesforce, and others provide ready access to the desired functionality “as a service.” The responsiveness and excellence of such products are increasing, due to the increased tempo of market feedback (note that while a human resource management system may be a commodity for your company, it is strategic for Workday), and concerns over security and data privacy are rapidly fading.

What is left internal to the enterprise, increasingly, are those initiatives deemed “competitive” or “strategic.” Usually, this means that they are going to contribute to a revenue stream. This, in turn, means they are “products” or significant components of them. (See Chapter 4, Product Management). A significant market-facing product initiative (still calling for project management per se) might start with the identification of a large, interrelated set of features, perhaps termed an “epic.” Hardware acquisition is a thing of the past, due to either private or public cloud. The team starts with analyzing the overall structure of the epic, decomposing it into stories and features, and organizing them into a logical sequence.

Figure 151. Shrinking space for traditional IT

Because capacity is available on-demand, new systems do not need to be nearly as precisely “sized,” which means that implementation can commence without as much up-front analysis. Simpler architectures suffice until the real load is proven. It might then be a scramble to refactor software to take advantage of new capacity, but the overall economic effect is positive, as over-engineering and over-capacity are increasingly avoided. So, IT moves in two directions: its most forward-looking elements align to the enterprise product management roadmap, while its remaining capabilities may deliver value as a “service broker.” (More on this in the section on IT sourcing).

Let’s return to the question of project management in this new world.

8.6.2. How is a project different from simple “work management"?

In Chapter 5, we covered a simple concept of “work management” that deliberately did not differentiate product, project, and/or process-based work. As was noted at the time, for smaller organizations, most or all of the organization would be the “project team,” so what would be the point?

A project starts off as a list of tasks that is essentially identical to a product backlog. Even in Kanban, we know who is doing what, so what is the difference? Here are key points:

  • The project is explicitly time-bound. As a whole, it is lengthier and more flexible than the repetitive, time-boxed sprints of Scrum, but more fixed than the ongoing flow of Kanban.

  • Dependencies. You may have had a concept of one task or story blocking another, and perhaps you used a white board to outline more complex sequences of work, but project management has an explicit concept of dependencies in the tasks and powerful tools to manage them. This is essential in the most ambitious and complex product efforts.

  • Project management also has more robust tools for managing people’s time and effort, especially as they translate to project funding. While estimation and ongoing re-planning of spending can be a contentious aspect of project management, it remains a critical part of management practice in both IT and non-IT domains.

At the end of the day, people expect to be paid for their time, and investors expect to be compensated through the delivery of results. How long investment capital lasts is a function of an organization’s “burn rate”: the rate at which the money is consumed for salaries and expenses. Some forecasting of status (whether that of a project, organization, product, program, or what have you) is, therefore, an essential and unavoidable obligation of management unless funding is unlimited (a rare situation to say the least).
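The relationship between capital and burn rate can be made concrete with a short calculation. This is a simplified sketch that assumes a constant monthly burn and no incoming revenue; the function name and the figures are hypothetical.

```python
def runway_months(capital: float, monthly_burn: float) -> float:
    """Return how many months the capital lasts at a constant burn rate.

    Simplifying assumptions: burn is constant and no revenue arrives.
    """
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return capital / monthly_burn


# Illustrative figures only.
remaining_capital = 1_200_000   # currency units on hand
monthly_spend = 150_000         # salaries plus expenses per month
print(f"Runway: {runway_months(remaining_capital, monthly_spend):.1f} months")
```

Even a forecast this crude tells management when a continue-or-stop decision will be forced, which is why some form of status forecasting is hard to avoid.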

Project accounting, at scale, is a deep area with considerable research and theory behind it. In particular, the concept of Earned Value Management is widely used to quantify the performance of a project portfolio.
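For readers unfamiliar with Earned Value Management, the sketch below computes its most commonly cited indicators from three inputs: planned value (PV), earned value (EV), and actual cost (AC). The class name and the sample figures are assumptions for illustration; the formulas are the standard variance and index definitions, not a complete treatment of the method.

```python
from dataclasses import dataclass


@dataclass
class EarnedValueSnapshot:
    """Point-in-time earned value figures for one project (hypothetical type)."""
    planned_value: float  # PV: budgeted cost of the work scheduled to date
    earned_value: float   # EV: budgeted cost of the work actually completed
    actual_cost: float    # AC: actual cost of the work completed

    @property
    def cost_variance(self) -> float:
        return self.earned_value - self.actual_cost    # CV = EV - AC

    @property
    def schedule_variance(self) -> float:
        return self.earned_value - self.planned_value  # SV = EV - PV

    @property
    def cost_performance_index(self) -> float:
        return self.earned_value / self.actual_cost    # CPI = EV / AC

    @property
    def schedule_performance_index(self) -> float:
        return self.earned_value / self.planned_value  # SPI = EV / PV


# Illustrative figures only.
snapshot = EarnedValueSnapshot(planned_value=100_000,
                               earned_value=80_000,
                               actual_cost=90_000)
print(f"CPI = {snapshot.cost_performance_index:.2f}")
print(f"SPI = {snapshot.schedule_performance_index:.2f}")
```

An index below 1.0 indicates that the project is over cost (CPI) or behind schedule (SPI) relative to its baseline, so the sample project here is both.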

8.6.3. The “iron triangle”

Figure 152. Project “Iron Triangle”

The project management Iron Triangle represents the interaction of cost, time, scope, and quality of a project (see Project “Iron Triangle” [67]). The idea is that, in general, one or more of these factors may be a constraint. The “Pick any Two” sign is often seen in service organizations (see Pick any two [68]).

Figure 153. Pick any two

The same applies to project management and reflects well the “iron triangle” of trade-offs. However, more recent thinking in the DevOps movement suggests that optimizing for continuous flow and speed tends to have beneficial impacts on quality as well. As digital pipelines increase their automation and speed to delivery, quality also increases because testing and building become more predictable. Conversely, the idea that stability increases through injecting delay into the deployment process (i.e. through formal Change Management) is also under question (see [95]).

8.6.4. Project practices

Project management (NOT restricted to IT) is a defined area of study, theory, and professional practice. This section provides a (necessarily brief) overview of these topics.

We will first discuss the Project Management Body of Knowledge, which is the leading industry framework in project management, at least in the United States. (PRINCE2 is another framework, originating from the UK, which will not be covered in this edition). We will spend some time on the critical issues of scope management which drive some of the conflicts seen between traditional project management and Agile product management.

PMBOK details are easily obtained on the web, and will not be repeated here. (See the PMBOK summary and project management overview). It’s clear that the Agile critiques of waterfall project management have been taken seriously by the PMBOK thought leaders. There is now a PMI Agile certification and much explicit recognition of the need for iterative and incremental approaches to project work.

PMBOK remains extensive and complex when considered as a whole. This is necessary, as it is used to manage extraordinarily complex and costly efforts in domains such as construction, military/aerospace, government, and others. Some of these efforts (especially those involving systems engineering, over and above software engineering) do have requirements for extensive planning and control that PMBOK meets well.

However, in Agile domains that seek to be more adaptive to changing business dynamics, full use of the PMBOK framework may be unnecessary and wasteful. The accepted response is to “tailor” the guidance, omitting those plans and deliverables that are not needed.

Part of the problem with extensive frameworks such as PMBOK is that knowing how and when to tailor them is hard-won knowledge that is not part of the usual formalized training. And yet, without some idea of “what matters” in applying the framework, there is great risk of wasted effort. The Agile movement in some ways is a reaction to the waste that can result from overly detailed frameworks.

Scope management

Scope management is a powerful tool and concept, at the heart of the most challenging debates around project management. PMBOK uses the following definitions [215]:

Scope. The sum of the products, services, and results to be provided as a project. See also project scope and product scope.

Scope Change. Any change to the project scope. A scope change almost always requires an adjustment to the project cost or schedule.

Scope Creep. The uncontrolled expansion of product or project scope without adjustments to time, cost, and resources.

Change Control. A process whereby modifications to documents, deliverables, or baselines associated with the project are identified, documented, approved, or rejected.

In the Lean Startup world, products may pivot and pivot again, and their resource requirements may flex rapidly based on market opportunity. Formal project change control processes are in general not used. Even in larger organizations, product teams may be granted certain leeway to adapt their “products, services, and results” and while such adaptations need to be transparent, formal project change control is not the vehicle used.

On the other hand, remember our emergence model. The simple organizational change from one to multiple products may provoke certain concerns and a new kind of contention for resources. People are inherently competitive and also have a sense of fairness. A new product team that seems to be unaccountable for results, consuming “more than its share” of the budget while failing to meet the original vision for their existence, will cause conflict and concern among organizational leadership.

It is in the tension between product autonomy and accountability that we see project management techniques such as the work breakdown structure and project change control employed. The work breakdown structure is defined by the Project Management Body of Knowledge as

… a hierarchical decomposition of the total scope of work to be carried out by the project team to accomplish the project objectives and create the required deliverables. The WBS organizes and defines the total scope of the project, and represents the work specified in the current approved project [215].

Portny [214] recommends: “Subdivide your WBS component into additional deliverables if you think either of the following situations applies: The component will take much longer than two calendar weeks to complete. The component will require much more than 80 person-hours to complete.”
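The subdivision rule quoted above can be sketched as a walk over a WBS tree that flags leaf components exceeding either threshold. This is a minimal illustration: the `WbsComponent` type, the component names, and the estimates are hypothetical, and the rule's “much longer” and “much more” are simplified here to strict comparisons against 14 calendar days and 80 person-hours.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WbsComponent:
    """One node in a work breakdown structure (hypothetical type)."""
    name: str
    calendar_days: float = 0.0
    person_hours: float = 0.0
    children: List["WbsComponent"] = field(default_factory=list)


def leaves_to_subdivide(component, max_days=14, max_hours=80):
    """Return names of leaf components that exceed either threshold."""
    if not component.children:
        too_big = (component.calendar_days > max_days
                   or component.person_hours > max_hours)
        return [component.name] if too_big else []
    names = []
    for child in component.children:
        names.extend(leaves_to_subdivide(child, max_days, max_hours))
    return names


# Illustrative WBS; all names and estimates are invented for this example.
wbs = WbsComponent("Supply chain system", children=[
    WbsComponent("Vendor selection", calendar_days=10, person_hours=60),
    WbsComponent("Data migration", calendar_days=30, person_hours=200),
    WbsComponent("Reports", children=[
        WbsComponent("Inventory report", calendar_days=5, person_hours=40),
        WbsComponent("Shipping report", calendar_days=12, person_hours=120),
    ]),
])

print(leaves_to_subdivide(wbs))
```

A check like this is mechanical; it says nothing about whether the information needed to decompose the work actually exists yet, which is the difficulty discussed next.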

This may seem reasonable, but in iterative product development, it can be difficult to “decompose” a problem in the way project management seems to require, or to estimate in the way Portny suggests. This can lead to two problems.

First, the WBS may be created at a seemingly appropriate level of detail, but since it is created before key information is generated, it is inevitably wrong and needs ongoing correction. If the project management approach requires a high-effort “project change management” process, much waste may result as “approvals” are sought for each feedback cycle. This may result in increasing disregard by the development team for the project manager and his/her plan, and corresponding cultural risks of disengagement and lowering of trust on all sides.

Second, we may see the creation of project plans that are too high-level, omitting information that is in fact known at the time — for example, external deadlines or resource constraints. This happens because the team develops a cultural attitude that is averse to all planning and estimation.

Project risk management

Project management is where we see the first formalization of risk management (which will be more extensively covered in Chapter 10). Briefly, risk is classically defined as the probability of an adverse event times its cost. Project managers are alert to risks to their timelines, resource estimates, and deliverables.

Risks may be formally identified in project management tooling. They may be accepted, avoided, transferred, or mitigated. Unmanaged risks to a project may result in the project as a whole reporting an unfavorable status.
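As a minimal sketch of the classical definition (probability of an adverse event times its cost), the following ranks a few risks by exposure. The `Risk` class, the sample risks, and all of the figures are invented for illustration and do not come from any real project.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    probability: float  # chance the adverse event occurs, between 0 and 1
    cost: float         # estimated cost if it does occur

    @property
    def exposure(self) -> float:
        """Classical risk: probability of the adverse event times its cost."""
        return self.probability * self.cost


# Hypothetical register entries; the numbers are illustrative only.
risks = [
    Risk("Key vendor misses delivery date", probability=0.25, cost=40_000),
    Risk("Lead developer leaves mid-project", probability=0.10, cost=120_000),
    Risk("Load test reveals capacity shortfall", probability=0.50, cost=16_000),
]

# Print the register with the largest exposure first.
for risk in sorted(risks, key=lambda r: r.exposure, reverse=True):
    print(f"{risk.exposure:>10,.0f}  {risk.description}")
```

A ranking like this only suggests where attention might go first; whether a given risk is accepted, avoided, transferred, or mitigated remains a management decision.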

Project assignment

Enterprise IT organizations have evolved to use a mix of project management, processes, and ad hoc work routing to achieve their results. Often, resources (people) are assigned to multiple projects; a practice sometimes called “fractional allocation.”

In fractional allocation, a database administrator will work 25% on one project, 25% on another, and still be expected to work 50% on ongoing production support. This may appear to work mathematically, but practically it is an ineffective practice. Both Gene Kim in The Phoenix Project [153] and Eli Goldratt in Critical Chain [108] present dramatized accounts of the overburden and gridlock that can result from such approaches.

As previously discussed, human beings are notably bad at multi-tasking, and the mental “context-switching” required to move from one task to another is wasteful and ultimately not scalable. A human being fractionally allocated to more and more projects will get less and less done in total, as the transactional friction of task switching increases.
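The effect can be illustrated with a toy model. It assumes, purely for illustration, that every project beyond the first costs a fixed 10% of a person's total time in context-switching; that figure and the function name are hypothetical, not measured values.

```python
def productive_fraction(num_projects: int, switch_overhead: float = 0.10) -> float:
    """Share of a person's time left for real work.

    Assumes (hypothetically) that every project beyond the first costs a
    fixed `switch_overhead` fraction of total time in context-switching.
    """
    if num_projects < 1:
        raise ValueError("num_projects must be at least 1")
    lost = switch_overhead * (num_projects - 1)
    return max(0.0, 1.0 - lost)


for n in range(1, 6):
    total = productive_fraction(n)
    print(f"{n} project(s): {total:.0%} productive, {total / n:.0%} per project")
```

Under this assumption, the total productive time shrinks as projects are added, and the share each project receives shrinks faster still, which is why an allocation that adds up to 100% on paper may not deliver 100% in practice.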

Governing outsourced work

A third major reason for the continued use of project management and its techniques is governing work that has been outsourced to third parties. This is covered in detail in the section on sourcing.

8.6.5. The future of project management

Recall our three “Ps”:

  • Product

  • Project

  • Process

Taken together, the three represent a coherent set of concerns for value delivery in various forms. But in isolation, any one of them ultimately is limited. This is a particular challenge for project management, whose practitioners may identify deeply with their chosen field of expertise.

Clearly, formalized project management is under pressure. Its methods are perceived by the Agile community as overly heavyweight; its practitioners are criticized for focusing too much on success in terms of cost and schedule performance and not enough on business outcomes. Because projects are by definition temporary, project managers have little incentive to care about technical debt or operational consequences. Hence the rise of the product manager.

However, a product manager who does not understand the fundamentals of project execution will not succeed. As we have seen, modern products, especially in organizations scaling up, have dependencies and coordination needs, and to meet those needs, project management tools will continue to provide value.

Loose coupling to the project plan rescue?

While this book does not go into systems architectural styles in depth, a project with a large number of dependencies may be an indication that the system or product being constructed also has significant interdependencies. Recall Amazon’s product strategy including its API mandate.

Successful systems designers for years have relied on concepts such as encapsulation, abstraction, and loose coupling to minimize the dependencies between components of complex systems so that their design, construction, and operation can be managed with some degree of independence. These ideas are core to the software engineering literature. Recent expressions of these core ideas are Service-Oriented Architecture and microservices.

Systems that do not adopt such approaches are often termed “monolithic” and have a well deserved reputation for being problematic to build and operate. Many large software failures stem from such approaches. If you have a project plan with excessive dependencies, the question at least should be asked: does my massive, tightly-coupled project plan indicate I am building a monolithic, tightly-coupled system that will not be flexible or responsive to change?
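To make encapsulation and loose coupling concrete, here is a minimal sketch in which an ordering component depends only on a narrow interface rather than on another team's implementation. All of the names (`PaymentGateway`, `OrderService`, `FakeGateway`) are hypothetical and chosen only for illustration.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The only thing the ordering component knows about payments."""

    def charge(self, customer_id: str, amount_cents: int) -> bool: ...


class OrderService:
    def __init__(self, gateway: PaymentGateway) -> None:
        # Depends on the interface, not on any concrete implementation.
        self._gateway = gateway

    def place_order(self, customer_id: str, amount_cents: int) -> str:
        paid = self._gateway.charge(customer_id, amount_cents)
        return "confirmed" if paid else "payment declined"


class FakeGateway:
    """Stand-in implementation; a real one could be swapped in unchanged."""

    def __init__(self, limit_cents: int) -> None:
        self._limit_cents = limit_cents

    def charge(self, customer_id: str, amount_cents: int) -> bool:
        return amount_cents <= self._limit_cents


orders = OrderService(FakeGateway(limit_cents=10_000))
print(orders.place_order("c-42", 2_500))
print(orders.place_order("c-42", 50_000))
```

Because the two components meet only at the `charge` method, the team owning payments can change its internals without a coordinated plan change for the team owning orders, which is the kind of independence the paragraphs above describe.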

Again, many digital companies build tremendously robust integrated services from the combination of many quasi-independent, microservice-based “product” teams, each serving a particular function. However, when a particular organizational objective requires changes to more than one such “product,” the need for cross-team coordination emerges. Someone needs to own this larger objective, even if its actual implementation is carried out across multiple distinct teams. We will discuss this further in Chapter 9.

8.7. Topics

8.7.1. Critical chain

Author Eli Goldratt in the book Critical Chain develops a sophisticated critique of project estimation and the dysfunctions it promotes.

In a project requiring contributions from multiple skilled resources, a common practice is to ask each person, “how long will this take you?” The project manager then works the resulting estimates into the overall project plan.

The problem with this is that most people will estimate their time conservatively; they will forecast a longer duration than they actually require. When all these “padded” estimates are added together, the project may be unacceptably long. The agreed work will tend to expand to fill the time available. Furthermore, most people will wait until the end of their window to perform their task: a person who asks for 3 weeks to perform one week of work will often not start until week 3, a pattern otherwise known as Student Syndrome.

One of the reasons that people estimate conservatively is that project managers tend to be quite concerned if committed tasks are not performed on time. Failure to make the “deliverable” by the committed date may result in negative feedback to the employee’s manager and subsequently result in poor performance reviews. When coupled with the above-cited drive to multi-tasking, these factors result in poor project performance, despite the array of modern project management techniques.

Goldratt suggested an alternate approach, in which the idea of “critical path” is enhanced with resource awareness. That is to say, the issue of timing and dependencies (itself a complex problem) is further enriched with the availability of resources to perform the work. (In general, the availability of assigned project resources is assumed, but this is not a wise assumption in project-centric environments.)

Estimation is handled more probabilistically, and the “critical chain” is the combination of the critical path plus the resource assigned to complete the most critical task. The theory is that a person performing such a task must be protected from distraction, and in fact, project managers must expand their tools to forecast effectively and plan the critical chain.

This leads to some complex math, in particular, a known problem called the Resource-Constrained Scheduling Problem. The fact that this problem is so notoriously difficult is indicative of the need for adaptive approaches; ultimately, rigorous analytic methods fail to cope with the complexity of such problems.
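A toy example shows why adding resource awareness changes the answer. The sketch below schedules the same five tasks twice: once honoring only dependencies, and once also requiring that a named person work on one task at a time. The task list, the names, and the greedy forward pass are assumptions made for illustration; this is a simple heuristic, not Goldratt's method and not a solver for the general scheduling problem.

```python
# Hypothetical tasks: name -> (duration in days, prerequisites, assigned person)
tasks = {
    "design":    (3, [], "ana"),
    "schema":    (4, ["design"], "dba"),
    "api":       (5, ["design"], "ana"),
    "migration": (3, ["design"], "dba"),
    "release":   (1, ["schema", "api", "migration"], "ana"),
}


def schedule(tasks, respect_resources):
    """Greedy forward pass; returns the finish day for each task.

    Tasks are visited in dictionary order, which is assumed to list
    prerequisites before the tasks that depend on them.
    """
    finish = {}
    free_at = {}  # person -> day they next become available
    for name, (duration, prereqs, person) in tasks.items():
        # A task cannot start before all of its prerequisites finish.
        start = max((finish[p] for p in prereqs), default=0)
        if respect_resources:
            # Nor before its assigned person is free.
            start = max(start, free_at.get(person, 0))
        finish[name] = start + duration
        free_at[person] = finish[name]
    return finish


path_only = schedule(tasks, respect_resources=False)
with_people = schedule(tasks, respect_resources=True)
print("Dependencies only:", path_only["release"], "days")
print("With named people:", with_people["release"], "days")
```

In this example the dependency-only plan finishes in 9 days, while the plan that accounts for the single database administrator finishes in 11, because two tasks that look parallel on paper compete for the same person.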

Craig Larman, in Scaling Lean and Agile Development, is sympathetic to the overall insights and goals of Critical Chain. However, with respect to the full-blown analytical approach it implies, he states:

“We have seen two very large official “project management TOC” adoption attempts (and heard of one more) in companies developing software-intensive embedded systems … The practice was clearly heavy, not agile, and not lean. In all three cases, the approach was eventually found cumbersome and not very effective, and was dropped.” [168]

8.7.2. The Agile project frameworks

As of this writing, a number of frameworks have been developed at the intersection of Agile and project management. Notable examples include:

Other Agile authors are skeptical of the need for such material [240].

8.8. Conclusion

This chapter is titled “Investments and projects,” and represents the middle ground between the foundation of the organizational structure and the day-to-day execution of work. Project management will likely remain a significant practice for digital professionals, although there are organizations that achieve significant results without it (i.e., using continuous flow approaches across fixed teams).

Investments in vendor relationships and the overall approach to tracking the financials of digital work also affect both the organization and its ongoing work execution. While Agile and related practices provide new insights and directions, there are fundamental and unchanging challenges to managing these areas.

8.8.1. Discussion questions

  • As a team, compare and discuss the costs of cloud services versus acquiring and running your own servers.

  • What experience do you or your team members have with project management? How effective did you find it?

  • Imagine yourself in an organization that recognized product management and work management, but had no concept of project management.

    • Discuss a scenario where project management would be a reasonable technique to introduce.

    • Discuss a scenario where project management would not make sense.

  • Do microservices/continuous delivery/DevOps render traditional PM obsolete? Discuss.

8.8.2. Research & practice

  • Review the marketing literature of the following companies. What do you understand of their products for IT financial management? Why would you need one?

    • Apptio

    • Nicus

  • Review all the text of a clickwrap license (e.g., when you upgrade a popular piece of software). Do you see anything surprising?

  • Find a free version of a Gartner Magic Quadrant, Forrester Wave, or similar analyst report. Study it. What was the incentive for the product company to make this report available to you? Does that make you suspicious of its conclusions? Why or why not?

  • Compare Microsoft Project with one of the following. What are the pros and cons?

    • Rally

    • Jira

    • VersionOne

    • Asana

    • LeanKit

  • Develop a feature or release plan for your product, using the Abbott 5-95 rule.

  • Compare and contrast this Cohn article in favor of project management, with these skeptical articles by Narayan, Arnold, and Memon.

9. Organization and culture

9.1. Introduction

In the last section, we introduced the AKF scaling cube, and we now start the second half of the book based on a related thought experiment. As your team-based company grew, you reached a crisis point in scaling your digital product. Your single team could no longer cope as one unit with the increasing complexity and operational demands. In AKF cube terms, you are scaling along the y-axis, the hardest but in some ways the most important dimension to know how to scale along.

You are going through a critical phase, the “team of teams” transition. You have increasingly specialized people delivering an increasingly complex product, or perhaps even several distinct products. Deep-skilled employees are great, but you’ve noticed they tend to lose the big picture. You are in constant discussions around the tension between functional depth versus product delivery. And when you go from one team to multiple, the topic of the organization must be formalized.

You often think about how your company should be structured. There is no shortage of opinions there either. From functional centers of excellence to cross-functional product teams, and from strictly hierarchical models to radical models like holacracy, there seems to be an infinite variety of choices.

A structure needs to be filled with the right people. How can you retain that startup feel you had previously, with things getting this big? You’ve always known intuitively that great hires are the basis for your company’s success, but now you need to think more systematically about hiring. Finally, the people you hire will create your company’s culture. Many of your employees and consultants emphasize the role of culture, but what do they mean? Is there such a thing as a “good” culture? How is one culture better than another?

Ultimately, as you move into leadership, you realize that your concern for organization and culture is all about creating the conditions for success. You can’t drive success as an individual any more; that is increasingly for others to do. All you can do is set the stage for their efforts, provide overall direction and vision, and let your teams do what they do best.

This chapter proceeds in a logical order, from operational organization forms, to populating them by hiring staff, to the hardest-to-change questions of culture.

9.1.1. Outline

  • IT organization versus product organization

  • Product and function

  • Defining the organization

  • Waterfall and functional organization

    • The continuum of organizational forms

    • From functions to components to shared services

  • IT human resource management

    • Basic concerns

    • Hiring

    • Allocation and tracking people’s time

    • Accountability and performance

  • Why culture matters

    • Motivation

    • Schneider and Westrum

    • Toyota Kata

9.1.2. Learning objectives

  • Identify and describe the factors driving an organization’s differentiation into a multiple-team structure

  • Identify and describe the organizational challenges and issues that result when a multiple-team structure is instituted

  • Distinguish between functional versus product organizations

  • Compare and contrast various organizational forms

  • Describe basic issues in digital hiring and human resource management

  • Identify and discuss various concepts and aspects of culture in digital organizations

9.2. IT versus product organization

In the early stages, you have to hire generalists who are both willing and able to take on dozens of tasks at once. Your developers will have to speak with potential customers; your accountants will have to give advice on product direction, and the natural salesman on your team will need to put the phone down a few hours a day and set up a new employee’s computer. This is the exciting, four-people-and-an-idea stage popularly associated with startups — but it doesn’t last very long… For a lot of your employees, growing out of this phase will be a welcome development: programmers don’t want to be in accounting meetings, and salespeople don’t want to sit in a dark, quiet room with the engineers. People have talents and skills they want to develop, and a healthy degree of specialization allows them to do that.
— Matt Blumberg
Scaling Up
Some of the most important factors that organizational structure can affect are communication, efficiency, standards, quality, and ownership.
— Abbott and Fisher

So, you are getting bigger now, and are no longer one single team. As the quote from Matt Blumberg’s Scaling Up indicates, you are becoming more specialized. Even as a cohesive, single team you had specialized support services. You needed legal and accounting advice to get the startup going. When you started hiring people, you needed HR and payroll. You also are buying things and paying bills, so you need bookkeeping, and you’ve got sales people and marketing people, and you need to support your customers and collect money they have promised to pay. At some point, you need an internal person whose daily job is money (they’ll become the CFO someday).

How are you going to organize? More importantly, how are you going to think about organizing? This is a hard question. It’s important to get organization right, and the question never goes away. As you grow and split into teams, the overall sense of mission that you felt as a single team is at risk of fading. How can you keep your “eyes on the prize” when there are multiple teams that all see the world slightly differently? It’s critical, as you start to explore various coordination mechanisms, to remember that the team remains the highest-value logical unit, and must be protected. As we discussed in Chapter 4, product value is created by co-located, cross-functional, highly collaborative teams.

In keeping with our emergence model, let’s assume you’ve been fairly ad hoc in your organizational structure up to now, doing your best to avoid specialization. Perhaps you’ve even been working as a collective. Nevertheless, you’ve needed a variety of skills to get this far in your journey: you are certainly not all Java programmers! A critical decision you will have to make: do you want a traditional “IT versus Business” structure or a product-based organization?

When I am in business meetings, I hear people talk about digital as a function or a role. It is not. Digital is a capability that needs to exist in every job. Twenty years ago, we broke e-commerce out into its own organization, and today e-commerce is just a part of the way we work. That’s where digital and IT are headed; IT will no longer be a distinct function, it will just be the way we work. …

…​we’ve moved to a flatter organizational model with “teams of teams” who are focused on outcomes. These are colocated groups of people who own a small, minimal viable product deliverable that they can produce in 90 days. The team focuses on one piece of work that they will own through its complete lifecycle…in [the “back office”] model, the CIO controls infrastructure, the network, storage, and makes the PCs run. The CIOs who choose to play that role will not be relevant for long… [122]
— Jim Fowler
General Electric Chief Information Officer

There are two major models that digital professionals may encounter in their career:

  • The traditional centralized back-office “IT” organization

  • Digital technology as a component of market-facing product management

filing cabinets
Figure 154. Paper filing system

The traditional IT organization started decades ago, with “back-office” goals like replacing file clerks and filing cabinets (see Paper filing system [69]) with faster and more accurate computers. (We will go into further detail in Chapter 11). At that time, computers were not flexible or reliable, business did not move as fast, and there was a lot of value to be gained in relatively simple efforts like converting massive paper filing systems into digital systems. As these kinds of efforts grew and became critical operational dependencies for companies, the role of Chief Information Officer was created, to head up increasingly large organizations of application developers, infrastructure engineers, and operations staff.

The business objectives for such organizations centered on stability and efficiency. Replacing 300 file clerks with a system that didn’t work, or that wound up costing more, was obviously not a good business outcome! On the other hand, it was generally accepted that designing and implementing these systems would take some time. They were complex, and many times the problem had never been solved before. New systems might take years, including delays, to come online, and while executives might be unhappy, oftentimes the competition wasn’t doing much better. CIOs were conditioned to be risk-averse; if systems were running, changing them was scrutinized with great care and rarely rushed.

The culture and practices of the modern IT organization became more established, and while it retained a reputation for being slow, expensive, and inflexible, no-one seemed to have any better ideas. It didn’t hurt that the end customer wasn’t interested in computers.

early Amazon appearance
Figure 155. Amazon early version

Then along came Apple, Microsoft, and the dot-com boom (see Amazon early version [70]). Suddenly everyone had personal computers at home and was on the Internet. Buying things! Computers continued to become more reliable and powerful as well. Companies realized that their back-office IT organizations were not able to move fast enough to keep up with the new e-commerce challenge, and in many cases organized their Internet team outside of the CIO’s control (which sometimes made the traditional IT organization very unhappy). Silicon Valley startups such as Google and Facebook in general did not even have a separate “CIO” organization, because for them (and this is a critical point) the digital systems were the product. Going to market against tough competitors (Alta Vista and Yahoo against Google, Friendster and MySpace against Facebook) wasn’t a question of maximizing efficiency. It was about product innovation and effectiveness and taking appropriate risks in the quest for these rapidly growing new markets.

Let’s go back to our example of the traditional CIO organization. A typical structure under the CIO might look as shown in Classic IT organization.

org chart
Figure 156. Classic IT organization

(We had some related discussion in Chapter 6). Such a structure was perceived to be “efficient” because all the server engineers would be in one organization, while all the Java developers would be in another, and their utilization could be managed for efficiency. Overall, having all the “IT” people together was also considered efficient, and the general idea was that “the business” (Sales, Marketing, Operations, and back-office functions like Finance and HR) would define their “requirements” and the IT organization would deliver systems in response. It was believed that organizing into “centers of excellence” (sometimes called organizing by function) would make the practices of each center more and more effective, and therefore more valuable to the organization as a whole.

However, the new digital organizations perceived that there was too much friction between the different functions on the organization chart. Skepticism also started to emerge that “centers of excellence” were living up to their promise. Instead, what was too often seen was the emergence of an “us versus them” mentality, as developers argued with the server and network engineers.

One of the first companies to try a completely different approach was Intuit. As Intuit started selling its products increasingly as services, it re-organized and assigned individual infrastructure contributors, e.g., storage engineers and database administrators, to the product teams with which they worked [2 p. 103].

org chart
Figure 157. New IT organization

This model is also called the “Spotify model” (see New IT organization). The dotted line boxes (Developers, Quality Assurance, Engineering) are no longer dedicated “centers of excellence” with executives leading them. Instead, they are lighter-weight “communities of interest” organized into chapters and guilds. The cross-functional product teams are the primary way work is organized and understood, and the communities of interest play a supporting role. Henrik Kniberg provided one of the first descriptions of how Spotify organizes along product lines [159]. (Attentive readers will ask, “What happened to the PMO? And what about security?” There are various answers to these questions, which we will continue to explore in Part III.)

The consequences of this transition in organizational style are still being felt and debated. Sriram Narayan is in general an advocate of product organization. However, in his book Agile IT Organization Design, he points out that “IT work is labor-intensive and highly specialized,” and therefore managing IT talent is a particular organizational capability it may not make sense to distribute [195]. Furthermore, he observes that IT work is performed on medium to long time scales, and “IT culture” differs from “business culture,” concluding that “although a merger of business and IT is desirable, for the vast majority of big organizations it isn’t going to happen anytime soon.”

Conversely, Abbott and Fisher in The Art of Scalability argue that “… The difference in mindset, organization, metrics, and approach between the IT and product models is vast. Corporate technology governance tends to significantly slow time to market for critical projects … IT mindsets are great for internal technology development, but disastrous for external product development” [2 pp. 122-124]. However, it is possible that Abbott and Fisher are overlooking the decline of traditional IT. Hybrid models exist, with “product” teams reporting up under “business” executives, and the CIO still controlling the delivery staff who may be co-located with those teams. We’ll discuss the alternative models in more detail below.

Conway’s Law

So who was Conway and why is his law so important as we move to a team of teams? Melvin Conway is a computer programmer who worked on early compilers and programming languages. In 1967 he proposed the thesis that:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure [70].

What does this mean? If we establish two teams, each team will build a piece of functionality (a feature or component). They will think in terms of “our stuff” and “their stuff” and the interactions (or interface) between the two. Perhaps this seems obvious, but as you scale up, it’s critical to keep in mind. In particular, as you segment your organization along the AKF y-axis, you will need to keep in mind the difference between features and components. You are on a path to have dozens or hundreds of such teams. The decisions you make today on how to divide functionality and work will determine your operating model far into the future.

Ultimately, Conway’s law tells us that to design a product is also to design an organization and vice versa. This is important for designers and architects to remember.

9.3. Defining the organization

There are many different ways we can apply these ideas of traditional functional organizing versus product-oriented organizing, and features versus components. How does one begin to decide these questions? As a digital professional in a scaling organization, you need to be able to lead these conversations. The cross-functional, diverse, collaborative team is a key unit of value in the digital enterprise, and its performance needs to be nurtured and protected.

Pigs versus chickens in Scrum

A well known theme from Scrum is the idea of “pigs versus chickens.” It comes from a (not very funny) joke:

A chicken says to a pig, “Hey pig, let’s start a restaurant.”

Pig: “I’m not sure, what would we call it?”

Chicken: “How about, 'Ham and Eggs'?”

Pig: “No thanks, I’d be committed, but you’d only be involved.”

Regardless of the humor, there’s an important point here: when teams are fully committed, fully accountable for delivering value, the value they deliver increases. When work is flowing across functional areas, each area is just “involved,” and only the project manager may be “committed” (if he or she even is). When we flip the organization to a value-centric model, multi-tasking decreases and commitment increases.

We discuss this here in the team of teams section, as the assumption in Part I and Part II is that everyone was committed because it was a small, single-team organization.

Abbott and Fisher suggest the following criteria when considering organizational structures [2 p. 12]:

  • How easily can I add or remove people to/from this organization? Do I need to add them in groups, or can I add individual people?

  • Does the organizational structure help or hinder the development of metrics that will help measure productivity?

  • Does the organizational structure allow teams to own goals and feel empowered and capable of meeting them?

  • Which types of conflict will arise, and will that conflict help or hinder the mission of the organization?

  • How does this organizational structure help or hinder innovation within my company?

  • Does this organizational structure help or hinder the time to market for my products?

  • How does this organizational structure increase or decrease the cost per unit of value created?

  • Does work flow easily through the organization, or is it easily contained within a portion of the organization?

9.3.1. Team persistence

The traditional approach is that we have work and we bring “the right people” to the work. To be nimble, to have organizational agility, we have to have great teams of people and bring the work to the team [93].
— Joseph Flahiff

Team persistence is a key question. The practice in project-centric organizations has been temporary teams that are created and broken down on an annual or more frequent basis. People “rolled on” and “rolled off” of projects regularly in the common heavyweight project management model. Often, contention for resources resulted in fractional project allocation, as in “you are 50% on Project A and 25% on Project B,” which could be challenging for individual contributors to manage. With team members constantly coming and going, developing a deep, collective understanding of the work was difficult. Hard problems benefit from team stability. Teams develop both a deeper rational understanding of the problem and emotional assets such as psychological safety. Both are disrupted when even one person on a team changes. Persistent teams of committed individuals also (in theory) reduce destructive multi-tasking and context-switching.

Author’s note

As of fall 2016, the trend towards persistent, product-centric teams and away from continually changing project-based teams is noticeable in the Minneapolis/St. Paul area. A related phenomenon is multiple anecdotes of companies abandoning Project Management Offices. --Charles Betz

9.4. Product and function

Even where they are not part of a value stream, activity-oriented teams tend to standardize their operations over time. Their appetite for offering custom solutions begins to diminish. Complaints begin to surface—“They threw the rule book at us,” “What bureaucracy!”
— Sriram Narayan
Agile IT Organization Design
When teams are aligned by services, are autonomous, and are cross-functionally composed, there is a significant decrease in affective conflict. When team members are aligned with shared goals and no longer need to argue about who is responsible or who should perform certain tasks, the team wins or loses together. Everyone on the team is responsible for ensuring the service provided meets the business goals.
— Abbott and Fisher

By this time, you probably detect that there is a fundamental tension between functional specialization and end-to-end value delivery. The above two quotes reflect this tension: specialist teams tend to start identifying with their specialty and not the overall mission. The tension may go by different names:

  • Product versus function

  • Value stream versus activity

  • Process versus silo

As we saw previously, there are three major concepts used to achieve an end-to-end flow across functional specialties:

  • Product

  • Project

  • Process

These are not mutually exclusive models and may interact with each other in complex ways. (See the scaling discussion in the Part III introduction).

9.4.1. Waterfall and functional organization

For example, some manufacturing can be represented as a very simple, sequential process model (see Simple sequential manufacturing).

manufacturing sequence
Figure 158. Simple sequential manufacturing

The product is already defined, and the need to generate information (i.e. through feedback) is at an absolute minimum. NOTE: Even in this simplest model, feedback is important. Much of the evolution of 20th century manufacturing has been in challenging this naive, open-loop model. (Remember our brief discussion of open-loop?) The original, open-loop waterfall model of IT systems implementation (see Waterfall) was arguably based on just such a naive concept.

Figure 159. Waterfall

(Review Chapter 3 on waterfall development and Agile history.) Functional, or practice, areas can continually increase their efficiency and economies of scale through deep specialization.

What is a “practice”?

A “practice” is synonymous with a “discipline”: it is a set of interrelated precepts, concerns, and techniques, often with a distinct professional identity. “Java programming,” “security,” or “capacity management” are practices. When an organization is closely identified with a practice, it tends to act as a functional silo (more on this to come). For example, in a traditional IT organization, the Java developers might be a separate team from the HTML, CSS and JavaScript specialists. The database administrators might have their own team, as might the architects, business analysts, and quality assurance groups. Each practice or functional group develops a strong professional identity as the custodians of “best practices” in their area. They may also develop a strong set of criteria for when they will accept work, which tends to slow down product discovery.

There are two primary disadvantages to the model of projects flowing in a waterfall sequence across functional areas:

  • It discourages closed-loop feedback

  • There is transactional friction at each handoff

Go back and review: the waterfall model falls into the “original sin” of IT management, confusing production with product development. As a repeatable production model, it may work, assuming that there is little or no information left to generate regarding the production process (an increasingly questionable assumption in and of itself). But when applied to product development, where the primary goal is the experiment-driven generation of information, the model is inappropriate and has led to innumerable failures. This includes software development, and even implementing purchased packages in complex environments.

9.4.2. The continuum of organizational forms

The following discussion and accompanying set of diagrams is derived from Preston Smith and Don Reinertsen’s thought regarding this problem in Developing Products in Half the Time [255] and Managing the Design Factory [220]. Similar discussions are found in the Guide to the Project Management Body of Knowledge [215] and Abbott and Fisher’s The Art of Scalability [2].

There is a spectrum of alternatives in structuring organizations for flow across functional concerns. First, a lightweight “matrix” project structure may be implemented, in which the project manager has limited power to influence the activity-based work, where people sit, etc. (see Lightweight project management across functions).

Figure 160. Lightweight project management across functions

Work flows across the functions, perhaps called “centers of excellence,” and there may be contention for resources within each center. Often, simple “first in, first out” queuing approaches are used to manage the ticketed work, rather than more sophisticated approaches such as cost of delay. It is the above model that Reinertsen was thinking of when he said: “The danger in using specialists lies in their low involvement in individual projects and the multitude of tasks competing for their time.” Traditional infrastructure and operations organizations, when they implemented defined service catalogs, can be seen as attempting this model. (More on this in the discussion of ITIL and shared services.)
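
To make the contrast between “first in, first out” queuing and cost of delay concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method prescribed by the sources cited here: each ticket carries an estimated cost of delay per week and a duration in weeks (hypothetical fields with made-up figures), and a single specialist team works one ticket at a time. The sketch compares the delay cost accrued under arrival order with the cost accrued when tickets are ordered by cost of delay divided by duration, a heuristic Reinertsen describes as weighted shortest job first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    name: str                      # hypothetical ticket label
    cost_of_delay_per_week: float  # assumed estimate, in dollars per week
    duration_weeks: float          # assumed effort for the specialist team


def total_delay_cost(queue):
    """Sum the delay cost each ticket accrues until it is finished,
    assuming one ticket is worked at a time, in queue order."""
    elapsed = 0.0
    total = 0.0
    for ticket in queue:
        elapsed += ticket.duration_weeks
        total += ticket.cost_of_delay_per_week * elapsed
    return total


def by_cost_of_delay(queue):
    """Order by cost of delay divided by duration, highest first."""
    return sorted(
        queue,
        key=lambda t: t.cost_of_delay_per_week / t.duration_weeks,
        reverse=True,
    )


# Tickets in arrival order (illustrative numbers only).
arrivals = [
    Ticket("refresh test servers", 1_000, 4),
    Ticket("fix checkout outage risk", 10_000, 1),
    Ticket("new reporting database", 3_000, 2),
]

fifo_cost = total_delay_cost(arrivals)
weighted_cost = total_delay_cost(by_cost_of_delay(arrivals))
print(f"FIFO delay cost:     ${fifo_cost:,.0f}")
print(f"Weighted delay cost: ${weighted_cost:,.0f}")
```

With these made-up numbers, working the tickets in arrival order accrues $75,000 of delay cost, while the weighted order accrues $26,000. The total work is the same in both cases; only the sequence changes.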

Second, a heavyweight project structure may specify much more, including dedicated time assignment, modes of work, standards, and so forth (see Heavyweight project management across functions). The vertical functional manager may be little more than a resource manager, but does still have reporting authority over the team member and crucially still writes their annual performance evaluation (if the organization still uses those). This has been the most frequent operating model in the traditional CIO organization.

Figure 161. Heavyweight project management across functions

If even more focus is needed (the now-minimized influence of the functional areas is still deemed too strong), the organization may move to completely product-based reporting (see Product team, virtual functions). With this, the team member reports to the product owner. There may still be communities of interest (Spotify guilds and tribes are good examples) and there may still be standards for technical choices.

Figure 162. Product team, virtual functions

Finally, in the skunkworks model, all functional influence is deliberately blocked, as distracting or destructive to the product team’s success (see Skunkworks model).

Figure 163. Skunkworks model

The product team has complete autonomy and can move at great speed. It is also free to:

  • re-invent the wheel, developing new solutions to old and well-understood problems;

  • bring in new components on a whim (regardless of whether they are truly necessary), adding to sourcing and long-term support complexity; and

  • ignore safety and security standards, resulting in risk and expensive retrofits.

Early e-commerce sites were often set up as skunkworks to keep the interference of the traditional CIO to a minimum, and this was arguably necessary. However, ultimately, skunkworks is not scalable. Research by the Corporate Executive Board suggests that “Once more than about 15% of projects go through the fast [skunkworks] team, productivity starts to fall away dramatically.” It also causes issues with morale, as a two-tier organization starts to emerge with elite and non-elite segments [110].

Because of these issues, Don Reinertsen observes that “Companies that experiment with autonomous teams learn their lessons, and conclude that the disadvantages are significant. Then they try to combine the advantages of the functional form with those of the autonomous team” [220].

The Agile movement is an important correction to dominant IT management approaches employing open-loop delivery across centralized functional centers of excellence. However, the ultimate extreme of the skunkworks approach cannot be the basis for organization across the enterprise. While functionally specialized organizations have their challenges, they do promote understanding and common standards for technical areas. In a product-centric organization, communities of interest or practice provide an important counterbalancing platform for coordination strategies to maintain common understandings.

9.4.3. Scaling the product organization

The functional organization scales well. Just keep hiring more Java programmers, or DBAs, or security engineers and assign them to projects as needed. However, scaling product organizations requires more thought. The most advanced thinking in this area is found in the work of Scrum authors such as Ken Schwaber, Mike Cohn, Craig Larman and Roman Pichler. Scrum, as we have discussed, is a strict, prescriptive framework calling for self-managing teams with three roles:

  • Product owner

  • Scrum master

  • Team member

Figure 164. Product owner hierarchy

Let’s accept Scrum and the 2-pizza team as our organizing approach. A large-scale Scrum effort is based on multiple small teams, e.g., representing AKF scaling cube partitions (see Product owner hierarchy; [71 p. 12], [239]). If we want to minimize multi-tasking and context-switching, we need to ask, “How many product teams can a given product owner handle?” In Agile Product Management with Scrum, Roman Pichler says, “My experience suggests that a product owner usually cannot look after more than two teams in a sustainable manner” [211 p. 12]. Scrum authors, therefore, suggest that larger-scale products be managed as aggregates of smaller teams. We’ll discuss how the product structure is defined in Chapter 8.

9.4.4. From functions to components to shared services

We have previously discussed feature versus component teams. As a reminder, features are functional aspects of software (things people find directly valuable) while components are how software is organized (e.g., shared services and platforms such as data management).

As an organization grows, we see both the feature and component sides scale. Feature teams start to diverge into multiple products, while component teams continue to grow in the guise of shared services and platforms. Their concerns continue to differentiate, and communication friction may start to emerge between the teams. How an organization handles this is critical.

In a world of digital products delivered as services, both feature and component teams may be the recipients of ongoing investment. An ongoing objection in discussions of Agile is, “We can’t put a specialist on every team!” This objection reflects the increasing depth of specialization seen in the evolving digital organization. Ultimately, it seems there are two alternatives to handling deep functional specialization in the modern digital organization:

  • Split it across teams

  • Turn it into an internal product

We’ve discussed the first option above (split the specialty across teams). For the second option, consider the traditional role of server engineer (a common infrastructure function). Such engineers have historically had a consultative, order-taking relationship with application teams:

  1. An application team would identify a need for computing capacity (“we need four servers”)

  2. The infrastructure engineers would get involved and provide recommendations on make, model, and capacity

  3. Physical servers would be acquired, perhaps after some debate and further approval cycles

Such processes might take months to complete, and often caused dissatisfaction. With the rise of cloud computing, however, we see the transition from a consultative, order-taking model to an automated, always-on, self-service model. Infrastructure organizations move their activities from consulting on particular applications to designing and sustaining the shared, self-service platform. At that point, are they a function or a product?

9.5. Final thoughts on organization forms

Formal organizational structures determine, to a great extent, how work gets done. But enterprise value requires that organizational units, whether product or functional, collaborate and coordinate effectively. Communications structures and interfaces, as covered in Chapter 7, are therefore an essential part of organizational design.

And of course, an empty structure is meaningless. You need to fill it with real people, which brings us to the topic of human resource management in the digital organization.

9.6. IT human resource management

Now that you have decided, for now, on an organization structure, you need to put people into it. As you scale, hiring people (like managing money) becomes a practice requiring formalization, and you will doubtless need to hire your first HR professional soon, in order to stay compliant with applicable laws and regulations.

9.6.1. Basic concerns

Human resource management can also be termed “people management.” It is distinct from supply chain and technology management, being concerned with the identification, recruitment, onboarding, and ongoing development of individuals, and with their eventual exit from the organization. This brief section covers the topic as it relates to digital management, incorporating recent cases and perspectives.

9.6.2. Hiring

Hiring is one of the most important things a software organization does. Every good hire accelerates your organization; every poor hire is a drag on your organization.
— Sean Landis
Agile Hiring
In procedural work, the best are 2x better than the average. In creative/inventive work, the best are 10x better than the average, so [we place a] huge premium on creating effective teams of the best.
— Reed Hastings
Netflix Culture: Freedom and Responsibility

Here is a typical hiring process:

  • Solicit candidates, through various channels such as job boards and recruiters

  • Review resumes and narrow candidate pool down for phone interviews

  • Conduct phone interviews and narrow candidate pool down for in-person interviews

  • Conduct in-person interviews, identify candidates for offers

  • Make offer, negotiate to acceptance

  • Hire and onboard

Your organization has been hiring people for some time now. It’s always been one of your most important decisions, but you have reached the point where a more formal and explicit understanding of how you do this is essential.

The costs of a toxic hire

Consider: Recent research by Michael Housman and Dylan Minor suggests that while the benefit from hiring a highly qualified “superstar” worker is at most $5,303, the cost of hiring a “toxic” worker (one destructive of morale and team norms) averages $12,489, certainly a risk to consider [125].

First, why do you hire new staff? How and when do you perceive a need? It is well established that increasing the size of a team can slow it down. Legendary software engineer Fred Brooks, in his work The Mythical Man-Month, identified the pattern that “adding more people to a late project makes it later” [37].

Are you adding people because of a perceived need for specialist skills? While this will always be a reason for new hires, many argue in favor of “T-shaped” people — people who are deep in one skill, and broad in others. Hiring new staff has an impact on culture — is it better to train from within, or source externally?

Second, how are you hiring staff? In a traditional, functionally specialized model, someone in a Human Resources organization acts as an intermediary between the hiring manager and the job applicant (sometimes with a recruiter also in the mix between the company and the applicant). Detailed job descriptions are developed, and applicants not explicitly matching the selection criteria (e.g., in their resume) are not invited for an interview.

Such practices do not necessarily result in good outcomes. Applicants routinely tailor resumes to the job description. In some cases, they have been known to copy the job description into invisible sections of their resume so that they are guaranteed a “match” if automated resume-scanning software is used.

A compelling case study of the limitations of traditional HR-driven hiring is discussed by Robert Sutton and Huggy Rao in Scaling up Excellence: Getting to More without Settling for Less [263]. The authors describe the company Lotus Software, one of the pioneers of early desktop computing.

With [company founder] Kapor’s permission, [head of organizational development] Klein pulled together the resumes of the first forty Lotus employees ... [and] submitted all forty resumes to the Lotus human resources department. Not one of the forty applicants, including Kapor, was invited for a job interview. The founders had built a world that rejected people like them.

Sean Landis, author of Agile Hiring [242], believes that “accepted hiring wisdom is not very effective when applied to software professionals.” He further states that:

  • very few companies hire well;

  • individuals with deep domain knowledge are in the best position to perform great hiring;

  • companies often focus on the wrong candidates; and

  • it is important to track metrics on the cost and effectiveness of hiring practices.

In short, hiring is one of the most important decisions the digital enterprise makes, and it cannot be reduced to a simple process to be executed mechanically. Requiring senior technical talent to interview candidates may result in improved hiring decisions. However, such requirements add to the overall work demands placed on these individuals.

9.6.3. Process as skill

Sometimes new employees come in expecting that you are following certain processes. This is in part because “process” experience can be an important part of an employee’s career background. A skilled HR manager may consider their experience with large-scale enterprise hiring processes to be a major part of their qualifications for a position in your company.

This applies to both “business” and “IT” processes. In fact, in the digital world, there is no real difference. Digital processes:

  • Initiate new systems, from idea to construction

  • Publicize and grant access to the new systems

  • Capture revenue from the systems

  • Support people in their interactions with the systems

  • Fix the systems when they break

  • Improve the systems based on stakeholder feedback

It’s not clear which of these are “IT” versus “business” processes. But they are definitely processes. Some of them are more predictable, some less so, but they all represent some form of ordered work that is repeatable to some degree. And to some extent, you may be seeking people with experience defined at least in part by their exposure to processes.

9.6.4. Allocating and tracking people’s time

Figure 165. Time clock and punch cards

When a new hire enters your organization, they enter a complex system that will structure and direct their daily activities through a myriad of means. The various means that direct their action include:

  • Team assignment (e.g., to an ongoing product)

  • Project assignment

  • Process responsibilities

Notice again the appearance of the "3 Ps."

Product, project, and process become challenging when they are all allowed to generate demand on individuals independently of each other. In the worst case scenario, the same individual winds up with:

  • Collaborative team responsibilities

  • “Fractional” allocation to one or more projects

  • Ticketed process responsibilities

Fractional allocation is the practice of funding individuals by assigning them to a project for some percentage of their time. For example, a server engineer might be allocated 25% to one project for 6 months to define its infrastructure architecture, while being assigned 30% to another project to refresh obsolete infrastructure.

When demand is uncoordinated, these multiple channels can result in multi-tasking and dramatic overburden, and in the worst case, the individual becomes the constraint to enterprise value. Project managers expect deliverables on time, and too often have no visibility into operational concerns (e.g., outages) that may affect the ability of staff to deliver. Ad hoc requests “smaller than a project, bigger than a ticket” further complicate matters.
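
The overburden created by independent demand channels can be shown with simple arithmetic. The sketch below is a hypothetical illustration: it reuses the 25% and 30% project allocations from the server engineer example, and adds assumed percentages for product team and ticketed process work. The channel names and the two added figures are assumptions, not data from any real system.

```python
from collections import defaultdict

# (person, demand channel, percent of a full-time week)
# The two project figures come from the server engineer example;
# the team and ticket figures are assumptions for illustration.
demands = [
    ("server engineer", "project: infrastructure architecture", 25),
    ("server engineer", "project: infrastructure refresh", 30),
    ("server engineer", "product team responsibilities", 40),
    ("server engineer", "ticketed process work", 20),
]


def total_load(demands):
    """Add up the percentage demand placed on each person across all channels."""
    load = defaultdict(int)
    for person, _channel, percent in demands:
        load[person] += percent
    return dict(load)


def overburdened(demands, capacity=100):
    """Return the people whose combined demand exceeds their capacity."""
    return {
        person: total
        for person, total in total_load(demands).items()
        if total > capacity
    }


print(overburdened(demands))
```

No single channel looks unreasonable here, yet together they ask for 115% of the engineer's week. The point of the sketch is that the overload is visible only when all three kinds of demand are counted in one place.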

The Phoenix Project presents an effective and realistic dramatization of the resulting challenges. Work is entering the system through multiple channels, and the overburden on key individuals (such as Brent, the lead systems engineer) has reached crisis proportions. Through a variety of mechanisms, the protagonists take control of the demand channels and greatly improve the organization’s success. One of the most important lessons is well articulated by Erik, the mentor:

“Your job as VP of IT Operations is to ensure the fast, predictable, and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work, so you can provide stable, predictable, and secure IT service ... You must figure out how to control the release of work into IT Operations and, more importantly, ensure that your most constrained resources are doing only the work that serves the goal of the entire system, not just one silo.” [153 p. 91]

In order to understand the work, measuring the consumption of people’s time is important. There are various time tracking approaches:

  • Simple allocation of staff payroll to product or organizational line

  • Project management systems (sometimes these are used for weekly time tracking, even for staff that are not assigned to projects — in such cases, placeholder operational projects are created)

  • Human Resource Management systems

  • Ticketing/workflow systems — advanced systems, such as those found in the Professional Services Automation sector, track time when tickets are in an “open” status.

  • Backlog management systems (that may seem similar to ticketing systems)

  • Home built systems

There is little industry consensus on best practices here. There are reasonable concerns about the burden of time tracking on employees, and about poor data quality resulting from employees attempting to “code” activities when summarizing their time on a weekly or bi-weekly basis [72].

9.6.5. Accountability and performance

When individuals in a system are forced to compete with others’ accomplishments, this increases the challenge of effective collaboration. Clear, transparent communication is not perceived as valuable to the individual, as having information can impact rewards, career advancement, and even whether an individual has a job.
— Jennifer Davis and Katherine Daniels
Effective DevOps

Regardless of whether the company is a modern digital enterprise or more traditional in its approach, the commitment, performance, and results of employees are a critical concern. The traditional approach to managing this has been an annual review cycle, resulting in a performance ranking from 1 to 5:

  1. Did not meet expectations

  2. Partially met expectations

  3. Met expectations

  4. Exceeded expectations

  5. Significantly exceeded expectations

This annual rating determines the employee’s compensation and career prospects in the organization. Some companies, notably GE and Microsoft, have attempted “stack rankings” in which the “bottom” 10% (or more) of performers must be terminated. As the Davis and Daniels quote above indicates, such practices are terribly destructive of psychological safety and therefore team cohesion. High-profile practitioners are therefore moving away from this practice [39], [201].

The traditional annual review is a large “batch” of feedback to the employee, and therefore ineffective in terms of systems theory, not much better than an open-loop approach. Because of the weaknesses of such slow feedback (not to mention the large annual costs, expensive infrastructure, and opportunity costs of the time spent), companies are experimenting with other approaches.

Deloitte Consulting, as reported in the Harvard Business Review [41], realized that its annual performance review process was consuming two million hours of time annually, and yet was not delivering the needed value. In particular, ratings were suffering from the measurable flaw that they tended to reveal more about the person doing the rating, than the person being rated!

Deloitte started by redefining the goals of the performance management system: to identify and reward performance accurately, and to fuel further improvement.

A new approach with greater statistical validity was implemented, based on four key questions:

  • Given what I know of this person’s performance, and if it were my money, I would award this person the highest possible compensation increase and bonus

  • Given what I know of this person’s performance, I would always want him or her on my team

  • This person is at risk for low performance

  • This person is ready for promotion today

In terms of the frequency of performance check-ins, they note:

the best team leaders ... conduct regular check-ins with each team member about near term work ... to set expectations for the upcoming week, review priorities, comment on recent work and provide course correction, coaching, or important new information ... If a leader checks in less often than once a week, the team member’s priorities may become vague ... the conversation will shift from coaching for near term work to giving feedback about past performance ... If you want people to talk about how to do their best work in the near future, they need to talk often ...

Sutton and Rao, in Scaling up Excellence, discuss the similar case of Adobe. At Adobe, “annual reviews required 80,000 hours of time from the 2,000 managers at Adobe each year, the equivalent of 40 full-time employees. After all that effort, internal surveys revealed that employees felt less inspired and motivated afterwards— and turnover increased.” Because of such costs and poor results, Adobe scrapped the entire performance management system in favor of a “check-in” approach. In this approach, managers are expected to have regular conversations about performance with employees and are given much more say in salaries and merit increases. The managers themselves are evaluated through random “pulse surveys” that measure how well each manager “sets expectations, gives and receives feedback, and helps people with their growth and development.” [263 p. 113].
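
The Adobe figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted above; the variable names are illustrative, and the reading of the results is an interpretation rather than something stated by Sutton and Rao.

```python
# Figures quoted from Sutton and Rao's account of Adobe.
review_hours_per_year = 80_000
managers = 2_000
full_time_equivalents = 40

# Average review burden carried by each manager.
hours_per_manager = review_hours_per_year / managers

# Working hours per year implied by the "40 full-time employees" comparison.
implied_hours_per_fte = review_hours_per_year / full_time_equivalents

print(f"Hours per manager per year: {hours_per_manager:.0f}")
print(f"Implied working hours per full-time employee: {implied_hours_per_fte:.0f}")
```

The result is 40 hours per manager per year, roughly one working week each, and an implied 2,000 working hours per full-time employee, which is consistent with the common rule of thumb of about 50 weeks at 40 hours.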

Whether incentives (e.g., pay raises) should be awarded individually or on a team basis is an ongoing topic of discussion in the industry. Results often derive from team performance, and the contributions of any one individual can be difficult to identify. Because of this, Scrum pioneer Ken Schwaber argues that “The majority of the enterprise’s bonus and incentive funds need to be allocated based on the team’s performance rather than the individual’s performance” [239 p. 6]. However, this runs into another problem: that of the “free rider.” What do we do about team members who do not pull their weight? Even in self-organizing teams, confronting someone about their behavior is not something people do willingly, or well.

Ideally, teams will self-police, but this becomes less effective with scale. In one case study in the Harvard Business Review, Continental Airlines found that the free rider problem was less of a concern when metrics were clearly correlated with team activity. In their case, the efforts and cooperation of gate teams had a significant influence on On-Time Arrival and Departure metrics, which could then be used as the basis for incentives [156].

Ultimately, both individuals and teams need coaching and direction. Team-level management and incentives must still be supplemented with some feedback loops centering on the individual. Perhaps this feedback is not compensation-based, but the organization must still identify individuals with leadership potential and deal with free riders and toxic individuals.

Observed behaviors are a useful focus. Sean Landis describes the difference between behaviors and skills thus:

Two things make good leaders: behaviors and skills. If you focus on behaviors in your hiring of developers, they will be predisposed for leadership success. The hired candidate may walk in the door with the skills necessary to lead or not. If not, skills are easy to acquire through training and mentoring. People can acquire or modify behaviors, but it is much harder than skill development. Hire for behaviors and train the leadership skills. [242]

He further provides many examples of behaviors, such as:

  • Adaptable

  • Accountable

  • Initiative Taker

  • Optimistic

  • Relational

Many executives and military leaders have identified the central importance of hiring decisions. In large, complex organizations, choosing the right people is the most powerful lever a leader has to drive organizational performance. The organizational context these new hires find themselves in will profoundly affect them and the results of their efforts.

9.7. Why culture matters

What is culture in this context? It is not so much about an informal dress code, flexible hours, or a free in-house cafeteria as it is about how decisions are taken, norms of behavior, protocols of communication, and the ways of navigating hierarchy and bureaucracy to get things done.
— Sriram Narayan
Agile IT Organization Design

“Culture” is a difficult term to define, and even more difficult to characterize across large organizations. It starts with how an organization is formally structured, because structure is, in part, a set of expectations around how information flows. “Who talks to who, when and why” is in a sense culture. Culture can also be seen embedded in artifacts like processes and formally specified operating models.

But “culture” has additional, less tangible meanings. The anecdotes executives choose to repeat are culture. Whether an organization tacitly condones being 5 minutes late for meetings (because walk time in large facilities is expected) or has little tolerance for this (because most people dial in) is culture. The degree of deference shown to senior executives, and their opinions, is culture. Whether a junior person dares to hit “reply-all” on an email including her boss’s boss is culture. Organizational tolerance for competitive or toxic behavior is culture.

Culture cannot be directly changed; it is better seen as a lagging indicator that changes in response to specific practical interventions. Even tools and processes can change the culture, if they are judiciously chosen (most tools and processes do not have this effect). Skeptical? Consider the impact that computers (a tool) have had on culture. Or email.

We’ve already touched on culture in the chapter 4 discussion of team formation. These themes of psychological safety, equal collaboration, emotional awareness, and diversity inform our further discussions. We’ll look at culture from a few additional perspectives in this section:

  • Motivation

  • Schneider matrix

  • The Westrum typology

  • Mike Rother’s research into Toyota’s improvement and coaching “katas”

9.7.1. Motivation

One of the most important reasons to be concerned about culture is its effect on motivation. There is little doubt that a more motivated team performs better than an unmotivated, “going through the motions” organization. But what motivates people?

One of the oldest discussions of culture is Douglas McGregor’s idea of “Theory X” versus “Theory Y” organizations, which he developed in the 1960s at the Massachusetts Institute of Technology.

“Theory X” organizations rely on extrinsic motivators and operate on the assumption that workers must be cajoled and punished in order to produce results. We see Theory X approaches when organizations focus on pay scales, bonuses, titles, awards, writeups/demerits, performance appraisals, and the like.

Theory Y organizations operate on the assumption that most people seek meaningful work intrinsically and that they have the ability to solve problems in creative ways that do not require tight standardization. According to Theory Y, people can be trusted and should be treated as mature individuals, in contrast to the distrust inherent in Theory X.

Related to Theory Y, in terms of intrinsic motivation, Daniel Pink, the author of Drive, suggests that three concepts are key: autonomy, mastery, and purpose. If these three qualities are experienced by individuals and teams, they will be more likely to feel motivated and collaborate more effectively.

9.7.2. Schneider and Westrum

One model for understanding culture is the matrix proposed by William Schneider (see Schneider matrix) [73]:

Figure 166. Schneider matrix

Two dimensions are proposed:

  • the extent to which the culture is focused on the company or the individual

  • the extent to which the company is “possibility-oriented” versus “reality-oriented”

This is not a neutral matrix. It’s not clear that highly controlling cultures are ever truly effective. Even in the military, which is generally assumed to be the ultimate “command and control” culture, there are notable case studies of increased performance when more empowering approaches were encouraged.

Is the military "command and control"?

Military commanders realized as long ago as the Napoleonic wars that denying soldiers and commanders autonomy in the field was a good way to lose battles. Even in peacetime operations, forward-thinking military commanders continue to focus on “what, not how.”

In Turn the Ship Around: A True Story of Turning Followers Into Leaders, Captain L. David Marquet discusses moving from a command-driven to an outcome-driven model, and the beneficial results it had on the USS Santa Fe [180]. Similar themes appear in Captain D. Michael Abrashoff’s It’s Your Ship: Management Techniques from the Best Damn Ship in the Navy [3].

Neither of these accounts is surprising when one considers the more sophisticated aspects of military doctrine. Don Reinertsen provides a rigorous overview in chapter 9 of Principles of Product Development Flow. In this discussion, he notes that the military has been experimenting with centralized versus decentralized control for centuries. Modern warfighting relies on autonomous, self-directed teams that may be out of touch with central command and required to improvise effectively to achieve the mission. Therefore, military orders are incomplete without a statement of “commander’s intent,” the ultimate outcome of the mission [221 pp. 243-265]. Military leaders are concerned with pathological “toxic command,” which is just as destructive in the military as anywhere else [276].

Similar to the Schneider matrix is the Westrum typology, which proposes that there are three major types of culture:

  • Pathological

  • Bureaucratic

  • Generative

The cultural types exhibit the following behaviors:

Table 17. Westrum typology

Pathological (Power-oriented) | Bureaucratic (Rule-oriented) | Generative (Performance-oriented)
Low cooperation               | Modest cooperation           | High cooperation
Messengers (of bad news) shot | Messengers neglected         | Messengers trained
Failure is punished           | Failure leads to justice     | Failure leads to inquiry

(excerpted from [216])

The State of DevOps research has demonstrated a correlation between generative cultures and digital business effectiveness [216], [38]. Notice also the relationship to blameless postmortems discussed in Chapter 6.

State of DevOps survey research

DevOps is a broad term, first introduced in Chapter 3. As noted in that chapter, DevOps includes continuous delivery, team behavior and product management, and culture. Puppet Labs has sponsored an annual survey, the State of DevOps report, for the last 5 years. As of 2017, the survey data comprises 25,000 individual data points. It shows a variety of correlations, including:

  • Core continuous delivery practices such as version control, test automation, deployment automation, and continuous integration increase team engagement and IT and organizational performance

  • Lean product management approaches such as seeking fast feedback and splitting work into small batches also increase team engagement and IT and organizational performance [38].

9.7.3. Toyota Kata

Six years ago I began the research that led to [Toyota Kata] thinking, like just about everyone else, that the story was about techniques and other listable aspects of Toyota. Today I see Toyota in a notably different light: as an organization defined primarily by the unique behavior routines it continually teaches to all its members.
— Mike Rother
Toyota Kata

Academics and consultants have been studying Toyota for many years. The performance and influence of the Japanese automaker are legendary, but it has been difficult to understand why. Much has been written about Toyota’s use of particular tools, such as Kanban bins and andon boards. However, Toyota views these as ephemeral adaptations to the demands of its business.

Figure 167. Toyota kata

According to Mike Rother in Toyota Kata [228], underlying Toyota’s particular tools and techniques are two powerful practices:

  • The improvement kata

  • The coaching kata

What is a kata? It is a Japanese word stemming from the martial arts, meaning pattern, routine, or drill. More deeply, it means “a way of keeping two things in alignment with each other.” The improvement kata is the repeated process by which Toyota managers investigate and resolve problems, in a hands-on, fact-based, and preconception-free manner, and improve processes towards a “target operating condition.” The coaching kata is how the improvement kata is instilled in new generations of Toyota managers (see Toyota kata [74]).

As Rother describes it, the coaching and improvement katas establish and reinforce a coherent culture or mental model of how goals are achieved and problems approached. It is understood that human judgment is not accurate or impartial. The method compensates with a teaching-by-example focus on seeking facts without preconceived notions, through direct, hands-on investigation and experimental approaches.

This is not something that can be formalized into a simple checklist or process; it requires many guided examples and applications before the approach becomes ingrained in the upcoming manager.

9.8. Industry frameworks

Having discussed organizational structure, hiring, and culture, we will now turn to a critical examination of the IT management frameworks.

Industry frameworks and bodies of knowledge play a powerful role in shaping organizational structures and their communication interfaces, and creating a base of people with consistent skills to fill the resulting roles. While there is much of value in the frameworks, they may lead you into the planning fallacy or defined process traps. Too often, they assume that variation is the enemy, and they do not provide enough support for the alternative approach of empirical process control. As of this writing, the frameworks are challenged on many fronts by Agile, Lean, and DevOps approaches.

9.8.1. Defining frameworks

There are other usages of the term “framework,” especially in terms of software frameworks. Process and management frameworks are non-technical.

So, what is a “framework?”

The term “framework,” in the context of a business process, is used for comprehensive and systematic representations of the concerns of a professional practice. In general, an industry framework is a structured artifact that seeks to articulate a professional consensus regarding a domain of practice. The intent is usually that the guidance be mutually exclusive and collectively exhaustive within the domain so that persons knowledgeable in the framework have a broad understanding of domain concerns.

The first goal of any framework, for a given conceptual space, is to provide a “map” of its components and their relationships. Doing this serves a variety of goals:

  • Develop and support professional consensus in the business area

  • Support training and orientation of professionals new to the area (or its finer points)

  • Support governance and control activities related to the area (more on this in Chapter 10)

Many frameworks have emerged in the IT space, with broader and narrower domains of concern. Some are owned by non-profit standards bodies; others are commercial. We will focus on five in this book. In roughly chronological order, they are:

  • CMMI (Capability Maturity Model-Integrated)

  • ITIL (originally the Information Technology Infrastructure Library)

  • PMBOK (The Project Management Body of Knowledge)

  • CObIT (aka Control Objectives for Information Technology)

  • The TOGAF® framework (The Open Group standard for Enterprise Architecture)

The frameworks are summarized in the appendix.

9.8.2. Observations on the frameworks

In terms of the new digital delivery approaches, there are a number of issues and concerns with the frameworks:

  • The fallacy of statistical process control

  • Local optimization temptation

  • Lack of execution model

  • Proliferation of secondary artifacts, compounded by batch orientation

  • Confusion of process definition

The problem of statistical process control

CMM author Watts Humphrey’s original vision was to apply full statistical process control to the software process. As he stated at the time:

Dr. W. E. Deming, in his work with the Japanese after World War II, applied the concepts of statistical process control to many of their industries. While there are important differences, these concepts are just as applicable to software as they are to producing consumer goods like cameras, television sets, or automobiles [130 p. 3].

The overall CMM/CMMI idea (in the original staged model) is that a process cannot be improved and optimized until it is fully under control. Perhaps well-defined industrial processes should not be optimized until they are fully “managed.” However, as we discussed in the previous section, process control theorists see creative, knowledge-intensive processes as requiring empirical control. Statistical process control applied to software has therefore been criticized as inappropriate [218].

In CMM terms, empirical process control starts by measuring and immediately optimizing (adjusting). To restate the Martin Fowler quote from the last section: “a process can still be controlled even if it can’t be defined” [241]. Creative, knowledge-intensive processes need not, and cannot, be fully defined. Therefore, one of the most questionable aspects of CMMI is its implication that process optimization is something only done at the highest levels of maturity.

In short, the CMMI staged model encourages the thought that process improvement (optimization) is only possible at Level 5. Many companies implementing CMMI stages, however, will pragmatically say “Maybe we only need to get to Level 3.” This implies that they define and manage their processes, but never improve them.

This runs against much current thinking and practice, especially that deriving from Lean philosophy, in which processes are seen as always under improvement. (See the discussion of Toyota Kata.) All definition, measurement, and control must serve that end.

The CMMI has evolved since Humphrey’s initial vision, but between its misapplication of statistical process control and the idea that process optimization is only relevant at the highest maturity, it is (in the view of this author) badly out of step with current digital trends.

The other frameworks do not embrace statistical process control to the same extent as the CMMI. PMBOK suggests that “control charts may also be used to monitor cost and schedule variances, volume, and frequency of scope changes, or other management results to help determine if the project management processes are in control” [215 pp. 4108-4109]. This also contradicts the insights of empirical process control, unless the project were also a fully defined process — unlikely from a process control perspective.
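To make the control-chart idea concrete, the following is a minimal sketch, not a method taken from PMBOK or CMMI. It assumes a simplified Shewhart-style rule (center line plus or minus three standard deviations of a baseline) rather than the moving-range estimates often used in practice, and the schedule-variance figures are hypothetical. The point to notice is that the limits are only meaningful if the baseline comes from a stable, repeatable process, which is exactly what is in question for creative work.

```python
from statistics import mean, pstdev


def control_limits(samples, sigmas=3.0):
    """Return (lower, center, upper) limits from a baseline of observations."""
    center = mean(samples)
    spread = pstdev(samples)  # simplification: plain population standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, observations, sigmas=3.0):
    """Return the observations that fall outside the baseline's control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [x for x in observations if x < lower or x > upper]


# Hypothetical weekly schedule variances in days; illustrative data only.
baseline = [1.0, -0.5, 0.5, 0.0, -1.0, 0.5, -0.5, 0.0]
new_weeks = [0.5, 4.0, -0.5]

flagged = out_of_control(baseline, new_weeks)
print(flagged)
```

With these numbers, only the four-day slip falls outside the limits. Whether that signal is useful depends on whether next week's work resembles the baseline at all.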

Local optimization temptation
We must not seek to optimize every resource in the system … A system of local optimums is not an optimum system at all; it is a very inefficient system.
— Eli Goldratt
The Goal

IT capability frameworks can be harmful if they lead to fragmentation of improvement effort and lack of focus on the flow of IT value.

The digital delivery system at scale is a complex socio-technical system, including people, process, and technology. Frameworks help in understanding it, by breaking it down into component parts in various ways. This is all well and good, but the danger of reductionism emerges.

There are various definitions of “reductionism.” This discussion reflects one of the more basic versions.

A reductionist view implies that a system is nothing but the sum of its parts. Therefore, if each of the parts is attended to, the system will also function well.

This can lead to a compulsive desire to do “all” of a framework. If ITIL calls for 25 processes, then a large, mature organization by definition should be good at all of them. But the 25 processes (and dozens more sub-processes and activities) called for by ITIL, or the 32 called for by CObIT, are somewhat arbitrary divisions. They overlap with each other. Furthermore, there are many digital organizations that do not use the full ITIL or CObIT process portfolio and yet deliver value as well as organizations that do use the frameworks to a greater degree.

The temptation for local, process-level optimization runs counter to core principles of Lean and systems thinking. Many management thinkers, including W.E. Deming, Eli Goldratt, and others have emphasized the dangers of local optimization and the need for taking a systems view.
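The danger of local optimization can be illustrated with a small sketch. It assumes a strictly serial flow in which the system delivers no faster than its slowest stage; the stage names and weekly rates are hypothetical and chosen only for illustration.

```python
def system_throughput(stage_rates):
    """In a serial flow, the slowest stage bounds what the whole system delivers."""
    return min(stage_rates.values())


# Hypothetical stage capacities in work items per week.
stages = {"design": 10, "build": 8, "test": 3, "deploy": 12}

before = system_throughput(stages)

# Local optimization: double the capacity of a stage that is not the constraint.
after_local = system_throughput(dict(stages, build=16))

# Systems view: invest the same effort in the constraint instead.
after_constraint = system_throughput(dict(stages, test=6))

print(before, after_local, after_constraint)
```

Doubling the build stage leaves system throughput unchanged, while doubling the constrained test stage doubles it. Real delivery systems are not purely serial, so this is only a first approximation of the argument.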

As this book’s structure suggests, delivering IT value requires different approaches at different scales. There is recognition of this among framework practitioners; however, the frameworks themselves provide insufficient guidance on how they scale up and down.

Lack of execution model

It is also questionable whether even the largest actual IT organizations on the planet could fully implement the frameworks. Specifying too many interacting processes has its own complications. Consider: Both ITIL and CObIT devote considerable time to documenting possible process inputs and outputs. As a part of every process definition, ITIL has a section entitled “Triggers, inputs, outputs, and interfaces.” The “Service Level Management Process” [266 pp. 120-122], for example, lists:

  • 7 triggers (e.g., “service breaches”)

  • 10 inputs (e.g., “customer feedback”)

  • 10 outputs (e.g., “reports on OLAs”)

  • 7 interfaces (e.g., “Supplier management”)

CObIT similarly details process inputs and outputs. In the Enabling Processes guidance, each management practice suggests inputs and outputs. For example, the APO08 process “Manage Relationships” has an activity of “Provide input to the continual improvement of services,” with:

  • 6 inputs

  • 2 outputs

But processes do not run themselves. These process inputs and outputs require staff attention. They imply queues and therefore work in process, often invisible. They impose a demand on the system, and each handoff represents transactional friction. Some handoffs may be implemented within the context of an IT management suite; others may require procedural standards, which themselves need to be created and maintained. The industry currently lacks understanding of how feasible such fully elaborated frameworks are in terms of the time, effort, and organizational structure they imply.
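A rough way to see the demand these definitions imply is to count the touchpoints and attach an effort figure to each. The sketch below uses the counts quoted above for the two examples; the `ProcessDefinition` type and the half hour of staff attention per touchpoint are hypothetical assumptions made for illustration, not figures from ITIL or CObIT.

```python
from dataclasses import dataclass


@dataclass
class ProcessDefinition:
    """Hypothetical record of the touchpoints a framework lists for one process."""
    name: str
    triggers: int = 0
    inputs: int = 0
    outputs: int = 0
    interfaces: int = 0

    def touchpoints(self) -> int:
        # Every trigger, input, output, and interface is something staff must handle.
        return self.triggers + self.inputs + self.outputs + self.interfaces


# Counts as quoted in the text above.
slm = ProcessDefinition("ITIL Service Level Management", 7, 10, 10, 7)
apo08 = ProcessDefinition("CObIT APO08 activity", inputs=6, outputs=2)

# Assumption for illustration only: staff hours per touchpoint per cycle.
HOURS_PER_TOUCHPOINT = 0.5

for process in (slm, apo08):
    hours = process.touchpoints() * HOURS_PER_TOUCHPOINT
    print(f"{process.name}: {process.touchpoints()} touchpoints, {hours} hours")
```

Even under this modest assumption, a single ITIL process accounts for 34 touchpoints. The actual effort per touchpoint is precisely what the industry has not measured well.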

We have discussed the issue of overburden previously. Too many organizations have contending execution models, where projects, processes, and miscellaneous work all compete for people’s attention. In such environments, the overburden and wasteful multi-tasking can reach crisis levels. With ITIL in particular, because it does not cover project management or architecture, we have a very large quantity of potential process interactions that is nevertheless incomplete.

Secondary artifacts, compounded by batch orientation
We move away from heavily documented handoffs to a process that creates only the design artifacts we need to move the team’s learning forward.
— Jeff Gothelf
Lean UX

The process handoffs also imply that artifacts (documents of various sorts, models, software, etc.) are being created and transferred between teams, or at least between roles on the same team, with some degree of formality. Primary artifacts are executable software and any additional content intended directly for value delivery. Secondary artifacts are anything else.

An examination of the ITIL and CObIT process interactions shows that many of the artifacts are secondary concepts such as “plans,” “designs,” or “reports”:

  • Design specifications (high level and detailed)

  • Operation and use plan

  • Performance reports

  • Action plans

  • Consideration and approval

and so on. (Note that actually executable artifacts are not included here).

Again, artifacts do not create themselves. Hundreds of artifacts are implied in the process frameworks. Every artifact implies:

  • Some template or known technique for performing it

  • People trained in its creation and interpretation

  • Some capability to store, version, and transmit it

Unstructured artifacts such as plans, designs, and reports, in particular, impose high cognitive load and are difficult to automate. As digital organizations automate their pipelines, it becomes essential to identify the key events and elements they may represent, so that they can be embedded into the automation layer.

Finally, even if a given process framework does not specifically call for waterfall, one can sometimes still see its legacy. For example:

  • Calls for thorough, “rigorous” project planning and estimation

  • Cautions against “cutting corners”

  • “Design specifications” moving through approval pipelines (and following a progression from general to detailed)

All of these tend to signal a large batch orientation, even in frameworks making some claim of supporting Agile.

Good system design is a complex process. We introduced technical debt in Chapter 3, and will revisit it in Chapter 12. But the slow feedback signals resulting from the batch processes implied by some frameworks are unacceptable in current industry. This is in part why new approach