22. Software Defined Infrastructure

Digital means that a signal of interest is being represented by a sequence or array of numbers. Analog means that a signal is represented by the value of some continuously variable quantity. This variable can be the voltage or current in an electrical circuit [Steiglitz 2019].

The shift from analog to digital makes it possible to control hardware with software. IT infrastructure is becoming fully digital. Compute and storage resources are digital and have been virtualized for quite a while; adding network infrastructure virtualization to the mix opens a vast range of possibilities for system administrators.

Digital technologies enable the decoupling of infrastructure from the underlying hardware, turning it into data and code. Once infrastructure becomes code, software practices can replace the old ways of working; for example, DevOps merges source control systems, test-driven development, and continuous integration with infrastructure management.

The software engineering discipline is spreading into domains where the combination of software and hardware creates new and powerful capabilities; for example, autonomous vehicles and connected objects.

This chapter will briefly introduce Infrastructure as Code and illustrate the power of software engineering for managing hardware using the DevOps (see Section 22.2) and SRE (see Section 22.3) examples.

22.1. Infrastructure as Code

Infrastructure as Code is the process of managing and provisioning computer data centers through machine-readable definition files, rather than through physical hardware configuration or interactive configuration tools. The machine-readable definition files provide a descriptive, abstract model of the hardware and infrastructure resources that are needed. This descriptive model is kept under the same version control that DevOps teams use for source code. Infrastructure as Code tools transform this descriptive model, using dynamic infrastructure platforms, to create the actual configuration files required by the hardware and infrastructure components.

A dynamic infrastructure platform is the system that provides computing resources in a way that can be programmatically allocated and managed.

The most common dynamic infrastructure platforms are public clouds (e.g., AWS or Azure®) or private clouds that provide server, storage, and networking resources.

Infrastructure can also be managed dynamically using virtualization systems or even bare metal with tools such as Cobbler or Foreman.
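
To make the idea concrete, here is a minimal Python sketch of the pattern behind Infrastructure as Code tools: a declarative model of the desired infrastructure is compared with the actual state reported by a platform, and the difference drives the actions to apply. The ServerSpec type, field names, and reconcile function are illustrative assumptions, not the API of any real tool.

```python
# Minimal Infrastructure as Code sketch (illustrative only; the platform API
# here is hypothetical, not a real cloud SDK).
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerSpec:
    name: str
    cpu: int        # virtual CPUs
    memory_gb: int

# The "machine-readable definition file": a declarative model of desired state.
DESIRED = [
    ServerSpec(name="web-1", cpu=2, memory_gb=4),
    ServerSpec(name="web-2", cpu=2, memory_gb=4),
]

def reconcile(desired: list[ServerSpec], actual: dict[str, ServerSpec]) -> list[str]:
    """Compute the actions needed to make the platform match the model."""
    actions = []
    for spec in desired:
        current = actual.get(spec.name)
        if current is None:
            actions.append(f"create {spec.name} ({spec.cpu} vCPU, {spec.memory_gb} GB)")
        elif current != spec:
            actions.append(f"update {spec.name} to match definition")
    for name in actual:
        if name not in {s.name for s in desired}:
            actions.append(f"destroy {name} (not in definition)")
    return actions

if __name__ == "__main__":
    actual_state = {"web-1": ServerSpec("web-1", 1, 2)}  # drifted from the model
    for action in reconcile(DESIRED, actual_state):
        print(action)
```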

22.2. DevOps

DevOps is a way of running systems requiring developers and operators to participate together in the entire system lifecycle, from design through development and production support. Operators use the same techniques as developers for their systems work (test, automation, etc.).

The DevOps approach drives better organizational performance by strengthening an organization’s ability to efficiently deliver and operate software systems in pursuit of its goals.

22.2.1. DevOps Objectives: Organizational and Software-Delivery Performance

Organizational performance measures the ability of an organization to achieve commercial and non-commercial goals such as profitability, productivity, market share, number of customers, operational efficiency, customer satisfaction, and the quantity and quality of products or services delivered.

Better performance outcomes are achieved when capabilities are implemented and principles are followed; those capabilities rely on the behavior and practices of development and operations teams, supported by tools.

22.2.2. Four Key Metrics

A key takeaway of the DORA State of DevOps report [DORA State of Devops Report 2019] is the four key metrics that support and measure the software delivery performance of an organization:

  • Lead time of changes: the time it takes to go from code committed to code successfully running in production

  • Deployment frequency: the number of deployments to production per unit of time, aggregated by teams, departments, and the whole organization

  • Mean time to restore: also known as the Mean Time to Recover (MTTR) metric, the average time it takes to restore service after a failure

  • Change failure rate: a measure of how often deployment failures occur in production that require immediate remedy (in particular, rollbacks)

These four metrics provide a high-level systems view of software delivery and performance, and predict an organization’s ability to achieve its goals. They can be summarized in terms of throughput (deployment frequency and lead time) and stability (MTTR and change failure rate). A fifth useful metric is the availability of systems (a count-based availability measure is more meaningful than a time-based one [Colyer 2020]).
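
As an illustration, the following Python sketch computes the four metrics from a handful of hypothetical deployment records; the record layout and figures are invented for the example and are not taken from the DORA report.

```python
# Sketch: computing the four key metrics from hypothetical deployment records.
from datetime import datetime
from statistics import mean

deployments = [
    # (commit_time, deploy_time, failed, restore_time or None)
    (datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 15), False, None),
    (datetime(2023, 5, 2, 10), datetime(2023, 5, 2, 18), True, datetime(2023, 5, 2, 19)),
    (datetime(2023, 5, 3, 8), datetime(2023, 5, 3, 11), False, None),
]

# Lead time for changes: commit -> running in production.
lead_time = mean((deploy - commit).total_seconds() / 3600
                 for commit, deploy, _, _ in deployments)

# Deployment frequency: deployments per day over the observed window.
days = (deployments[-1][1] - deployments[0][1]).days or 1
frequency = len(deployments) / days

# Mean time to restore: average time from a failed deployment to recovery.
mttr = mean((restore - deploy).total_seconds() / 3600
            for _, deploy, failed, restore in deployments if failed)

# Change failure rate: share of deployments needing immediate remediation.
failure_rate = sum(failed for _, _, failed, _ in deployments) / len(deployments)

print(f"lead time {lead_time:.1f} h, {frequency:.1f} deploys/day, "
      f"MTTR {mttr:.1f} h, change failure rate {failure_rate:.0%}")
```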

22.2.3. DevOps Principles

The DevOps approach is based on fundamental principles, listed below; some come from the Agile Manifesto, while others, such as automation and repeatability, are specific to operations:

  • Continuous delivery of value: deliver working software that delivers value to end users frequently with the shortest lead time possible; working software in the hands of the user is the primary measure of progress

  • Collaboration: business people, operators, and developers must work together daily – face-to-face conversation is the most efficient and effective method of conveying information to and within a development team

  • Feedback and testing: feedback is crucial for every activity in the Agile and DevOps chain; without that feedback loop developers and operators cannot have any confidence that their actions have the expected result, and a good testing strategy supports that feedback loop, with tests taking place early and frequently to enable continuous integration

  • Culture and people: build projects around motivated individuals, giving them the environment and support they need, and trusting them to get the job done; the best architectures, requirements, and designs emerge from self-organizing teams

  • Continuous improvement: at regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly

  • Design and simplicity: attention to technical excellence and good design enhances agility; simplicity is essential, reducing toil and maximizing the amount of work not done are crucial (toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows)

  • Automation: assigning tasks to the machine to avoid human intervention cuts down on repetitive work, minimizing mistakes and saving time and energy for the human operator

    The gain from automation, apart from saving time, is consistency, scalability, faster action and repair, and robustness; automated actions provide a “platform” (something that can be extended and applied broadly).

    The drawbacks of automation are that the effects of an action can be very broad; once an error is introduced, the consequences can be enormous (does that strike a chord?) – that is why some stability patterns, like a governor, are useful in such cases.

    Automation means “code” and, like developers who apply best practices to manage their code, operations engineers should do the same with their automated tasks: design it (what is the problem and the solution?), build it, test it, and deploy it.

    Good test coverage and testing habits within the team are mandatory for automation to deliver its results; testing provides a safety net and the confidence that changes can be deployed without breaking anything (a short sketch follows this list).

  • Self-service: enables consumers of services to use them independently, because the IT team has done the necessary work beforehand to ensure minimal interruptions and delays in service delivery, thus achieving a state of continuous delivery

  • Shift left: an approach in which testing is performed earlier in the lifecycle (moved left on the project timeline); the team focuses on quality and works on problem prevention rather than detection, because testing begins and decisions are validated early in development and delivery

    Shift-left testing helps to prevent waste on uncovered requirements, architecture, and design defects due to late testing.

  • Repeatability: DevOps is all about creating a repeatable and reliable process for delivering software, where every step in the delivery pipeline should be deterministic and repeatable: provisioning environments, building artifacts, deploying those artifacts, etc.
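
The Python sketch below illustrates the automation and repeatability principles under stated assumptions: an automated operations task written as an idempotent function, plus a small acceptance test proving that running it twice converges to the same state. The file name and configuration line are hypothetical.

```python
# Sketch: treating an automated operations task like code, with a small
# acceptance test (file path and desired content are hypothetical examples).
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Idempotently ensure a configuration line is present; return True if changed."""
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False                       # already in the desired state, do nothing
    path.write_text(text + line + "\n")
    return True

def test_ensure_line(tmp_dir: Path) -> None:
    """Acceptance test: running the task twice must converge and not re-apply."""
    cfg = tmp_dir / "app.conf"
    assert ensure_line(cfg, "max_connections = 100") is True    # first run changes state
    assert ensure_line(cfg, "max_connections = 100") is False   # second run is a no-op
    assert cfg.read_text().count("max_connections") == 1

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        test_ensure_line(Path(d))
        print("automation task behaves idempotently")
```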

22.2.4. Capabilities

The capabilities leading to better software delivery performance are continuous delivery, architecture, Lean Product Development and management, and improved organizational culture.

22.2.5. Behavior and Practices

DevOps promotes a set of good practices:

  • Infrastructure as Code: infrastructure is code and should be developed and managed as such (see the automation and repeatability principles), particularly the testing part to ensure the operations code is reliable and safe; testing the “automated tasks” means checking that some acceptance criteria are met

  • Test automation of application code as well as operations code

  • Application and infrastructure monitoring

  • Automated dashboards

  • Automated release and deployment

  • Continuous “everything”: integration, testing, delivery, and deployment

    • Continuous integration: the process of integrating new code written by developers with a mainline or “master” branch frequently throughout the day, in contrast to having developers working on independent feature branches for weeks or months at a time and merging their code back to the master branch only when it is completely finished

    • Continuous delivery: a set of general software engineering principles that allow for frequent releases of new software through the use of automated testing and continuous integration; it is often thought of as taking continuous integration one step further: beyond simply making sure new changes can be integrated without causing regressions in automated tests, continuous delivery means that these changes can be deployed (see the sketch after this list)

  • Continuous deployment: the process of deploying changes to production by defining tests and validations to minimize risk; while continuous delivery makes sure that new changes can be deployed, continuous deployment means that they get deployed into production

  • Continuous improvement: retrospective and postmortem activities performed by the operations team following every incident or outage
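
The following Python sketch shows one way the “continuous everything” gating could look: each stage of a delivery pipeline runs only if the previous one succeeded. The stage names and echo commands are placeholders, not tied to any particular CI/CD product.

```python
# Sketch of a gated delivery pipeline: each stage must pass before the next
# runs. Stage names and commands are hypothetical, not from any CI tool.
import subprocess
import sys

STAGES = [
    ("integrate", ["echo", "merge to mainline and build"]),
    ("test",      ["echo", "run automated test suite"]),
    ("deliver",   ["echo", "publish deployable artifact"]),
    ("deploy",    ["echo", "roll out to production"]),
]

def run_pipeline() -> int:
    for name, command in STAGES:
        print(f"--- stage: {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"stage '{name}' failed; stopping the pipeline")
            return result.returncode      # later stages never run on failure
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```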

22.2.6. DevOps Tools

Categories of tools that support a DevOps approach:

  • Analytics

  • Application performance management

  • Cloud infrastructure

  • Collaboration

  • Containers and orchestration solutions

  • Continuous integration/deployment

  • Provisioning and change management

  • Configuration and deployment

  • Information Technology Service Management (ITSM)

  • Logging

  • Monitoring

  • Project and issue tracking

  • Source control management

  • Functional and non-functional testing tools

22.3. Site Reliability Engineering (SRE)

“SRE is what you get when you treat operations as if it is a software problem.” [sre.google]

SRE is a job role and a set of practices from production engineering and operations at Google, now widely adopted across the industry. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.

For details about SRE, refer to the series of books: [Beyer 2016], [Beyer 2018], and [Adkins 2020].

22.3.1. SRE Principles

  • Operations is a software problem

  • Managed by Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs)

  • Managing risk with an error budget: a happy medium between velocity and unreliability (see the sketch after this list)

  • Work to minimize toil (manual, repetitive, automatable, reactive, grows at least as fast as its source)

  • Automation

  • Shared ownership with developers

  • Use the same tooling as software engineers
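
A minimal sketch of the SLO/error-budget arithmetic follows; the 99.9% target, request counts, and release-freeze policy are illustrative assumptions rather than prescribed values.

```python
# Sketch: deriving an error budget from an SLO and checking it against an SLI.
SLO = 0.999                      # availability target over the window (assumed)
total_requests = 2_500_000       # SLI denominator measured over the window
failed_requests = 1_800          # SLI numerator: requests that failed

error_budget = (1 - SLO) * total_requests       # failures the SLO allows
budget_consumed = failed_requests / error_budget

availability = 1 - failed_requests / total_requests
print(f"measured availability: {availability:.4%}")
print(f"error budget: {error_budget:.0f} failed requests allowed")
print(f"budget consumed: {budget_consumed:.0%}")

# A common policy (an assumption here, not a universal rule): when the budget
# is exhausted, the team pauses risky releases and focuses on reliability.
if budget_consumed >= 1.0:
    print("error budget exhausted: freeze risky releases")
else:
    print("error budget remaining: velocity can continue")
```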

22.3.2. SRE Practices

22.3.2.1. On-Call in Parallel with Project Work

On-call engineers take care of their assigned operations by managing outages and performing changes on production systems within minutes. As soon as a page is received, the on-call engineer is expected to triage the problem and work toward its resolution, possibly involving other team members and escalating as needed. Project work should take at least 50% of SRE time; of the remainder, no more than 25% should be spent on-call, with the rest going to other operational, non-project work. Why is this practice important? Balancing on-call activity with engineering work allows production activities to scale while maintaining high reliability despite the increasing complexity and volume of systems to manage.

22.3.2.2. Incident and Emergency Response

Everything fails, all the time. A proper response to failure and emergency, with a clear line of command and roles, a working record of actions, etc., takes preparation and frequent, hands-on training. Key activities are needed to reduce mean time to recovery and reduce stress while working on problems:

  • Formulating an incident management strategy in advance

  • Structuring this plan to scale smoothly

  • Regularly putting the plan to use

22.3.2.3. Postmortem, “Near Miss” Culture

Completing a postmortem ensures that any incident is documented, all contributing root causes are well understood, and effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence. Even “near misses” should contribute to improvement. A key factor of success is to commit to a blameless culture and spread the values of these postmortems by sharing them.

22.3.2.4. Managing Load

Managing load is important to ensure good performance and reliability, and services should produce reasonable, if suboptimal, results when overloaded. Every request has to be correctly balanced between data centers and also within a data center, distributing work to the individual servers that process the user request. Avoiding overload is a goal of load balancing policies, but eventually some part of any system will become overloaded. The graceful handling of overload conditions is fundamental to running a reliable serving system, ensuring no data center receives more traffic than it has the capacity to process. The rules, illustrated by the sketch after this list, are:

  • Redirect when possible

  • Serve degraded results when necessary

  • Handle resource errors transparently
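
The Python sketch below illustrates these three rules under simplified assumptions (a fixed capacity threshold, a simulated resource error, and a hypothetical fallback); it is not a production load-balancing algorithm.

```python
# Sketch of the overload rules above (thresholds and responses are hypothetical).
# A request is redirected, served degraded, or answered from a fallback rather
# than failing loudly.
import random

CAPACITY = 100  # requests the local backend can absorb per interval (assumed)

def handle_request(load: int, other_dc_has_capacity: bool) -> str:
    if load > CAPACITY and other_dc_has_capacity:
        return "redirect to another data center"            # redirect when possible
    if load > CAPACITY:
        return "serve degraded result (e.g., cached or partial response)"
    try:
        if random.random() < 0.05:                           # simulated resource error
            raise MemoryError("backend out of memory")
        return "serve full result"
    except MemoryError:
        return "serve fallback result (error handled transparently)"

if __name__ == "__main__":
    for load, spare in ((40, True), (150, True), (150, False)):
        print(f"load={load}, other DC has capacity={spare}: "
              f"{handle_request(load, spare)}")
```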

22.3.2.5. Non-Abstract Large System Design

Non-Abstract Large System Design (NALSD) describes the ability to assess, design, and evaluate large systems. Practically, NALSD combines the elements of capacity planning, component isolation, and graceful system degradation crucial to highly available production systems. Because systems change over time, it is important that an SRE is able to analyze and evaluate the key aspects of the system design.

Non-abstract means turning whiteboard design into concrete estimates of resources at multiple steps in the process.
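
As an illustration of what “non-abstract” means, the following back-of-the-envelope calculation turns a few assumed design parameters (users, request rate, payload size, replication factor) into concrete server and storage estimates; every number is invented for the example.

```python
# Sketch: a back-of-the-envelope NALSD-style estimate. All numbers below are
# assumptions chosen for illustration, not real design targets.
daily_active_users = 5_000_000
requests_per_user_per_day = 20
avg_response_bytes = 2_000
replication_factor = 3
server_capacity_qps = 500          # sustained requests/s one server can handle

seconds_per_day = 24 * 3600
avg_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
peak_qps = avg_qps * 3             # assumed peak-to-average ratio

servers_needed = -(-peak_qps // server_capacity_qps)   # ceiling division
daily_storage_gb = (daily_active_users * requests_per_user_per_day
                    * avg_response_bytes * replication_factor) / 1e9

print(f"average load: {avg_qps:,.0f} QPS, peak: {peak_qps:,.0f} QPS")
print(f"servers at peak (no headroom): {servers_needed:.0f}")
print(f"storage written per day (replicated): {daily_storage_gb:,.1f} GB")
```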

22.3.2.6. Configuration Design and Best Practices

Configurations are parameters that allow someone to modify system behavior without redeploying code. Configuring systems is a common SRE task. Systems have three key components: code, data the system manipulates, and system configuration.

A good configuration interface allows quick, confident, and testable configuration changes. Reliability is highly dependent on good configuration practices as one bad configuration can wipe out entire systems.

22.3.2.7. Canarying Releases

Release engineering describes all the processes and artifacts related to getting code from a source repository into a running production system. Canarying a release is the partial and time-limited deployment of a change in a service, followed by its evaluation. The evaluation helps to decide whether or not to proceed with the rollout. The part of the system that receives the change is “the canary”, and the remainder of the system is “the control”. The logic underpinning this approach is that the canary deployment is usually performed on a much smaller subset of production, or affects a much smaller subset of the user base, than the control portion.
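
A minimal sketch of a canary evaluation follows; the traffic split, error counts, and 1.5x tolerance are assumptions chosen for illustration, and a real evaluation would typically compare several SLIs.

```python
# Sketch of a canary evaluation: compare an error-rate SLI between the canary
# (small slice receiving the change) and the control (rest of production).

def evaluate_canary(canary_errors: int, canary_requests: int,
                    control_errors: int, control_requests: int,
                    tolerance: float = 1.5) -> bool:
    """Return True if the rollout should proceed."""
    canary_rate = canary_errors / canary_requests
    control_rate = control_errors / control_requests
    # Proceed only if the canary is not significantly worse than the control.
    return canary_rate <= control_rate * tolerance

if __name__ == "__main__":
    # The canary receives roughly 10% of traffic in this example.
    proceed = evaluate_canary(canary_errors=12, canary_requests=10_000,
                              control_errors=90, control_requests=90_000)
    print("proceed with rollout" if proceed else "roll back the change")
```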

22.3.2.8. Data Processing Pipelines

Data processing is an important topic nowadays, given growing datasets, intensive data transformations, and the requirement for fast, reliable, and inexpensive results. Data quality errors become business-critical issues whenever they are introduced into the resulting data. A pipeline can involve multiple stages, each stage being a separate process with dependencies on other stages. Best practices identified by SRE include (a sketch follows the list):

  • Defining and measuring SLOs for the pipeline

  • Planning for dependency failure

  • Creating and maintaining pipeline documentation

  • Reducing hotspots and workload

  • Implementing autoscaling and resource planning

  • Adhering to access control and security policies

  • Planning escalation paths
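
The Python sketch below illustrates two of these practices, measuring a freshness SLO and planning for dependency failure with retries; the stages, SLO value, and retry policy are hypothetical.

```python
# Sketch of a two-stage pipeline that measures a freshness SLO and plans for
# dependency failure by retrying with backoff (stage logic is hypothetical).
import time

FRESHNESS_SLO_SECONDS = 600   # data must be no older than 10 minutes (assumed)

def fetch_source(attempts: int = 3) -> list[int]:
    """Stage 1: read from an upstream dependency, retrying on failure."""
    for attempt in range(attempts):
        try:
            return [1, 2, 3, 4]            # stand-in for the real extraction
        except IOError:
            time.sleep(2 ** attempt)       # exponential backoff before retrying
    raise RuntimeError("upstream dependency unavailable; escalate per runbook")

def transform(records: list[int]) -> list[int]:
    """Stage 2: transform the extracted records."""
    return [r * 10 for r in records]

if __name__ == "__main__":
    start = time.time()
    output = transform(fetch_source())
    freshness = time.time() - start
    print(f"produced {len(output)} records; freshness SLI = {freshness:.2f}s "
          f"(SLO: {FRESHNESS_SLO_SECONDS}s)")
```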

22.3.2.9. Configuration Specifics

The task of configuring and running applications in production requires insight into how those systems are put together and how they work. When things go wrong, the on-call engineer needs to know exactly where the configurations are and how to change them. The mundane task of managing configurations replicated across a system leads to replication toil, which is particularly frequent in distributed systems. Configuring production systems with confidence requires applying best practices to manage complexity and operational load. Configuration mechanisms require (a sketch follows the list):

  • Good tooling (linters, debuggers, formatters, etc.)

  • Hermetic configurations (configuration languages must generate the same configuration data regardless of where or when they execute)

  • Separation of configuration and data (configurations should, after evaluation, drive several data items downstream, providing clear separation of evaluation and side effects)
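
To illustrate the hermeticity and separation points, the following sketch contrasts a configuration that depends on where and when it is evaluated with one whose inputs are explicit; the parameter names are hypothetical.

```python
# Sketch contrasting a non-hermetic and a hermetic configuration. The hermetic
# version pins every input, so evaluation yields the same data anywhere, anytime.
import datetime
import socket

def non_hermetic_config() -> dict:
    # Depends on where and when it runs: different hosts or days give different output.
    return {
        "replica_name": socket.gethostname(),
        "backup_window": datetime.date.today().isoformat(),
    }

def hermetic_config(replica_name: str, backup_date: str) -> dict:
    # All inputs are explicit parameters: evaluation is pure and repeatable,
    # and the resulting data is handed downstream as a separate artifact.
    return {
        "replica_name": replica_name,
        "backup_window": backup_date,
    }

if __name__ == "__main__":
    print("non-hermetic:", non_hermetic_config())
    print("hermetic:   ", hermetic_config("replica-eu-1", "2023-05-01"))
```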

22.3.3. Cloud-Native Infrastructure

A profound shift has been underway for several years: hardware is now controlled by software through APIs, and it comes in interconnected pools that can grow and shrink dynamically. The following five characteristics are the foundation of a paradigm change for running software (a sketch of elasticity driven by metered usage follows the list):

  • On-demand self-service: consumers can provision computing resources as needed; automatically, and without any human interaction required

  • Broad network access: capabilities are widely available and can be accessed through heterogeneous platforms (e.g., mobile phones, tablets, laptops, and workstations)

  • Resource pooling: provider resources are pooled in a multi-tenant model, with physical and virtual resources dynamically assigned and reassigned on-demand; the customer generally has no direct control over the exact location of provided resources, but may specify location at a higher level of abstraction (e.g., country, state, or data center)

  • Rapid elasticity: capabilities can be elastically provisioned and released to rapidly scale outward or inward commensurate with demand; to the consumer, the capabilities available for provisioning appear to be unlimited and can be appropriated in any quantity at any time

  • Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts); resource usage can be monitored, controlled, and reported for transparency
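
As a small illustration of rapid elasticity driven by measured service, the sketch below derives a desired instance count from metered utilization; the 60% target and the scaling rule are assumptions for the example, not a recommended autoscaling policy.

```python
# Sketch: a scaling decision driven by metered usage (rapid elasticity backed
# by measured service). The utilization target is an assumption.
TARGET_UTILIZATION = 0.6        # desired average CPU utilization per instance

def desired_instances(current_instances: int, measured_utilization: float) -> int:
    """Scale the pool so that measured utilization moves toward the target."""
    needed = current_instances * measured_utilization / TARGET_UTILIZATION
    return max(1, round(needed))

if __name__ == "__main__":
    for util in (0.3, 0.6, 0.9):
        print(f"utilization {util:.0%} with 10 instances -> "
              f"{desired_instances(10, util)} instances")
```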