6. Detect to Correct

Security Lockdown

Friday finally rolls around and, for the first time in months, Kathleen feels as though she can completely relax. It had taken just ten days to implement the changes to make sure all requests were approved, and the CAB is now fully aware of all changes from the stream-aligned teams going into production. In the following two months, they had managed to get four releases into production without major incidents. Even the Governance, Risk, and Compliance (GRC) team had started to use the Idea system to get security systems in place.

With summer in the air, Sven and Kathleen decide to take the next week off and go up to their cabin in the mountains. They give the kids the choice to come to the cabin with them for hiking and campfires or stay with Grandma and Grandpa. The unreliable cell phone service and absence of a TV at the cabin cause the kids to elect to stay in town at their grandparents’ house for swimming, playing with their two dogs, and eating Grandma’s famous home-baked treats. Over the last several weeks, Kathleen and her family of four had been staying in quarantine together in their small house full-time due to the global pandemic, following government warnings to shelter at home. After working at home and home-schooling the kids, both Sven and Kathleen have had enough “family bonding” and are relishing the thought of getting out of town, just the two of them.

As she and Sven start the two-hour drive to the cabin in the rental car from the airport, Kathleen looks out the window and reflects on all that has happened over the last year. She and her team had worked hard to get the Customer Digital Intimacy initiative going and to get the whole team aligned across the IT4IT value streams. As the digital products started to become operational, the new way of working started to become routine, and the new ceremonies were becoming the norm, Kathleen began to feel she could take some time off. Sven begged her not to bring her laptop and to take a rare break from her addiction to work, but she felt she would be more relaxed if she had it with her “just in case”. She promised Sven she would hide it from view on a shelf in the front hall closet.

For the first couple of days at the cabin, Sven had to implore her to stop walking around the property with her arm outstretched holding her phone out in front of her, trying to get cellular service for long enough to check her emails. But, by Tuesday, she started to relax and from then on, when they went on their hiking or canoe trips, she left her phone behind and was able to enjoy nature and the beautiful surroundings with Sven for the rest of their week together.

On Sunday, Sven drives them to the rental car drop-off center.

“It is a shame we have to go back already,” says Kathleen, with a sigh. “I wasn’t aware of just how much I needed this trip to unwind. Thanks for making me take time off and encouraging me to not think about work at all.”

Just as Kathleen finishes speaking, her phone rings. She recognizes the number immediately as one from the office and without hesitation she declines the call. Whatever this is, she thinks, it can wait. It’s already the end of the day, and she will be back in the office early on Monday. She knows she will check for messages when they arrive home anyway.

Within one minute, the phone rings again. Same number. Curious about what the fire could be, she apologizes to Sven and decides to answer the call. Sven shrugs, all too familiar with her choices, and motions for her to get out of the rental car and follow him to the shuttle bus that will take them to the airport departure gates.

“Hi Kathleen, Nick here.”

Nick has been covering her duties during her absence. Kathleen hears the anxiety in his voice.

“Sorry I have to disturb you, but I really need to pick your brain. Starting this afternoon, we are getting issues with the operation of the new digital products platform, where we host our digital products. Our customers are experiencing intermittent issues with accessing their own information and every other new request is not getting through. To make it worse, the Operations team is not able to get anything out of their monitoring systems, so they are logging in manually to individual servers and containers and using their tribal knowledge of the environment to do triage without any centralized information. The most annoying issue is that, once we have solved something, a few hours later our change seems to be rolled back, even though we have stopped all deployments. This means we are basically back to reactive mode only, which makes for a very slow triage speed. They’re basically trying to fly a plane when all the instruments have gone dark.”

Trying to hear Nick over the noise of the rental car shuttle bus and making a face to Sven that she hopes conveys, “Sorry about this”, she listens closely to Nick as he walks her through the details on their triage efforts.

“To solve the issues,” Nick continues nervously, “I have established two incident resolution swarms as we agreed – one for the product issue and one for the monitoring issue. We have checked all deployments done by the teams in the past 48 hours, but so far this has not revealed anything. We are at the end of our options of what to look for, so I wanted to check with you if you could remember anything we might have missed. We have not put anything into production in the last 24 hours!”

The longer the conversation goes on, the clearer it becomes that the teams have been doing a great job of narrowing down the problem using the approach they had tested out so much. Kathleen feels gratified by the willingness of the organization to embrace all the new ways of working. However, even with everything in place, they are somehow still failing to identify the root cause.

Kathleen is grateful to Sven as he gently moves her like a robot through the process of checking in for their flights, and ruefully watches as he handles all their luggage by himself.

“OK,” says Kathleen. “Can you get the list of change requests that have been accepted in the past three weeks?”

There is a pause on the other end of the line. “Well … We know all changes from the stream-aligned teams and none of them indicated any issues. But apparently, given the new way of working with the automation of creating changes for the CAB, the original formal change process has been dropped completely and we have identified there have been changes implemented outside the automated approach.”

“OK, that is not what I expected, clearly something we missed.” Kathleen sounds worried. “But how do we know what other changes have been implemented?”

“It seems the Operations team have been organizing themselves around the Agile way of working as well. They have started to adopt site reliability engineering[1] and currently they have reorganized into four teams to carry out the development of scripts, monitors, and other types of automation, trying to keep up with the rate of change that is hitting them. A big win has been the formation of Communities of Practice,[2] which unite professionals at ArchiSurance with shared interests in industry best practices, like automation – for managing operations in a scalable, reliable, and highly automated fashion. If they hadn’t done this, it would have been an issue for them to support the new digital products platform.”

Without letting Kathleen get a word in edgeways, Nick continues. “I know – before you say anything – it was also a surprise to me that they had implemented new things in production, even though it had nothing to do with the functionality of the platform itself. However, they have been doing a great job. For example, over the last four weeks they have set up a new discovery and monitoring environment, rolling out discovery agents to all platforms and automatically assigning the appropriate monitors to them. Unfortunately, starting this morning, both the discovery and the monitoring failed to report centrally.”

Kathleen starts to feel a stone in her stomach. This means they are no longer really able to understand what is going on in production. The first set of deployments was done without any problem; all the teams had aligned themselves on the releases and the new development techniques made the teams operate with relative independence. So why did this hit them so hard? She really needs to look into this when she gets back in the office. She also starts to feel nervous that there might be even more that she and her team are not aware of in production, given that there were also deployments done outside of her visibility as an architect.

“Nick, can you do me a favor and check with the Portfolio Management team and ask them what approvals have been given to which investments? Perhaps there is an investment approved that we have missed? See if you can learn from them what was supposed to go into production in the past two weeks. OK?”

Sven taps Kathleen on the shoulder. “If you really want to make our flight, then we have to go through security right now.”

Kathleen nods. “I really have to go right now, but I’ll call you when I’m sitting at the departure gate. Can you have the list ready by then?”

Nick answers in the affirmative and hangs up.

As she moves through the security process, Kathleen is reminded how very much she dislikes the policies implemented at airports, despite knowing why they are needed.

“Unfortunately,” she says, turning to Sven, “security is ubiquitous – it touches us all whether we like it or not and –”

Kathleen stops talking mid-sentence. Sven looks at her, puzzled. “Why the funny face? Yes, security practices are unpleasant, but very necessary. Why are you making a big deal out of it now?”

Kathleen’s brain has shifted into a higher gear and she ignores Sven completely. She suddenly remembers that the Security team recently released a new version of their security platform. As part of that release a new set of policies was supposed to be deployed into production. The Enterprise Architecture team had been asked to help identify the set of systems that would be covered by these policies so that they could be tuned to make sure production would not be impacted.

After the initial implementation, the Security team would be able to apply the necessary policies to each of the systems. But the Security team took it one step further. The system had its own AI functionality and was able to apply the necessary changes to systems in production automatically based on the applicable policies. They had identified all mission-critical systems and had also agreed to apply the highest level of security to all machines associated with those mission-critical systems, including those that would be discovered in the future, beyond these early identified systems. Since all mission-critical systems had been included, nobody expected any major issues as a result of this approach. However, given that the Enterprise Architecture team was not even aware of the changes made by the Operations teams, the new monitoring systems were most likely not part of any of the identified mission-critical systems.

Kathleen starts to rush through the airport security process, dragging a confused Sven with her. She has to give Nick a call as soon as possible. Most likely it was the security system that was blocking the monitoring system and its AI functionality was constantly rolling back the changes made by the Operations team. Once she finally finds an open seat in the already crowded lounge, she quickly dials Nick’s number.

“It’s the new security system!!” they both say in unison when Nick picks up the phone.

Nick tells Kathleen that once he opened the portfolio management system, he identified that the only other change was the security system. When he reached out to the Security team, they confirmed the rollout of a number of new policies that day, including the automatic rollout of the highest level of security for new systems. They also indicated they could not figure out why the system was constantly applying the same policies over and over again and were frantically looking to see if there had been a security breach.

Kathleen and Nick agree that Nick will take care of managing the rest of the resolution, by bringing the Security team into the incident swarm.

By the time Kathleen’s plane lands, all operations have been returned to normal and the incident swarm has been dissolved. Sven had complained to her on the flight that he thought something must be wrong in her organization if they could not function without her. Although she knows he is right, and it is the result of an organization design gap that lacks cross-discipline visibility, Kathleen was actually feeling pretty happy with the way the situation had been handled. Yes, there was still the open issue of the unknown changes and the inability of the Architecture Repository to flag discrepancies with the actual production, but they were well on the way toward working together across functions. Once again, she found herself wondering if that last IT4IT value stream would be able to help her as before. However, this really was something that could wait until Monday. It was time to pick up the kids and return to their “new normal”.

After arriving at work on Monday morning, Kathleen calls for a post mortem session, so they all could learn from the incident in a safe environment. She invites the Security, Operations, and Development teams as well as her own Enterprise Architecture team, and opens with the questions: What went right? And what could we have done better?

She has her own ideas about the answers and is glad to see the outcome of the session is going in the same general direction. The team thought that what they did well was getting together so quickly and bringing the necessary knowledge into the incident swarm and – outside of some minor adjustments to the way people are called into the swarm – they agreed it was fit-for-purpose. However, they did identify two things that could be done differently and better in the future: first, to understand the actual state of production, and second, to identify the potential impacts of changes.

Kathleen agrees to come up with a solution, indicating she would review the IT4IT value streams to see what solutions could be used.

As she expects, there is one value stream left she has not looked at: Detect to Correct. When she looks into the details of the Detect to Correct Value Stream Diagram, she comes up with a solution for both issues.

Figure 10. Detect to Correct Value Stream Diagram

The first item is to make sure it’s possible to compare the actual state of the production environment with what is known in the Architecture Repository. To achieve this, Kathleen notices that the IT4IT Reference Architecture maintains the “Actual Product Instance” within the Configuration functional component. This Actual Product Instance data object is part of the Digital Product Backbone, and all data objects in the Digital Product Backbone have defined relationships. So, the solution is based on the information captured in the configuration management system of production, which includes the ability to discover the actual environment. The comparison can then be made with the “Desired Product Instance”, which is maintained by the “Fulfillment Orchestration” component. When something is implemented in production outside of the approval process, it is picked up by discovery; but such a change would not have changed the “Desired Product Instance”, so the difference in the comparison would indicate something has changed without approval.

In order to solve this, she designs a system so that when the discovery reveals something new, it creates an Incident, assigned to the Enterprise Architecture team who can then investigate what to do with this product. Maybe not the best way, but for the time being a “good enough” solution. She makes a note to herself to look into bringing the Actual Product Instance into the Architecture Repository and bring this idea back to The Open Group IT4IT Forum; perhaps this is an idea for a future release of the standard.
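A minimal sketch of that closed loop follows, assuming each product instance can be reduced to a name and a version. The ProductInstance and Incident classes, the detect_unapproved function, and the sample records are hypothetical illustrations, not structures defined by the IT4IT Standard. Treating a version mismatch as a discrepancy is also an assumption that goes slightly beyond "something new was discovered".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductInstance:
    """Hypothetical, simplified stand-in for a product instance record."""
    name: str
    version: str


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record raised for the architects to review."""
    summary: str
    assigned_to: str


def detect_unapproved(desired, actual, assignee="Enterprise Architecture"):
    """Compare discovered (actual) instances against approved (desired) ones.

    Returns one Incident per discovered instance that is either unknown
    to the desired state or running a different version than approved.
    """
    desired_by_name = {d.name: d for d in desired}
    incidents = []
    for item in sorted(actual, key=lambda a: a.name):
        approved = desired_by_name.get(item.name)
        if approved is None:
            incidents.append(
                Incident(f"Discovered unknown instance: {item.name}", assignee)
            )
        elif approved.version != item.version:
            incidents.append(
                Incident(
                    f"Version drift on {item.name}: "
                    f"desired {approved.version}, actual {item.version}",
                    assignee,
                )
            )
    return incidents


# Made-up example data: one approved product, plus an agent rolled out
# without going through the approval process.
desired = [ProductInstance("customer-portal", "2.1")]
actual = [
    ProductInstance("customer-portal", "2.1"),
    ProductInstance("discovery-agent", "0.9"),
]
incidents = detect_unapproved(desired, actual)
for incident in incidents:
    print(incident.assigned_to, "<-", incident.summary)
```

In this made-up example only the discovery agent is flagged, because it appears in the discovered environment but has no counterpart in the desired state.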

The big benefit for the Enterprise Architecture team is that, by implementing this closed loop, the governance of the architecture across the organization will be much easier, because it is based on actual information stored in systems available to everybody, rather than only on architecture diagrams used by architects.

The other part of the solution is to ensure the Change component, and especially the Change data object, contains a register of changes. A Change can be created from any source, including the Fulfillment Orchestration functional component. Any CI/CD tool chain would have to be integrated, not only from the stream-aligned teams, but from any team, including the Operations team. All changes, of any type, need to create a record in the system, even those from teams not using a CI/CD tool chain. The aim is not to enforce a process, but at least to have all changes in a central repository. This would provide a consolidated view of all changes – those done manually and those done via automation. This would give future incident swarms a good overview of what has been changed.
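The idea of a single register that accepts changes from any source can be sketched as follows. This is an illustration only: the Change fields, the ChangeRegister class, and the sample entries and timestamps are assumptions made for the example, not the Change data object as specified by the IT4IT Standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical, minimal change record."""
    team: str
    description: str
    source: str  # e.g. "ci-cd" or "manual"
    implemented_at: datetime


class ChangeRegister:
    """Central list of changes; it records them but enforces no process."""

    def __init__(self):
        self._changes = []

    def record(self, change):
        """Add a change from any team or tool chain."""
        self._changes.append(change)

    def since(self, cutoff):
        """Return all changes implemented at or after cutoff, oldest first."""
        recent = [c for c in self._changes if c.implemented_at >= cutoff]
        return sorted(recent, key=lambda c: c.implemented_at)


# Made-up entries and timestamps, purely for illustration.
now = datetime(2021, 1, 10, 18, 0)
register = ChangeRegister()
register.record(
    Change("stream-aligned", "Release of customer portal", "ci-cd",
           now - timedelta(days=5))
)
register.record(
    Change("Operations", "Roll out discovery agents", "manual",
           now - timedelta(days=1))
)
register.record(
    Change("Security", "Apply new policy set", "manual",
           now - timedelta(hours=6))
)

# What an incident swarm might ask for: everything in the last 48 hours.
recent = register.since(now - timedelta(hours=48))
for change in recent:
    print(change.implemented_at, change.team, change.source, change.description)
```

Because manual and automated changes land in the same list, a query over the last 48 hours returns the Operations and Security changes alongside anything pushed through a CI/CD tool chain.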

She also decides that the people from the Operations team will become part of the Enabling team, with a focus on the digitalization of the Product Management lifecycle. Security will need to align to this as well. The big benefit Kathleen sees for the Enterprise Architecture team is that by adding this functional component, and connecting the Change to the Desired Product Instance, it is possible to track the transition states and the completeness of the target state of the architecture planning, allowing full visibility into Architecture Change Management. This will be a tremendous benefit for the coordination of releases and for the communication to employees and customers about upcoming features available for the Digital Customer Intimacy strategy.

As with the other implementations, she submits an Idea into the portfolio management system and, before she knows it, it is approved and the implementation of the integrations starts.

Reducing Complexity

With the end of summer right around the corner, Kathleen’s mind wanders back to the cabin and how nice it must be up in the mountains right now. She will talk to Sven about another trip when things start winding down at work in a few months.

Pulling herself back to focus, Kathleen calls in her architects to discuss the next iteration of architecture design. Throughout the implementation of the various new products, some solutions implemented by the various teams had overlapped with existing solutions. More and more signals from the business users indicated that they were getting fed up with the constant need to work in different systems which had this overlap. It was getting harder and harder to get a good overview.

She discusses this with Dick. “That’s something that is bugging me as well,” he says, “but from a different angle. We as digital technology professionals have to constantly keep more systems up and running, so even though the new systems are more efficient in maintenance and operations, we can’t show any cost reduction back to the business, because we now have more systems, not fewer. I think it is time we formed a group to start driving application rationalization. Can you try to get all the stakeholders aligned and prepare a way to assess the different applications so we can decide which ones we must sustain and which ones we can retire?”

Kathleen agrees and tasks Nick to drive the application rationalization initiative. Some weeks later, Nick presents his findings to the group.

Nick starts his presentation by flashing up a slide of a Capability Map on the screen. Jokingly he says, “It is actually very simple, people! The solution to the assessment is the Capability Map. Period. Carry on. The end.” He sits back, grinning.

“But all joking aside,” he continues quickly, “what is true is that the main framework to support an assessment for application rationalization is the Capability Map.”

Over the next hour Nick explains how the map can be used to assess which applications are connected to which capability, and how a classification method along the axes of Value, Cost, and Technology could be used to score each application to identify what type of transformation would need to be done, using the 5Rs of Gartner® [25].

“For the situation where one capability is being delivered by multiple applications, we can choose the best one(s) to sustain by selecting the one(s) with the highest scores on the three axes. For the situation where there is no application covering a necessary capability, the scoring can be used to identify which of those gaps – if any – are a priority to instantiate.”
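One way to picture the scoring Nick describes is the sketch below. It rests on several assumptions that are not in his presentation: each axis is scored from 1 to 5 with higher being better (so a high Cost score means low cost), the three axes are weighted equally, and exactly one application is sustained per capability. The application names, capability names, and scores are invented for the example.

```python
from collections import defaultdict

# (application, capability, value, cost, technology); invented sample scores.
scores = [
    ("ClaimsApp A", "Claims Handling", 4, 3, 5),
    ("ClaimsApp B", "Claims Handling", 2, 2, 3),
    ("PolicyAdmin", "Policy Administration", 3, 4, 4),
]

# Capabilities the business needs, taken from a hypothetical Capability Map.
required = ["Claims Handling", "Policy Administration", "Customer Self-Service"]


def rationalize(scores, required):
    """Group applications by capability and rank them by total score.

    Returns (sustain, review, gaps):
      sustain: capability -> highest-scoring application
      review:  lower-scoring overlapping applications to examine further
      gaps:    required capabilities with no application at all
    """
    by_capability = defaultdict(list)
    for app, capability, value, cost, technology in scores:
        by_capability[capability].append((value + cost + technology, app))

    sustain = {}
    review = []
    for capability, candidates in by_capability.items():
        candidates.sort(reverse=True)  # highest total first
        sustain[capability] = candidates[0][1]
        review.extend(app for _, app in candidates[1:])

    gaps = [c for c in required if c not in by_capability]
    return sustain, review, gaps


sustain, review, gaps = rationalize(scores, required)
print("Sustain:", sustain)
print("Review for transformation:", review)
print("Capability gaps:", gaps)
```

The applications in the review list are only candidates: deciding which transformation applies to each of them is a separate step that this sketch does not attempt.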

They all agree the approach is a sensible one and Kathleen agrees that the next step is to take Nick with her to give Dick the five-minute executive summary version of it.

Before ending the meeting, Kathleen addresses the team.

“As you were talking about the application rationalization portion of your presentation, I was thinking about another gap we could fill using the IT4IT Standard. A week or so ago, I ran across an IT4IT guidance paper in The Open Group Library that gives step-by-step instructions on how to use the IT4IT Reference Architecture to rationalize your tool landscape [26]. We definitely have lots of product/service management tool duplication and redundancy across our functional organizations. In the process of building the integrations between the Enterprise Architecture tool and the Portfolio Management tools we found lots of redundancy and had to pick a primary source system for the integration. People will need to start using the source systems and quit using their disparate document collaboration sites for collecting important information. Otherwise, we are never going to have a centralized way of capturing the flow from Strategy through Operations. I will work out some instructions, which I will send out to all of you on the integrations that establish flow from Strategy to Portfolio. Furthermore, in addition to the integrations we set up, there also needs to be some criteria and governance in place for when something new is proposed for the portfolio. For example, we should never start work on something that is not directly tied to the company strategy. Also, we should always look to see what we already have across the company before we start creating a new capability. That is why our capability assessments are so important. Although we have established several integrations that can feed governance, the actual activities are not fully in place yet. We still need to improve our governance processes to ensure we are working on the right things going forward.”

“I am pretty well-versed on the IT4IT Standard,” says Terri, “and I actually participate in one of The Open Group IT4IT Forum work groups. So please, just let me know if you have any questions. If I don’t know the answer, I can probably find someone in the IT4IT Forum who does. I’m pretty well connected to many of the core members through the quarterly live events and through the virtual work group meetings I attend every month. Originally, I started going to the events to learn more about the TOGAF Standard. That’s where I stumbled across the IT4IT Standard. I’ve seen how well it works together with the TOGAF Standard.”

“Oh, that’s great, Terri,” says Kathleen. “I’m just learning about the breadth of the IT4IT Reference Architecture and the business value. The IT4IT Value Network has really helped us make our digital implementations cheaper, better, and faster. I’m still a novice, so thanks for letting me know about your network. I’ll definitely use you as a source of knowledge and expertise. The more I dig into the IT4IT Standard, the more I realize the strength of the framework for the governance of products and services. Now that our company is on its way to becoming digital and the business and technology are blending together when it comes to our strategy, I really believe we can use the IT4IT Standard to bring traceability to our company’s digital products from strategy through execution and into operations.”

Everyone nods in agreement and Terri finds herself smiling. She’s been trying for years to get the Enterprise Architecture team to use a consistent framework, like the IT4IT Reference Architecture, to manage products and services across the company. She was happy to see that others were realizing this, and it was finally happening. “You’ve got my support on using the IT4IT Reference Architecture for managing and governing our digital products and services,” she says. “Just let me know how I can help.”

“Well, I’ve not had time,” says Kathleen, “to devote to putting a new governance structure in place and I have been pondering who I would have confidence in to lead this important piece of our Digital Transformation. It would entail a big change from our current project-based governance to a more Agile digital product-based governance practice. Are you interested in taking the lead on creating a new governance function for managing architecture changes?”

“Sure, I would definitely like to take the lead,” says Terri. “It’s exactly the kind of challenge I’ve been preparing for. Thank you so much for the opportunity.”


1. The DPBoK Standard (1), Section 6.2.3: “Operations Management”. This was originally from Site Reliability Engineering: How Google Runs Production Systems, by B. Beyer, C. Jones, J. Petoff, and N.R. Murphy, April 2016, published by O’Reilly.
2. For more information on Communities of Practice, refer to: https://www.scaledagileframework.com/communities-of-practice/ (24).