My bingo card for 2024 did not have “get invited and approved to talk at a technical event in an official capacity”.

Yet here I am, having done that while thoroughly enjoying the experience.

Red Hat invited me to speak at their Automation Everywhere event on the 21st of November 2023 in Perth, Western Australia. I was given a 15-minute slot to summarise the last four years of my work, which has involved stepping towards automated systems of work within an operational technology context. That is, as you might imagine, quite difficult to do, especially considering I had to balance the level of detail that my organisation’s corporate affairs and legal departments would be content with lest I be banished to the shadow realm and never seen again.

In this post, I will provide a debrief of the event, share my speaking notes, and reflect upon the experience.

This also provides an idea of what can be expected from my “Automating the Enterprise” series. However, that series is vendor-agnostic and generic in nature. Consider this a partial implementation of the first case study I am writing.

Before that…

IMPORTANT DISCLAIMER & DISCLOSURE

  • Views expressed in this article are my own and do not represent that of my employer and/or Red Hat.
  • I have not received any compensation or benefits from Red Hat to produce this article, nor were they permitted to provide any editorial input.
  • I occupy a role at my employer that manages commercial contracts, projects and products with Red Hat.
  • I pay for Red Hat products for private research and use cases at home.

I strive to take a neutral stance as much as possible, but please consider that the above points may introduce bias into my thoughts and opinions.

 


Automation Everywhere

The event focused specifically on how attendees could kick-start their journey to automation and orchestration, potentially with Ansible and Red Hat’s commercial offerings. I liked the minimal sales focus and no-nonsense approach of the event. This was evident when, before accepting the speaking engagement, I confirmed that Red Hat did not want a 15-minute puffery piece but rather an overview of all the opportunities and challenges I faced with automation from both a methodology and product perspective.

It started with what I thought was a rather clever slide about how automation starts simple and is usually driven by someone irritated at performing repetitive tasks. Their first step into automation may take the form of a Bash script or batch file, or could even be a higher-level Python script. The story pivoted to how people do this in isolation, often without knowing that others in a larger organisation are doing the same, and then described the challenges faced when attempting to scale highly ad hoc and specific ways of working. Couple this with the fact that not everyone within the infrastructure world has the time or desire to take their knowledge of a high-level programming language from “hello world” to executing configuration changes across a large inventory of devices, and you have three rather big business problems:

  1. How do you lower the entry point to automation?
  2. How can you build an ecosystem of partners and an open-source community that simplifies the journey to automation?
  3. Who will support all of this once it is built?

Spoiler alert. Red Hat believes they can solve these business problems with the Ansible Automation Platform.


Owned by and used with permission from Red Hat.

I found that the journey mirrored my own experiences quite closely. I had previously written PowerShell scripts ad hoc throughout 2017 to triage radio-related incidents and programmatically generate reports through product APIs. I took an interest in network automation in 2018. Still, I lacked the business, people and technical skills to articulate how automation would assist us in reducing risk and “achieving more with less”. When the Cisco DevNet certification program was launched in 2020, I even saw teams tackle automation in an ad hoc manner. None of this was terrible. However, it was not scalable. This was not a problem concerned with people and their capabilities. It was a problem born from technical teams being unable to articulate why investment in process and tooling changes was necessary to implement and sustain automated working methods.

The presentation then explored Ansible Tower’s evolution and why re-architecting towards a distributed architecture was necessary to support automation at scale with the Ansible Automation Platform, especially in releases 2.0 and above. Event Driven Automation was covered as a feature that can assist teams with closing the gap between event monitoring, incident management, and problem management practices. There was a strong focus on how customers could peruse content collections submitted by the open-source community through Ansible Galaxy to avoid reinventing the wheel. Red Hat also highlighted how they support and verify functionality for the commercial “certified content collection” offering.

 


Owned by and used with permission from Red Hat.

 

Owned by and used with permission from Red Hat

If there was a central theme to the event, it was simplicity.

This was reiterated with a demo of Red Hat Ansible Lightspeed with IBM watsonx Code Assistant. This generative AI service uses natural language processing to turn written prompts into code snippets for the creation of Ansible playbooks. The idea is to take the collective expertise of Red Hat playbook developers, provide a tool that solves inconsistent playbook development, lower the barrier to entry for people who want to automate tasks but not necessarily learn the specifics of YAML and Ansible syntax, and give organisations the assurance that they are in control of their data and how it is used for machine learning training purposes.

Owned by and used with permission from Red Hat.

Dear reader, let me level with you for a moment.

I am highly sceptical of AI’s actual value in simplifying the maintenance of complex systems. These tools are a neat and exciting concept; however, I have yet to see how they can lower a business’ operational risk. I believe that AI tools can enhance an organisation’s culture by empowering people, especially the non-technical crowd, to become better problem solvers… with some specific caveats. Let us ignore the cyber and data privacy concerns, as these come with any new tool.

For example, how do you tie AI tooling back to a measurable decrease in OPEX or CAPEX unless you make roles redundant? Even if you see productivity increases, is it because of AI specifically, or is it because adopting AI forced you to increase process maturity? How do you avoid a scenario where your workforce loses capability and becomes overly dependent and reliant upon AI tooling to perform their BAU activities? These all effectively boil down to the same questions I had to answer when proposing automation: how do you build guardrails and a culture of operational discipline that prevents people from acting upon the temptation to automate undefined business processes within a complex system such as a network?

Given these thoughts…

I think there is genuine potential with Ansible Lightspeed. I like how intuitive and straightforward the demo is. I am curious to see whether I can put it in front of a control system engineer, give them Visual Studio Code with Ansible Lightspeed (trained with context on the environment and its technical development standards), and get them to develop and maintain their playbooks. I have yet to fully form an opinion on this particular tool; however, I recommend checking it out to draw your own conclusions.

Sharing My Journey with Automation

After a quick break and an excellent talk by my fellow speaker, David Rose from Horizon Power, I took the stage.

I informed the audience that I would not focus on the technical aspects of automation but on the challenges brought by proposing its use at a central hazard facility.

I have modified my original 15-minute speech to read better in an article format and to include responses from the Q&A panel and conversations I had while networking with people. As this speech has been modified, please recall that it represents my own opinions and not those of my employer!


Good afternoon,

My name is Luke Snell, and I am a Principal for networking at BHP Olympic Dam. If I could summarise my role in a single sentence, it would be to solve complex business problems safely with technology. I focus primarily on operational technology networks, also known as OT networks.

There are plenty of excellent resources to learn the technical challenges and use cases of automation; however, I am here today to cover the non-technical elements of our automation journey. You will learn about some of the many challenges we have faced when considering OT NetDevOps.

So, what is Olympic Dam and why is our work so important?

BHP’s purpose is to bring people and resources together to build a better world, and in doing so, BHP has recognised that copper is essential to life and our modern society. Olympic Dam is a fully integrated underground mining and surface processing facility that transforms ore to metal. One thing I love about working in the Olympic Dam Technology team is that I can see how our OT network enables the production of gold and silver bullion that I can then hold in my hands just down the road at the Perth Mint.


Information courtesy of BHP and publicly available at https://bhp.com

Before I proceed, let’s first discuss the difference between a traditional enterprise IT network and an industrial OT network.

My take on it is that it’s a difference in who consumes the network and how they go about it.

IT networks are typically interactively consumed by people. There is someone physically in front of a device accessing their enterprise applications. A failure within an IT network can be painful and impact the business; however, it’s unlikely to cause explicit harm towards people or the environment.

OT networks, on the other hand, are generally consumed by things in a non-interactive manner. A device is typically programmed to relay information or command sequences between different devices in a manual, semi-autonomous, or autonomous fashion. A failure within an OT network can result in catastrophic consequences that may seriously injure or kill people. Understanding and controlling risk is mandatory to administer an OT network safely.

OK, so now that we’ve got environmental context and some definitions in place, let’s make the rest of the talk all about the journey.

The first step in automation must be to understand your business and how technology acts as an enabler.

Olympic Dam is a phenomenal operation. It is located 560 kilometres (~348 miles) north of Adelaide in South Australia. It consists of underground and surface operations. Ore is mined underground and hauled by an automated train system to crushing, storage and ore hoisting facilities or trucked directly to the surface via declines. There are approximately 700 kilometres (~435 miles) of roads and tunnels underground. Once the ore reaches the surface, it is processed through grinding, flotation and leach circuits, a hydrometallurgical plant incorporating solvent extraction circuits for copper and uranium, a copper smelter, a copper refinery and a recovery circuit for precious metals. If that sounds like a lot, it’s because it is!

Information courtesy of BHP and publicly available [here], [here] and [here]

The challenge that we face within networking is like that of a workshop facilitator who strives to enable people to communicate effectively. If we think about the primary objective of a conversation, it is sharing information so that it may be acted upon. What we are trying to communicate influences how we communicate, and then we follow social cues to know when we need to listen or respond. Think about this afternoon. We have established a topic for communication where folks here hope to understand how automation everywhere can help their business! To enable this, much thought was put into the forum and how communication occurs. This isn’t a private unicast conversation between yourself and me, nor a public space where we broadcast our thoughts and opinions, hoping that members of the public will tune in and listen to us. No, it is an event that people had to register for in order to tune into a specific conversation in a 1-to-many manner: multicast! I reiterate that the context of what we communicate is critical in shaping the how. If we were discussing highly confidential information in today’s forum, then it goes without saying that there would be restrictions around what we communicate and with whom. Finally, the when – well, nothing we have to say means anything if we sit here in silence, does it? Try starting a conversation with someone and then randomly pausing for 10 seconds as you’re saying something – it’s quite awkward, isn’t it?

Thinking of problems in networking in this manner naturally reveals why cyber security practices must be embedded into all aspects of network management, as opposed to being considered an afterthought or bolted on at the end. Think of communications within the context of the CIA triad – confidentiality, availability, and integrity. Confidentiality – we must think carefully about what information we are communicating and how to keep it from the prying eyes of unauthorised parties. Availability – a conversation cannot occur if information cannot be transmitted or received within a negotiated timeframe. Integrity – how do we know that information can be acted upon safely if we do not verify its legitimacy?

Let’s bring these concepts home. The challenge we face with enabling our business is ensuring that the tens of thousands of conversations across the OT network are facilitated securely and in a truly highly available manner. This requires us to guarantee that our network infrastructure remains consistently configured to align with the intention of the business, to prove compliance with all the dependent systems’ requirements, and to understand all of those dependencies. Deviations or drift from intended operational behaviour MUST be detected and remediated safely, and if we cannot achieve that, then we MUST force the network into a recoverable safe state WITHOUT increasing our risk profile. That’s just the operational piece! We also need to empower problem solvers by simplifying the task of obtaining live data about the network’s state. Without this, ideating and understanding the risk that change brings to a complex system is challenging, which degrades our people’s ability to innovate.
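As a rough illustration of the "detect drift, remediate safely, otherwise fall back to a safe state" logic described above, here is a minimal Python sketch. It is a toy under stated assumptions rather than a description of any real tooling: the setting names, the `DriftReport` type, and the `detect_drift` and `next_action` functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DriftReport:
    missing: dict     # intended settings that are absent from the device
    changed: dict     # setting -> (intended value, running value)
    unexpected: dict  # settings on the device that nobody intended


def detect_drift(intended: dict, running: dict) -> DriftReport:
    """Compare the intended state of one device with its running state."""
    missing = {k: v for k, v in intended.items() if k not in running}
    changed = {
        k: (v, running[k])
        for k, v in intended.items()
        if k in running and running[k] != v
    }
    unexpected = {k: v for k, v in running.items() if k not in intended}
    return DriftReport(missing, changed, unexpected)


def next_action(report: DriftReport, safe_to_remediate: bool) -> str:
    """Pick one of: do nothing, remediate, or force a recoverable safe state."""
    if not (report.missing or report.changed or report.unexpected):
        return "compliant"
    return "remediate" if safe_to_remediate else "force_safe_state"


# Hypothetical intended and running settings for a single switch.
intended = {"vlan": 110, "port_security": True, "ntp_server": "10.0.0.1"}
running = {"vlan": 120, "port_security": True, "telnet": True}

report = detect_drift(intended, running)
action = next_action(report, safe_to_remediate=False)
print(report)
print(action)
```

The sketch deliberately takes `safe_to_remediate` as an input. Deciding whether remediation is safe is the hard, organisation-specific part, and it belongs in your risk management practices rather than in a few lines of code.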

The second step in your automation journey is to identify how to take a business-driven approach that focuses on risk reduction.

From a technical perspective, automating value streams and processes is an absolute no-brainer. The challenge lies in being able to demonstrate to EVERYONE, not just leadership, that doing so WILL NOT increase the operational risk of your business. The last thing you want is a system whose failure creates a rapid and uncontained cascading impact on safety and production systems. This was a common fear, and we had to understand how to address it lest our automation journey fail before it even started.

So, the questions that we asked ourselves were:

  1. How do we ensure this journey is rooted in safety?
  2. How do we bowtie the entire activity back to our risk management practices?
  3. How do we avoid analysis paralysis and actually start the work by progressing iteratively and with customer feedback?

The answers were right in front of us, and to get them, we needed to put on our personal protective equipment (PPE) and delve into the field!

Industrial verticals, such as mining, have given these problems a great deal of thought since the Industrial Revolution. Manual tasks were offloaded to a basic process control system. As riskier tasks were automated, safety instrumented systems began to crop up to mitigate and reduce the likelihood of uncontained failure. Functional Safety Engineering became a discipline, and standards were developed by organisations such as the IEC to ensure these systems were designed with due diligence in mind. SCADA systems became more popular to ensure people could safely control large-scale hazardous operations remotely without exposure to undue risk. We knew we had to talk to the people who live and breathe industrial automation daily to learn how they confronted that foundational fear. All our conversations had one central point of feedback – you cannot begin to safely automate a process that has not yet been defined, documented, and optimised as much as humanly practical.

This approach is “common sense”, so much so that ITIL 4 describes it in the guiding principle “optimize and automate”. However, common sense without environmental context is meaningless. You must understand what “common sense” means to your organisation, the risks present in executing manual work, and how automation will sustain or eliminate those risks with auditable and traceable controls. Seek criticism and run towards sceptical stakeholders; while people may be prickly, they can help you identify these risks and controls. Networks are complex systems, so it’s important to involve the right number of people who are dependent on automation while you are planning that journey. Finally, large organisations MUST avoid the temptation to develop and replicate an architectural pattern without first understanding if assumed risks and ways of working remain consistent across different business units.

The third step in your automation journey is to ask the question – Are we mature enough to do this safely, and if so, what practices are required to build a sensible methodology?

You cannot safely commence automating value streams and processes within a system whose dependencies you do not fully understand. This is risky and negligent behaviour.

Consider a simple conveyor belt system used between two plants. Operating and maintaining the conveyor belts is simple when we consider them as components. However, when we consider them as part of an integrated system, or mechanism, several dependencies between the components become immediately obvious. What happens if the conveyor system consists of three belts and one of them stops while the other two run? What logical services depend on the conveyor system and are then impacted and faulted if it is not available? What do these faults mean for the conceptual architecture of a system of systems, and what follow-on business impacts are felt? It is a simple thought exercise that illustrates rapid cascading impact.
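The thought exercise can be expressed as a walk over a dependency map. The Python sketch below is illustrative only: the component and service names in `DEPENDS_ON` are invented for the example, and a real system of systems would have far more (and far less obvious) dependencies.

```python
from collections import deque

# Hypothetical dependency map: each key depends on every item in its list.
DEPENDS_ON = {
    "belt_1": [],
    "belt_2": [],
    "belt_3": [],
    "conveyor_system": ["belt_1", "belt_2", "belt_3"],
    "ore_transfer": ["conveyor_system"],
    "plant_b_feed": ["ore_transfer"],
    "production_reporting": ["plant_b_feed"],
}


def impacted_by(failed: str, depends_on: dict) -> list:
    """Return everything that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependants.
    dependants = {}
    for item, requirements in depends_on.items():
        for requirement in requirements:
            dependants.setdefault(requirement, []).append(item)

    # Breadth-first walk, so nearer impacts are listed before distant ones.
    seen, order, queue = {failed}, [], deque([failed])
    while queue:
        current = queue.popleft()
        for dependant in dependants.get(current, []):
            if dependant not in seen:
                seen.add(dependant)
                order.append(dependant)
                queue.append(dependant)
    return order


print(impacted_by("belt_2", DEPENDS_ON))
```

Stopping one belt in this toy model takes out the conveyor system and every service downstream of it. The walk itself is trivial; the value lies in having the dependency map at all.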

I believe that change is much simpler to identify and understand in the physical world than in the digital world.

Let us suppose a refinery needs its existing crane to be upgraded. Construction barricades, information signs with critical contacts, scaffolding, and a team of engineers will all exist. If you need to install a new network device within this refinery, it is extremely clear that the environment is being changed. You will establish contact with personnel and understand how to plan and safely execute your work within the context of all changes being executed. It is critical that simultaneous operations (“SIMOPS”) be managed carefully, so working groups in the same area do not inadvertently create conflicts and hazards.

Now let’s focus on a purely logical change within the network. How do you know whether a segment of the network is pending changes? How do you promote visibility and collaboration between all teams, especially considering some may be off-site and remote, so that conflicts in change are detected and remediated quickly? How do you avoid SIMOPS? How do you ensure that all changes are tested so that the impact on dependent systems is captured and understood? Most importantly – how do you establish work practices that provide rapid feedback and a path of least resistance to people, so that the ability to work outside of safe systems of work is contained so far as reasonably practical?
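One way to picture "digital SIMOPS" detection is to flag planned changes that overlap in both time and network segment. The Python sketch below is a simplified illustration, not a change management tool: the `PlannedChange` type, the `conflicts` function, and all change IDs, team names, and segment names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class PlannedChange:
    change_id: str
    team: str
    segments: frozenset  # network segments the change touches
    start: datetime
    end: datetime


def conflicts(changes):
    """Return pairs of changes that overlap in both time and network segment."""
    found = []
    for a, b in combinations(changes, 2):
        overlap_in_time = a.start < b.end and b.start < a.end
        shared = a.segments & b.segments
        if overlap_in_time and shared:
            found.append((a.change_id, b.change_id, sorted(shared)))
    return found


# Hypothetical change windows raised by different teams on the same day.
changes = [
    PlannedChange("CHG-1", "network", frozenset({"smelter-core"}),
                  datetime(2024, 1, 10, 8), datetime(2024, 1, 10, 12)),
    PlannedChange("CHG-2", "control-systems",
                  frozenset({"smelter-core", "refinery-edge"}),
                  datetime(2024, 1, 10, 11), datetime(2024, 1, 10, 14)),
    PlannedChange("CHG-3", "network", frozenset({"refinery-edge"}),
                  datetime(2024, 1, 10, 14), datetime(2024, 1, 10, 16)),
]

print(conflicts(changes))
```

Here only CHG-1 and CHG-2 are flagged, because they share a segment and their windows overlap. CHG-3 starts exactly when CHG-2 ends, so the sketch treats it as sequential rather than simultaneous; whether that is acceptable in practice is a risk decision, not a coding one.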

The good news is that DevOps, specifically GitOps, provides a nice foundation for you to consider as a starting point. I think that it is critical to integrate your methodology with Functional Safety Engineering standards and practices. These provide excellent information on how to demonstrate the controls required to prevent and mitigate uncontained failure within your automation system. For example, HAZOP and CHAZOP will examine the system from tip to toe, especially within the context of basic process control and/or safety instrumented systems.

The Red Hat logo is owned by and used with permission from Red Hat

I strongly believe that you must ask a series of uncomfortable questions that start with an audit of your organisation’s process maturity. Are your processes documented? Are they fit for use and fit for purpose? Are they actually used by operational teams? Are they maintained?

However, solving this problem actually poses another problem! How do you actually get started without getting stuck in analysis paralysis?

Consider using the Cynefin Framework to understand and model complexity within the contexts in which automation will be applied, benchmarking your environment against the framework. Another valuable and, in my opinion, mandatory framework is SABSA [link].

 


The Cynefin Framework


The fourth step in your journey is to assess partners, suppliers and products that support your established methodology.

We now had a methodology to draw from that demonstrated that we were framing automation as the equivalent of a digital safety instrumented system. We needed a partner who could assist us with optimising and automating existing manual value streams and processes and could work iteratively with us to demonstrate value so that we could make this project a reality.

We needed a tool from that partner which:

  1. Was established and adopted by industry, was developed with security in mind, could integrate with existing systems easily, supported custom workflows, and could scale if needed.
  2. Kept automation simple, practical, and easy to adopt.
  3. Promoted visibility of workload and was built with collaboration in mind.

If you do not work through the first three steps then you are effectively burning cash and hoping that a vendor’s particular offering meets your requirements.

I enjoy working with partners and suppliers to solve problems; however, I never trust them at face value. This isn’t because I don’t trust the people; it’s because all architectural and engineering decisions require a cost to be paid. This could be financial, cultural, ecosystem lock-in, etc. The point is that you must be able to articulate those costs, justify how you arrived at your product selection, and use the output of Steps 1 through 3 to demonstrate how your automation initiative actually focuses on value. Skipping this and going straight to the tools is a great way to get your request for funding rejected or, even worse, to find that the tools you have selected and invested in do not meet your requirements.

Automation is simple, which makes it tempting and dangerous, in my opinion.

Adopting automated systems of work requires you to confront the fact that maintenance is only created and never destroyed. Just shifted around. You’re effectively pushing maintenance away from network infrastructure towards an automation platform. It is important that you understand what that means from a systems perspective, especially for governance-related activities. If you invest in tooling without first understanding this fully, then your branches and failure modes will look a bit like this…


Image owned by Marvel and used under Fair Use guidelines

Now I can actually discuss why Red Hat.

Well, for starters, I appreciated that the team at Red Hat Perth (Tung U, Sandy Sodhi, Chris Bowden & Shaun Hofer) understood the difficulty associated with thinking about, let alone stepping towards, OT NetDevOps. This has been a four-year journey, folks, where two years centred around understanding whether the product could meet our requirements. This involved attending Perth hands-on sessions, receiving coaching from Red Hat SMEs, and understanding whether the product was fit for use.

For example, I highlighted that I could not consider Ansible Tower until its architecture supported better scaling through a distributed control, data and management plane. Then, we identified that AAP was susceptible to exploitation via external system dependencies, so we needed to wait for digital signing features. Finally, just as all looked good, we found the potential for duplicate licensing costs due to issues with our inventory system. Despite these challenges, Red Hat progressed iteratively with our feedback and did not apply any buy-in pressure. Blake Douglas, the Lead Principal Technology Tooling at BHP, was instrumental in helping me understand how we could design for an implementation that could step towards a larger-scale deployment.

OK, so why Ansible Automation Platform (“AAP”)?

I will take a moment and share the personal opinions of the technical geek, Luke Snell.

I think automation is at its best when it is kept simple and agnostic, explicitly maps back to your business’ value streams and processes, and uses tools that are easy for people to pick up and start tinkering with. I have no issues with “traditional” development practices, such as taking a Pythonic approach to automation. In fact, these were originally explored, and Red Hat has been highly transparent about when we would need to consider such approaches. However, it is a steeper learning curve, and obtaining external support can be difficult. With a product like AAP, we can focus on solving business problems with technology and then outsource the complex work to Red Hat’s business consulting services. They’ll go away, figure out how to automate the business processes, deliver the final product working with their partners whose technology we already use, and ensure Red Hat Support is kept in the loop. I think that developing competency in writing Ansible playbooks helps keep automation simple, as developing in YAML is much simpler than explaining how a person goes from “Hello World” to interacting with infrastructure via Python or Go asynchronously… while also managing library dependencies along the way. I think this is an attractive relationship for an enterprise, as you now have a supportable off-the-shelf product. You can then share your code through something like Git, which other teams can review, improve, and replicate throughout different environments across the business as they deem fit. Boom – NetDevOps in practice.

When I look at AAP’s product roadmap, even the secret stuff that has yet to be publicly announced, I am excited that the concept of simplicity remains central to the product’s development. For example, I am excited to spend more time with Event Driven Automation, as it allows me to reframe event monitoring tools as a historian platform while guaranteeing alerts will be consumed and acted upon without getting stuck in middleware integration hell. This will let me close the loop between event monitoring and incident management practices.
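To make the idea of "every alert is consumed and acted upon" a little more concrete, here is a minimal Python sketch of event-to-action dispatch. It is plain Python written for illustration and does not represent how Ansible Automation Platform implements the feature: the `Rule` type, the `dispatch` function, the event fields, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # decides whether the rule applies to an event
    action: str                        # hypothetical name of the workflow to launch


def dispatch(event: dict, rules: list) -> str:
    """Return the action of the first matching rule, or escalate to a person."""
    for rule in rules:
        if rule.condition(event):
            return rule.action
    # No rule matched: the event is still consumed, by raising it for review.
    return "raise_incident_for_human_review"


# Hypothetical rules that map monitoring events to automation workflows.
RULES = [
    Rule("link down",
         lambda e: e.get("type") == "link_down",
         "collect_interface_diagnostics"),
    Rule("high cpu",
         lambda e: e.get("type") == "cpu" and e.get("value", 0) > 90,
         "capture_process_snapshot"),
]

print(dispatch({"type": "link_down", "device": "sw-01"}, RULES))
print(dispatch({"type": "cpu", "value": 50}, RULES))
```

The fallback branch is the important design choice: an event that matches no rule is escalated rather than silently dropped, which is what closes the loop between event monitoring and incident management.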

I will tease two super cool ideas that we’re exploring. The first is understanding how AI tools, such as Ansible Lightspeed, can reduce the barrier to entry for automation so that we can put these tools in even more people’s hands. The second is understanding what extending digital automation workflows into the physical world means, primarily through the safety lens. As for the specifics? Spoilers!

Final thoughts on automation…

As I wrap up, I want to share that automation everywhere does not necessarily mean automating every process end-to-end. It can mean automating what’s suitable for your organisation at a particular moment. Keep things simple but practical with security always front of mind, so that you can focus on value.

Don’t just sit there and think about it though. Start prototyping and tinkering – now!

Get your frontline operational teams involved and encourage them to challenge the norm, as this will point out the most valuable work that requires streamlining first. You can do this by contacting the folks at Red Hat for a hands-on demo and sharing your challenges and use cases. If you’re cost-constrained, then consider the open-source upstream project Ansible AWX. It’s better to start the ball rolling than to do nothing.

In short – Think Big.

Post Event Reflection

This was my first time speaking in an official capacity at a public forum, and I think it showed.

The process of obtaining internal approvals in an organisation as large as mine was something I was unfamiliar with and had not worked through. This made it difficult to prepare in advance, as I was unsure how much detail I could delve into during my presentation. Despite this, I received much internal support to push ahead and have even been told to consider submitting papers for future events.

I learned that I enjoy public speaking. I wanted to walk around the presentation area to talk, so I looked a bit crazy, trying to remain still. The post-talk networking session allowed me to speak to other like-minded folks. For example, I had a fascinating talk about how one person built a scheduling and load-balancing system to automate the provisioning of supercomputer resources. How cool is that?


From Left to Right: David Rose, Luke Snell (me!), Shaun Hofer
I do not consider myself an automation expert, as I still have plenty to learn, but I am glad to share my stories nonetheless!

 


Photo courtesy of Tung U [link]
It is pretty much a packed-out room!

Learning Resources

Consider checking out these non-affiliate links to free and paid resources to help you upskill your knowledge of automation, Ansible, and operational technology.

I do not recommend paid content unless I have personally paid for it and used it.

Ansible

  • (open source) https://github.com/ansible
  • Red Hat Ansible Automation Platform [link]
  • Red Hat Ansible Lightspeed with IBM watsonx Code Assistant [link]

Educational Content

  • (free) Red Hat – Ansible Basics: Automation Technical Overview [link]
  • (paid) Red Hat Learning Subscription [link]
  • (free) Alex Dworjan – YouTube channel with an excellent repository of Ansible content [link]
  • (free) Network Automation Nerds Podcast [link]
  • (paid) Nick Russo – Automating Networks with Ansible the Right Way [link]
  • (paid) Nick Russo – Automating Multi-Vendor and Cloud Networks Using Ansible [link]
  • (free) The Art of Network Engineering podcast – automation topics: 21, 75, 82, 133
  • (free) The Art of Network Engineering podcast – What is OT vs IT? Josh Varghese from Traceroute LLC [link]
  • (free) The Hedge Podcast: 3, 85, 90, 104, 193, 194, 199, 200, 203, 204
  • (free) RealPars – YouTube channel full of excellent OT and industrial automation content [link]

Special Thanks

Thank you to the BHP Olympic Dam Technology leadership team for continuing to push me to excel and succeed within my profession. Huge shout out to Blake Douglas for being a technical coach throughout the last year!

Thank you to the following people at Red Hat for inviting me to speak at the event and for being patient and supportive of my insanity over the last four years.

Tung U, Sandy Sodhi, Shaun Hofer, Chris Bowden, Simon Delord, Frederick Son