Ansible AWX: Automated Rolling Patching for 500+ Linux Servers [ka83]

Automate patching 500+ Linux servers with Ansible AWX. Learn how to implement rolling batch updates for efficient, scalable infrastructure maintenance.

Overview

Managing and maintaining a fleet of Linux servers is a substantial task, and when that fleet scales to 500, 1,000, or more instances, the coordination effort grows with it. One of the most critical, yet often dreaded, operational tasks is patching. Keeping all servers up to date with the latest security fixes and performance enhancements is essential for stability, security, and compliance. However, doing this manually or with rudimentary scripts across hundreds of servers leads to inconsistencies, human error, extended downtime, and operational fatigue.

The traditional "patch Tuesday" approach, where all servers are updated simultaneously in one maintenance window, is poorly suited to large-scale, high-availability environments. A single bad package or failed reboot can hit every server at once, which maximizes the blast radius of unexpected issues. A more robust strategy is automated, controlled patching in rolling batches. This approach minimizes risk by updating a small subset of servers at a time, allowing for monitoring and validation before proceeding to the next batch. Should an issue arise, it is confined to a limited number of systems, enabling quick rollback or remediation without impacting the entire infrastructure.

Enter Ansible AWX, the upstream open-source project behind Red Hat Ansible Tower (now the automation controller in Ansible Automation Platform). AWX wraps Ansible's automation capabilities in a web-based UI with role-based access control (RBAC), centralized logging, and a workflow engine. For patching 500+ Linux servers in rolling batches, it provides the orchestration, visibility, and control needed to run multi-stage patching processes reliably, reducing operational overhead and improving the security posture of your Linux estate. As Sujay Singh, a senior technology writer at TechNews Venture, I've seen firsthand how organizations use AWX to move from patching chaos to controlled, repeatable, and auditable automation.

Prerequisites

Before diving into the implementation details, ensure you have the following prerequisites in place:

Ansible AWX Instance: A running and accessible AWX instance. Recent AWX releases are installed on Kubernetes/OpenShift through the AWX Operator, which is the recommended route for production scalability and resilience. The Docker Compose setup is best kept for development and small proof-of-concept environments.
Ansible Knowledge: A solid understanding of Ansible playbooks, roles, modules, and inventory management.
SSH Connectivity: The AWX host (specifically, the Ansible execution environment) must have SSH connectivity to all target Linux servers. This typically involves configuring an SSH key pair within AWX and distributing the public key to the ~/.ssh/authorized_keys file for the Ansible user on all target servers.
Sudo Privileges: The Ansible user on the target Linux servers must have passwordless sudo privileges for executing patching commands (e.g., yum update, apt upgrade, reboot). This is often configured via an entry in the /etc/sudoers file, like ansibleuser ALL=(ALL) NOPASSWD: ALL (though more restrictive rules are recommended for production).
Version Control System (VCS): A Git repository (e.g., GitHub, GitLab, Bitbucket, or an internal Git server) to store your Ansible playbooks and inventory configurations. AWX integrates seamlessly with Git.
Network Connectivity: Ensure that target servers can reach their respective package repositories (e.g., EPEL, RHEL repos, Ubuntu PPAs) and that AWX can communicate with all target servers over SSH (port 22). If using dynamic inventory from a cloud provider, AWX also needs network access to the cloud API endpoints.
Cloud Provider Credentials (Optional but Recommended): If using dynamic inventory from cloud providers like AWS, Azure, or GCP, you'll need appropriate API credentials (e.g., AWS IAM access keys with permissions to list EC2 instances).
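
The SSH and sudo prerequisites above can themselves be applied with a small bootstrap play. The following is a minimal sketch, not a hardened setup: the account name ansibleuser, the public key path files/awx_id_ed25519.pub, and the use of the ansible.posix collection are assumptions, and the play has to be run once with an existing privileged login.

```yaml
---
# Hypothetical bootstrap play; account name and key path are placeholders.
- name: Prepare the automation account on target servers
  hosts: all
  become: true
  vars:
    automation_user: ansibleuser                       # assumed account name
    automation_pubkey_file: files/awx_id_ed25519.pub   # assumed path in the repo
  tasks:
    - name: Create the automation user
      ansible.builtin.user:
        name: "{{ automation_user }}"
        shell: /bin/bash
        state: present

    - name: Authorize the public key that matches the AWX Machine credential
      ansible.posix.authorized_key:
        user: "{{ automation_user }}"
        key: "{{ lookup('ansible.builtin.file', automation_pubkey_file) }}"
        state: present

    - name: Grant passwordless sudo through a validated drop-in file
      ansible.builtin.copy:
        dest: "/etc/sudoers.d/{{ automation_user }}"
        content: "{{ automation_user }} ALL=(ALL) NOPASSWD: ALL\n"
        owner: root
        group: root
        mode: "0440"
        validate: /usr/sbin/visudo -cf %s   # refuse to install a broken sudoers file
```

Writing to /etc/sudoers.d with validate avoids editing /etc/sudoers directly and prevents a syntax error from locking out sudo.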

Step-by-Step Implementation

1. Setting up AWX and Inventory

Assuming your AWX instance is up and running, the first step is to configure your inventory. For 500+ servers, dynamic inventory is crucial. We'll use AWS EC2 as an example, but similar principles apply to other cloud providers or even custom inventory scripts.

1.1. Create an AWS Credential in AWX

Navigate to Credentials in the AWX UI. Click Add.

Name: AWS EC2 Inventory Credentials
Organization: Select your organization
Credential Type: Amazon Web Services
Access Key: AKIAIOSFODNN7EXAMPLE (replace with your actual AWS Access Key ID)
Secret Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY (replace with your actual AWS Secret Access Key)

Ensure the IAM user associated with these keys has permissions like ec2:DescribeInstances, ec2:DescribeRegions, etc., to fetch instance details.

1.2. Create an AWS EC2 Dynamic Inventory Source

In your Git repository, create a file named aws_ec2.yml. This is the inventory plugin configuration.


# ~/ansible-repo/inventory/aws_ec2.yml
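plugin: aws_ec2
regions:
  - us-east-1
  - us-west-2
filters:
  instance-state-name: running
  tag:Environment: production
  tag:PatchGroup: LinuxServers
keyed_groups:
  # The default separator is "_", so the prefix needs no trailing underscore.
  - key: tags.PatchGroup
    prefix: patch_group
  - key: tags.OS
    prefix: os
compose:
  ansible_host: public_ip_address
  # compose values are Jinja2 expressions, so a literal string needs inner quotes.
  ansible_user: "'ec2-user'"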

This configuration tells the aws_ec2 plugin to discover running instances in us-east-1 and us-west-2, specifically those tagged with Environment: production and PatchGroup: LinuxServers. It creates groups based on the PatchGroup and OS tags and sets the ansible_host to the public IP and ansible_user to ec2-user. For private networks, you might use private_ip_address and ensure AWX can reach those IPs.
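
For the private-network case mentioned above, a variant of the same configuration might look like the sketch below. The file name is hypothetical (the plugin expects it to end in aws_ec2.yml or aws_ec2.yaml), and it assumes AWX can route to the instances' private addresses.

```yaml
# inventory/private.aws_ec2.yml (hypothetical file name)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
  tag:Environment: production
# Name each inventory host after its private IP instead of its public DNS name.
hostnames:
  - private-ip-address
compose:
  # Connect over the private address; the value is a Jinja2 expression.
  ansible_host: private_ip_address
```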

Next, in AWX UI:

Navigate to Inventories and click Add to create a new Inventory.
Name: Production Linux Servers
Organization: Select your organization
Save the inventory.

Now, within the newly created inventory, go to the Sources tab and click Add.

Name: AWS EC2 Dynamic Source
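Source: Sourced from a Project (this is the source type that reads an inventory file from Git; the built-in Amazon EC2 source takes its settings from the Source Variables field instead)
Project: Linux Patching Playbooks (created in section 3.1; it must exist and be synced before this step)
Credential: Select the AWS EC2 Inventory Credentials you created.
Inventory file: inventory/aws_ec2.yml (path within your Git repository)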
Check Update on launch and Overwrite (if you want AWX to always pull the latest from AWS).

Save and initiate a sync. You should see your EC2 instances populate the inventory.
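
The same inventory and source can also be declared as code with the awx.awx collection, which keeps the AWX configuration reviewable in Git. This is a sketch under assumptions: the organization is named Default, the project Linux Patching Playbooks already exists, and connection details are supplied through environment variables such as CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN.

```yaml
---
- name: Define the inventory and its source in AWX (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create the inventory
      awx.awx.inventory:
        name: Production Linux Servers
        organization: Default          # assumed organization name
        state: present

    - name: Attach a project-backed source that reads inventory/aws_ec2.yml
      awx.awx.inventory_source:
        name: AWS EC2 Dynamic Source
        inventory: Production Linux Servers
        organization: Default
        source: scm                    # "Sourced from a Project" in the UI
        source_project: Linux Patching Playbooks
        source_path: inventory/aws_ec2.yml
        credential: AWS EC2 Inventory Credentials
        overwrite: true
        update_on_launch: true
        state: present
```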

2. Creating Ansible Playbooks for Patching

Store these playbooks in your Git repository. Let's assume your repository is named ansible-repo.

2.1. Basic Patching Playbook (`patch_servers.yml`)


# ~/ansible-repo/playbooks/patch_servers.yml
---
- name: Apply OS patches to Linux servers
  hosts: all
  become: yes
  gather_facts: yes

  vars:
    reboot_required_file: /var/run/reboot-required
    reboot_marker_file: /tmp/ansible_reboot_marker

  tasks:
    - name: Ensure system is updated (yum for RHEL/CentOS)
      ansible.builtin.yum:
        name: "*"
        state: latest
        update_cache: yes
      when: ansible_facts['os_family'] == "RedHat"
      register: yum_update_result

    - name: Ensure system is updated (apt for Debian/Ubuntu)
      ansible.builtin.apt:
        update_cache: yes
        upgrade: dist
        autoclean: yes
        autoremove: yes
      when: ansible_facts['os_family'] == "Debian"
      register: apt_update_result

    # needs-restarting ships with yum-utils (dnf-utils on newer releases).
    # With -r it exits 1 when a reboot is required and 0 when it is not.
    - name: Check if reboot is required (RedHat)
      ansible.builtin.command: needs-restarting -r
      register: needs_restarting_result
      changed_when: false
      failed_when: needs_restarting_result.rc not in [0, 1]
      when: ansible_facts['os_family'] == "RedHat"

    - name: Check if reboot is required (Debian)
      ansible.builtin.stat:
        path: "{{ reboot_required_file }}"
      register: reboot_required_stat
      when: ansible_facts['os_family'] == "Debian"

    - name: Set reboot required flag for RedHat
      ansible.builtin.set_fact:
        reboot_required: true
      when:
        - ansible_facts['os_family'] == "RedHat"
        - needs_restarting_result.rc == 1

    - name: Set reboot required flag for Debian
      ansible.builtin.set_fact:
        reboot_required: true
      when:
        - ansible_facts['os_family'] == "Debian"
        - reboot_required_stat.stat.exists

    - name: Create reboot marker file if reboot is required
      ansible.builtin.file:
        path: "{{ reboot_marker_file }}"
        state: touch
        mode: '0644'
      when: reboot_required | default(false)

    - name: Perform reboot if required
      ansible.builtin.reboot:
        reboot_timeout: 600
      when: reboot_required | default(false)

2.2. Pre-Patch Health Check Playbook (`pre_patch_check.yml`)


# ~/ansible-repo/playbooks/pre_patch_check.yml
---
- name: Perform pre-patch health checks
  hosts: all
  become: no
  gather_facts: yes

  tasks:
    - name: Check disk space
      ansible.builtin.shell: df -h / | grep -v Filesystem | awk '{print $5}' | sed 's/%//g'
      register: disk_usage
      changed_when: false

    - name: Fail if disk usage is above 90%
      ansible.builtin.fail:
        msg: "Disk usage on / is {{ disk_usage.stdout }}%, exceeding 90% threshold."
      when: disk_usage.stdout | int > 90

    - name: "Check critical services status (example: httpd)"
      ansible.builtin.systemd_service:
        name: httpd
        state: started
        enabled: yes
      check_mode: true # Report status only; a health check should not start or enable the service
      register: httpd_status
      ignore_errors: true # Service might not exist on all machines

    - name: Log httpd service status
      ansible.builtin.debug:
        msg: "HTTPD service status: {{ httpd_status.status.ActiveState | default('N/A') }}"
      when: httpd_status.status is defined

    - name: Check memory usage (example)
      ansible.builtin.shell: free -m | grep Mem | awk '{print $3/$2 * 100.0}'
      register: mem_usage
      changed_when: false

    - name: Fail if memory usage is above 95%
      ansible.builtin.fail:
        msg: "Memory usage is {{ mem_usage.stdout | float | round(2) }}%, exceeding 95% threshold."
      when: mem_usage.stdout | float > 95.0

    - name: Report successful pre-patch checks
      ansible.builtin.debug:
        msg: "Pre-patch health checks passed successfully."
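
The disk and memory checks can also be derived from gathered facts instead of parsing df and free output. The sketch below is an alternative, not the playbook above rewritten; it assumes / is a separate entry in ansible_facts['mounts'] and reuses the 90% and 95% thresholds.

```yaml
---
- name: Pre-patch checks from gathered facts (sketch)
  hosts: all
  become: false
  gather_facts: true
  vars:
    # Fact entry for the root filesystem.
    root_mount: "{{ ansible_facts['mounts'] | selectattr('mount', 'equalto', '/') | first }}"
    disk_used_pct: "{{ (100 * (root_mount.size_total - root_mount.size_available) / root_mount.size_total) | round(1) }}"
    # "nocache" excludes buffers and page cache from the used figure.
    mem_used_pct: "{{ (100 * ansible_facts['memory_mb']['nocache']['used'] / ansible_facts['memtotal_mb']) | round(1) }}"
  tasks:
    - name: Assert disk and memory headroom
      ansible.builtin.assert:
        that:
          - disk_used_pct | float <= 90
          - mem_used_pct | float <= 95
        fail_msg: "Disk usage on / is {{ disk_used_pct }}% and memory usage is {{ mem_used_pct }}%; limits are 90% and 95%."
        success_msg: "Pre-patch health checks passed successfully."
```

Using facts avoids depending on the column layout of df and free, which differs between distributions and versions.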

2.3. Post-Patch Health Check Playbook (`post_patch_check.yml`)


# ~/ansible-repo/playbooks/post_patch_check.yml
---
- name: Perform post-patch health checks
  hosts: all
  become: no
  gather_facts: yes

  tasks:
    - name: Wait for server to be reachable after reboot (if it happened)
      ansible.builtin.wait_for_connection:
        timeout: 300

    - name: Check for the reboot marker file
      ansible.builtin.stat:
        path: /tmp/ansible_reboot_marker
      register: reboot_marker

    - name: Remove reboot marker file
      ansible.builtin.file:
        path: /tmp/ansible_reboot_marker
        state: absent
      when: reboot_marker.stat.exists

    - name: "Check critical services status again (example: httpd)"
      ansible.builtin.systemd_service:
        name: httpd
        state: started
        enabled: yes
      check_mode: true # Report status only, so a stopped service is detected rather than restarted
      register: httpd_status_post
      ignore_errors: true

    - name: Fail if httpd service is not running
      ansible.builtin.fail:
        msg: "HTTPD service is not running after patching."
      when: httpd_status_post.status is defined and httpd_status_post.status.ActiveState != 'active'

    - name: Verify kernel version (example for successful patch)
      ansible.builtin.debug:
        msg: "Current kernel version: {{ ansible_facts['kernel'] }}"

    - name: Report successful post-patch checks
      ansible.builtin.debug:
        msg: "Post-patch health checks passed successfully."
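
When more than one service matters, service_facts can check a whole list in one pass. This is a sketch under assumptions: the critical_services list is hypothetical and would normally come from group variables, and services that are not installed on a host are skipped rather than treated as failures.

```yaml
---
- name: Post-patch service verification (sketch)
  hosts: all
  become: false
  gather_facts: false
  vars:
    critical_services:          # hypothetical list; adjust per host group
      - sshd.service
      - httpd.service
  tasks:
    - name: Collect the state of all services
      ansible.builtin.service_facts:

    - name: Fail when an installed critical service is not running
      ansible.builtin.assert:
        that:
          - ansible_facts['services'][item]['state'] == 'running'
        fail_msg: "{{ item }} is not running after patching."
      # Only check services that actually exist on this host.
      loop: "{{ critical_services | select('in', ansible_facts['services']) | list }}"
```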

3. Integrating Playbooks into AWX

3.1. Create a Project in AWX

Navigate to Projects in the AWX UI. Click Add.

Name: Linux Patching Playbooks
Organization: Select your organization
SCM Type: Git
SCM URL: https://github.com/your-org/ansible-repo.git (replace with your Git repo URL)
SCM Branch/Tag/Commit: main (or your preferred branch)
SCM Credential: (Optional, if your repo is private, create a Git credential first)

Save and perform an SCM Update. This will pull your playbooks into AWX.
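
The project can also be defined with the awx.awx collection. The sketch below mirrors the fields above; the organization name Default is an assumption, and authentication is expected from environment variables such as CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN.

```yaml
---
- name: Define the project in AWX (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create or update the Git-backed project
      awx.awx.project:
        name: Linux Patching Playbooks
        organization: Default               # assumed organization name
        scm_type: git
        scm_url: https://github.com/your-org/ansible-repo.git
        scm_branch: main
        scm_update_on_launch: true          # pull the latest commit before each job
        state: present
```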

3.2. Create an SSH Credential for Target Hosts

Navigate to Credentials. Click Add.

Name: Linux Server SSH Key
Organization: Select your organization
Credential Type: Machine
Username: ec2-user (or your Ansible user)
SSH Private Key: Paste your private SSH key here.

This credential will be used by AWX to connect to your Linux servers.

3.3. Create Job Templates

For each playbook, create a Job Template. Navigate to Job Templates, click Add.

Job Template for Pre-Patch Check:

Name: Pre-Patch Health Check
Job Type: Run
Inventory: Production Linux Servers
Project: Linux Patching Playbooks
Playbook: playbooks/pre_patch_check.yml
Credential: Linux Server SSH Key
Forks: 20 (Adjust based on your environment's capacity)
Limit: (Leave blank for now; will be set in workflow)
Prompt on Launch: Check Limit and Extra Variables.

Job Template for Patching:

Name: Apply Linux Patches
Job Type: Run
Inventory: Production Linux Servers
Project: Linux Patching Playbooks
Playbook: playbooks/patch_servers.yml
Credential: Linux Server SSH Key
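Privilege Escalation: Check Privilege Escalation under Options so the playbook runs with become. The escalation method (sudo) and any become password are set on the Machine credential, not on the job template.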
Forks: 20
Limit: (Leave blank)
Prompt on Launch: Check Limit and Extra Variables.

Job Template for Post-Patch Check:

Name: Post-Patch Health Check
Job Type: Run
Inventory: Production Linux Servers
Project: Linux Patching Playbooks
Playbook: playbooks/post_patch_check.yml
Credential: Linux Server SSH Key
Forks: 20
Limit: (Leave blank)
Prompt on Launch: Check Limit and Extra Variables.
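
Because the three job templates differ only in name, playbook, and privilege escalation, they can be created in one loop with the awx.awx collection. This is a sketch: the organization name Default is an assumption, and the inventory, project, and credential must already exist under the names used above.

```yaml
---
- name: Define the patching job templates in AWX (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    patch_templates:
      - name: Pre-Patch Health Check
        playbook: playbooks/pre_patch_check.yml
        become: false
      - name: Apply Linux Patches
        playbook: playbooks/patch_servers.yml
        become: true
      - name: Post-Patch Health Check
        playbook: playbooks/post_patch_check.yml
        become: false
  tasks:
    - name: Create the three job templates
      awx.awx.job_template:
        name: "{{ item.name }}"
        job_type: run
        organization: Default             # assumed organization name
        inventory: Production Linux Servers
        project: Linux Patching Playbooks
        playbook: "{{ item.playbook }}"
        credentials:
          - Linux Server SSH Key
        forks: 20
        become_enabled: "{{ item.become }}"
        ask_limit_on_launch: true         # "Prompt on Launch" for Limit
        ask_variables_on_launch: true     # "Prompt on Launch" for Extra Variables
        state: present
      loop: "{{ patch_templates }}"
      loop_control:
        label: "{{ item.name }}"
```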

4. Implementing Rolling Batches with AWX Workflows

This is where AWX truly shines for large-scale operations. While Ansible's serial keyword can manage rolling updates *within* a single playbook execution, AWX Workflows provide superior control, visibility, and error handling for orchestrating multiple playbooks across distinct batches of servers.
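
For comparison, this is roughly what the serial approach looks like inside a single play. It is a minimal sketch: the batch size of 50 and the 10% failure threshold are illustrative values, and the debug task stands in for the real patch tasks.

```yaml
---
- name: Apply OS patches in rolling batches within one play (sketch)
  hosts: all
  become: true
  serial: 50                 # process at most 50 hosts at a time
  max_fail_percentage: 10    # stop the play if more than 10% of a batch fails
  tasks:
    - name: Placeholder for the patch tasks
      ansible.builtin.debug:
        msg: "Patching {{ inventory_hostname }} alongside {{ ansible_play_batch | length }} hosts in this batch"
```

With serial, every batch runs inside the same job, so there is no natural pause point between batches and one job log covers the entire fleet. The workflow approach trades that compactness for a separate job per batch and stage.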

4.1. Defining Batches

For 500+ servers, you'll want to divide them into manageable batches (e.g., 50 servers per batch). You can achieve this using host groups defined in your dynamic inventory or by leveraging tags/labels that you can filter on with the --limit parameter (which AWX passes as the Limit field).

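Let's assume your AWS EC2 instances have a tag called BatchGroup with values like batch-01, batch-02, ..., batch-10. If you add a keyed_groups entry to aws_ec2.yml with key: tags.BatchGroup | replace('-', '_') and prefix: patch_group, your dynamic inventory will create groups like patch_group_batch_01, patch_group_batch_02, etc. Replacing the hyphen explicitly keeps the group names predictable regardless of how Ansible is configured to sanitize group names.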
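
A keyed_groups entry that produces those group names might look like the fragment below. It is a sketch to merge into inventory/aws_ec2.yml, and it assumes every instance to be patched carries the BatchGroup tag.

```yaml
# Fragment for inventory/aws_ec2.yml; BatchGroup is the tag assumed in this section.
keyed_groups:
  # The key is a Jinja2 expression, so the hyphen in "batch-01" can be
  # replaced explicitly, giving the group name patch_group_batch_01.
  - key: tags.BatchGroup | replace('-', '_')
    prefix: patch_group
    separator: "_"
```

After the next inventory sync, check the Groups tab of the inventory to confirm the names before using them as limits.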

4.2. Creating a Workflow Template

Navigate to Workflow Templates in the AWX UI. Click Add.

Name: Linux Server Rolling Patch Workflow
Organization: Select your organization
Inventory: Production Linux Servers (This sets the default inventory for all jobs in the workflow, but individual jobs can override it or use the workflow's limit)

Save the workflow. Now, click on the Visualizer tab.

Here, you'll visually construct the patching process. For each batch (e.g., batch-01, batch-02), you'll create a sequence of Pre-Check -> Patch -> Post-Check job templates.

Example Workflow for Batch-01:

Click Start node -> Add Job Template.
- Select Pre-Patch Health Check.
- Node Type: Job Template
- Limit: patch_group_batch_01
- Edge Type: On Success
From the Pre-Patch Health Check (Batch-01) node -> Add Job Template.
- Select Apply Linux Patches.
- Node Type: Job Template
- Limit: patch_group_batch_01
- Edge Type: On Success
From the Apply Linux Patches (Batch-01) node -> Add Job Template.
- Select Post-Patch Health Check.
- Node Type: Job Template
- Limit: patch_group_batch_01
- Edge Type: On Success

Repeat this sequence for batch-02, connecting the start of Pre-Patch Health Check (Batch-02) with the On Success edge of Post-Patch Health Check (Batch-01). Continue this for all your batches. This ensures that each batch is processed sequentially, and the next batch only starts if the previous one completed successfully.

Your workflow will look like a chain: Start -> Pre-Check (B1) -> Patch (B1) -> Post-Check (B1) -> Pre-Check (B2) -> Patch (B2) -> Post-Check (B2) -> ...
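
Building ten identical chains by hand is tedious, so the same chain can be generated with the awx.awx collection. The sketch below makes several assumptions: the organization is named Default, the three job templates already exist, and the node identifiers (node-0, node-1, ...) are arbitrary labels chosen here. Nodes are created from last to first so that each node's successor already exists when it is linked.

```yaml
---
- name: Build the rolling patch workflow as code (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    workflow_name: Linux Server Rolling Patch Workflow
    batches:                     # extend to all ten groups
      - patch_group_batch_01
      - patch_group_batch_02
    stages:
      - Pre-Patch Health Check
      - Apply Linux Patches
      - Post-Patch Health Check
    node_count: "{{ (batches | length) * (stages | length) }}"
  tasks:
    - name: Create the empty workflow template
      awx.awx.workflow_job_template:
        name: "{{ workflow_name }}"
        organization: Default    # assumed organization name
        inventory: Production Linux Servers
        state: present

    - name: Build the ordered chain of (batch, stage) nodes
      ansible.builtin.set_fact:
        chain: >-
          {{ chain | default([]) + [{
               'id': 'node-' ~ idx,
               'next': ('node-' ~ (idx + 1)) if (idx + 1) < (node_count | int) else '',
               'limit': item.0,
               'template': item.1
             }] }}
      loop: "{{ batches | product(stages) | list }}"
      loop_control:
        index_var: idx

    - name: Create nodes last-to-first and link each to its successor on success
      awx.awx.workflow_job_template_node:
        workflow: "{{ workflow_name }}"
        organization: Default
        identifier: "{{ item.id }}"
        unified_job_template: "{{ item.template }}"
        limit: "{{ item.limit }}"
        success_nodes: "{{ [item.next] if item.next else [] }}"
        state: present
      loop: "{{ chain | reverse | list }}"
      loop_control:
        label: "{{ item.id }} ({{ item.template }} on {{ item.limit }})"
```

The node limits only take effect if the job templates have Prompt on Launch enabled for Limit, as configured in section 3.3.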

Setting the Limit at Launch Time:
Instead of hardcoding the limit on each node, you can supply it when the workflow is launched. An extra variable named limit does not change which hosts a job targets, so use the workflow's own Limit field: in the Workflow Template, go to Details and check Prompt on Launch next to Limit.

When you launch the workflow, you'll be prompted for a limit and can enter patch_group_batch_01, patch_group_batch_02, etc., to run the workflow for a specific batch. AWX passes that value to the nodes whose job templates have Prompt on Launch enabled for Limit. Because a limit supplied at the workflow level generally takes precedence over limits stored on individual nodes, use this with a workflow that contains a single Pre-Check -> Patch -> Post-Check chain. For a full rolling run across multiple groups, you would still need distinct workflow nodes for each batch, but this approach gives more flexibility if you only want to run a specific batch outside the full chain.

For a truly dynamic rolling batch where you don't want to create 10+ identical chains, you could keep a single 'Pre-Patch', 'Patch', 'Post-Patch' set of job templates and loop over the batch groups from a "controller" playbook that launches job templates or sub-workflows via the AWX API or the awx.awx.job_launch module (the successor to the older tower_job_launch). However, for most cases, explicitly defining the sequential batches in the visualizer provides clear visibility and control.
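
A controller playbook of that kind could be sketched as follows. It assumes the awx.awx collection is installed, the three job templates prompt for Limit, and connection details come from environment variables such as CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN. The two-hour timeout is an illustrative value.

```yaml
---
- name: Controller play that patches batches one after another (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    batches:                     # extend to all ten groups
      - patch_group_batch_01
      - patch_group_batch_02
    stages:
      - Pre-Patch Health Check
      - Apply Linux Patches
      - Post-Patch Health Check
  tasks:
    - name: Run each stage for each batch and wait for it to finish
      awx.awx.job_launch:
        job_template: "{{ item.1 }}"
        limit: "{{ item.0 }}"
        wait: true               # block until the launched job completes
        timeout: 7200            # seconds to wait before giving up
      # product() yields (batch, stage) pairs in batch-major order.
      loop: "{{ batches | product(stages) | list }}"
      loop_control:
        label: "{{ item.1 }} on {{ item.0 }}"
```

Because the loop runs on a single host and a failed job fails the task, the first unsuccessful stage stops the run before later batches are touched.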

For a simpler approach to manage 500+ servers in 10 batches, you could define 10 groups in your inventory, e.g., `batch_1`, `batch_2`, ..., `batch_10`. Then, in the workflow visualizer, each node's limit would be explicitly set to `batch_1`, `batch_2`, etc.

Example for a single chain for a specific batch:


# Job Template for Pre-Patch Health Check
# Limit: batch_1

# Job Template for Apply Linux Patches
# Limit: batch_1

# Job Template for Post-Patch Health Check
# Limit: batch_1

And then connect these sequentially. To run for `batch_2`, you'd launch another workflow or create a separate chain. For 500+ servers, this approach of chaining jobs explicitly for each batch is common and provides robust control.

5. Monitoring and Reporting

AWX Dashboard: The AWX dashboard provides a real-time view of running, pending, and completed jobs and workflows.
Job Output: Each job template execution produces detailed logs, showing every task's status, output, and any errors. This is invaluable for troubleshooting.
Notifications: Create notification templates (under Administration -> Notification Templates in recent AWX versions) for email, Slack, webhooks, or similar targets, then attach them to the workflow to receive alerts on start, success, or failure.
External Logging: AWX can forward job events and activity stream data to an external log aggregator such as Splunk, Logstash/ELK, or another SIEM for centralized logging and auditing. This is configured under Settings -> Logging.
AWX API: Leverage the AWX REST API to programmatically retrieve job status, results, and generate custom reports. This can be integrated into existing reporting dashboards or change management systems.
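
As one way to use the API for reporting, the sketch below lists the most recent failed jobs with the uri module. The host awx.example.com and the AWX_TOKEN environment variable are placeholders, and the token is assumed to be an OAuth2 token with read access to jobs.

```yaml
---
- name: Report recent failed jobs from the AWX API (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    awx_url: https://awx.example.com                                  # hypothetical host
    awx_token: "{{ lookup('ansible.builtin.env', 'AWX_TOKEN') }}"     # assumed env var
  tasks:
    - name: Query the ten most recently finished failed jobs
      ansible.builtin.uri:
        url: "{{ awx_url }}/api/v2/jobs/?status=failed&order_by=-finished&page_size=10"
        method: GET
        headers:
          Authorization: "Bearer {{ awx_token }}"
        return_content: true
        status_code: 200
      register: failed_jobs
      no_log: true               # keep the token out of the job output

    - name: Summarize the failures
      ansible.builtin.debug:
        msg: "Job {{ item.id }} ({{ item.name }}) finished {{ item.finished }} with limit '{{ item.limit }}'"
      loop: "{{ failed_jobs.json.results }}"
      loop_control:
        label: "{{ item.id }}"
```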

Security Considerations

Security is paramount, especially when dealing with system-wide patching automation.

Least Privilege:
- AWX Credentials: Ensure AWS/cloud credentials only have permissions necessary for inventory discovery (e.g., ec2:DescribeInstances). SSH keys for target hosts should be dedicated for automation and secured.
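- Ansible User on Targets: The Ansible user on target Linux servers should have the minimum necessary privileges. Note that command-restricted sudoers entries (e.g., NOPASSWD: /usr/bin/yum update) generally do not work with Ansible's become, because modules are executed through a shell and Python wrapper rather than as the named command. Practical ways to reduce exposure include a dedicated automation account with key-only login, restricting SSH access for that account to the AWX execution nodes, and auditing sudo usage.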
Vault for Sensitive Data: Use Ansible Vault to encrypt any sensitive data (e.g., non-SSH passwords, API tokens) within your playbooks or extra variables, even if they are in a private Git repository. AWX can decrypt Vault-encrypted files if provided with the Vault password.
AWX RBAC: Implement strict Role-Based Access Control within AWX. Limit who can create/modify credentials, projects, job templates, and especially who can launch the patching workflow. Separate duties between those who write playbooks and those who execute them.
