Patching Windows VMs with GCP’s VM Manager

Published in Appsbroker CTS Google Cloud Tech Blog, Oct 15, 2020

Introduction

Whilst I am a huge fan of short-lived, immutable VMs with system state turned over to managed services like Cloud SQL, sometimes this simply isn’t practical or possible.

In these situations we are often left with long-running, stateful instances that require the same sort of maintenance as ‘traditional’ infrastructure. But how do we manage these without the pain and ancillary infrastructure this often requires?

Recently I have been working with a client that has a stateful .NET application running on Windows Server 2016 GCE instances. The client tasked me with determining a patching strategy for these VMs to ensure stability and, more importantly, security.

Now, I have plenty of experience patching Windows servers from previous jobs and have seen it done both well and badly. Some particularly bad examples that stick out to me are:

Logging in once a month to spend half a day manually patching a PCI CDE.
Having business-critical, clustered services with different nodes running at entirely different patch levels (and not just for a short period of testing!).
Requiring nearly a fortnight of application testing in non-production before roll-out could be authorized for production.

Over the course of my career I have had the pleasure (or misfortune) of working with both WSUS and SCCM. These tools, however, can be complex and temperamental and, in the case of SCCM, require deep pockets.

Generally I have found the best patching strategies to follow these ideas:

Automated management with little human interaction. People get busy, and patching is too often the task that gets kicked down the road to next month.
Regular patching to ensure the baseline is not far behind the latest release. In my experience, too much drift can cause support problems with vendors, for whom the default response is “ensure you are running the latest patches”.
Deploying to non-production environments before production. I like to think this goes without saying, but in the (hopefully unlikely) event that a patch causes issues, it should be caught in non-prod first.

Anyway, back to my client’s request. In April, Google launched their OS patch management service to the masses. This was a tool I had been curious about but had no requirement to implement at the time. This, I felt, was the right opportunity to test it.

Prerequisites and Setup

Initially I wanted to see what the tool could show me through its OS Inventory Management functionality before committing to patching. As I did this in Terraform, here is my code:

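# Allow the agent to publish inventory data as guest attributes.
resource "google_compute_project_metadata_item" "guest_attributes" {
  key   = "enable-guest-attributes"
  value = "TRUE"
}

# Enable the OS Config agent on every VM in the project.
resource "google_compute_project_metadata_item" "osconfig" {
  key   = "enable-osconfig"
  value = "TRUE"
}

# Enable the OS Config API; leave it enabled if this resource is destroyed.
resource "google_project_service" "osconfig" {
  service            = "osconfig.googleapis.com"
  disable_on_destroy = false
}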

The above is simply the result of following the setup requirements in the Google documentation; in summary, a couple of metadata values need to be set and an API needs to be enabled.

The next step is to ensure that the OS you want to monitor has the required agent. Fortunately, as the instances used a recent Google-provided Windows Server 2016 image, I could skip this step. If you aren't so fortunate, installation instructions are here.

At this point I should also state that automatic Windows updates had previously been disabled by setting a registry key in the startup script, using the PowerShell below.

Set-ItemProperty -Path HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU -Name AUOptions -Value 1
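
As a rough sketch of how that line could be delivered, a startup script can be attached to a Windows instance through the `windows-startup-script-ps1` metadata key. The resource name, machine type, zone and network below are placeholders assumed for illustration, not the client’s actual configuration.

```hcl
# Hypothetical instance; name, size, zone and network are illustrative assumptions.
resource "google_compute_instance" "app" {
  name         = "win-app-1"
  machine_type = "n1-standard-2"
  zone         = "europe-west2-a"

  boot_disk {
    initialize_params {
      # Google-provided Windows Server 2016 image family.
      image = "windows-cloud/windows-2016"
    }
  }

  network_interface {
    network = "default"
  }

  metadata = {
    # Executed by the guest on boot. Backslashes inside a heredoc are kept
    # literally, so the registry path needs no escaping.
    windows-startup-script-ps1 = <<-EOT
      Set-ItemProperty -Path HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU -Name AUOptions -Value 1
    EOT
  }
}
```

Note that `Set-ItemProperty` expects the registry key to exist already, so a real script may need to create the path first.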

Compliance Graphs

Like most ops guys, I do like a pretty graph (and could probably stare at Grafana dashboards for hours!), and in this regard Google doesn’t disappoint, with clear reporting in a pie chart, even broken down by OS.

For the eagle-eyed among you, the three VMs reporting no data are actually GKE nodes running Google’s Container-Optimized OS, which Google is responsible for maintaining.

By selecting View details and then a specific VM, I can get a breakdown of the available patches, including their categories, KB numbers and publication dates.

Turning Insight into Action

Satisfied with the insight I now had into the VMs, I was keen to try out the functionality to apply patches. This is done by creating a Google OS Patch Deployment, which can be done in the Console or with Terraform.

resource "google_os_config_patch_deployment" "win-patch" {
  patch_deployment_id = "win-patch"

  # Target labelled VMs in the three europe-west2 zones.
  instance_filter {
    group_labels {
      labels = {
        win-patch = "true"
      }
    }

    zones = ["europe-west2-a", "europe-west2-b", "europe-west2-c"]
  }

  patch_config {
    # DEFAULT reboots a VM only if the patches require it.
    reboot_config = "DEFAULT"

    windows_update {
      classifications = ["CRITICAL", "SECURITY", "DEFINITION"]
    }
  }

  # Maximum time allowed for the patch run.
  duration = "3600s"

  # Weekly at 02:00 London time, on a day set per environment.
  recurring_schedule {
    time_zone {
      id = "Europe/London"
    }

    time_of_day {
      hours = 2
    }

    weekly {
      day_of_week = var.win_patch_day
    }
  }

  rollout {
    mode = "ZONE_BY_ZONE"

    disruption_budget {
      fixed = 5
    }
  }
}

The Terraform documentation for this resource is quite an interesting read. There are, for example, options for pre- and post-patching scripts, which may be very useful in environments where automated testing exists.
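
To give a feel for those options, here is a minimal, untested sketch of a `patch_config` block with a pre-step and a post-step. The script paths are hypothetical and assume the scripts already exist on each VM; the provider also documents a `gcs_object` alternative to `local_path`.

```hcl
# Fragment only: this patch_config block would replace the one inside the
# google_os_config_patch_deployment resource above.
patch_config {
  reboot_config = "DEFAULT"

  windows_update {
    classifications = ["CRITICAL", "SECURITY", "DEFINITION"]
  }

  # Runs on each VM before the patch update.
  pre_step {
    windows_exec_step_config {
      interpreter = "POWERSHELL"
      # Hypothetical script path on the VM.
      local_path = "C:\\scripts\\pre-patch-check.ps1"
    }
  }

  # Runs on each VM after the patch update.
  post_step {
    windows_exec_step_config {
      interpreter = "POWERSHELL"
      # Hypothetical smoke-test script path on the VM.
      local_path = "C:\\scripts\\post-patch-smoke-test.ps1"
    }
  }
}
```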

To break down the code above:

It targets VMs with the label win-patch = “true” in any of the three europe-west2 zones.
The reboot config is set to DEFAULT, which in Windows terms means only rebooting if required.
The patch run is weekly at 2AM London time, limited to one hour, on a day of the week that I have made a variable (to allow me to set different days for non-prod and prod environments).
The rollout plan patches one zone at a time with a disruption budget of 5 VMs, which is enough to cover all 5 VMs in this environment. (In theory you can specify a percentage instead of a fixed number, but I had difficulty doing so.)
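
On the percentage option: the provider documentation lists a separate `percentage` argument inside `disruption_budget`, rather than accepting a percentage value in `fixed`. The fragment below is an untested sketch under that reading, with 50 chosen purely as an example value.

```hcl
# Fragment only: an alternative rollout block for the patch deployment.
rollout {
  mode = "ZONE_BY_ZONE"

  disruption_budget {
    # Set either fixed or percentage, not both; 50 is an arbitrary example.
    percentage = 50
  }
}
```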

As another quick note, the above resource is new and was only added in provider version 3.30.0. I had to update to a newer provider, as we were a couple of versions behind.
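
One way to guard against an older provider being picked up again is a version constraint. This sketch assumes the Terraform 0.13+ `required_providers` syntax and uses the 3.30.0 minimum quoted above.

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      # 3.30.0 is the version quoted above as the first to include the resource.
      version = ">= 3.30.0"
    }
  }
}
```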

Reviewing Patch Jobs

When a patch job is executed, its progress can be watched in real time or, more likely (with patching done in the early hours), reviewed the following morning within VM Manager.

It is also possible to drill further into the logs on specific machines to see how the job progressed. As you can see, this job also required a reboot, which the machine executed automatically.

Concluding Thoughts

Simply put, I am a fan. This solution has enabled me to deploy Windows patches in an automated, reasonably (though not completely) controlled way with minimal to no cost. However, I did have to make some compromises and assumptions:

Microsoft typically releases patches on the second Tuesday of the month, affectionately known as “Patch Tuesday”. I have therefore configured my schedule so that non-production environments are patched on Thursday and production the following Monday. In theory this means non-production will be patched first, with production following, unless Microsoft releases patches outside their usual routine.
I am also trusting the Windows update source configured on my machines to be ‘safe’. As it is left as the default, this should be Microsoft (or potentially Google). Google themselves state that, for absolute control, they recommend deploying a WSUS server as part of the OS patch management solution. I felt that in this environment doing so would be overkill and would require additional, unwelcome management.
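
The Thursday/Monday split described above could be expressed through the `win_patch_day` variable roughly as follows. The validation block (Terraform 0.13+) and the `.tfvars` file names are assumptions for illustration, not part of the original configuration.

```hcl
# Day of the week on which the weekly patch deployment runs.
variable "win_patch_day" {
  type        = string
  description = "Day of week for the Windows patch deployment, e.g. THURSDAY."

  validation {
    condition = contains(
      ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"],
      var.win_patch_day
    )
    error_message = "The win_patch_day value must be an upper-case day name such as MONDAY."
  }
}

# Example values, e.g. in per-environment .tfvars files (hypothetical names):
#   nonprod.tfvars: win_patch_day = "THURSDAY"
#   prod.tfvars:    win_patch_day = "MONDAY"
```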

Finally, I hope this post has given some food for thought and at least presented another method to help maintain long-lived GCP VMs.