Patching Windows VMs with GCP’s VM Manager


Introduction

In these situations we are often left with long-running, stateful instances that require the same sort of maintenance as ‘traditional’ infrastructure. But how do we manage these without the pain and ancillary infrastructure this often requires?

Recently I have been working with a client that has a stateful .NET application running on Windows Server 2016 GCE instances. The client tasked me with determining a patching strategy for these VMs to ensure stability and, more importantly, security.

Now, I have plenty of experience patching Windows servers from previous jobs and have seen it done both well and badly. Some particularly bad examples that stick out to me are:

  • Logging in once a month to spend half a day patching a PCI CDE manually
  • Having business-critical, clustered services with different nodes running at entirely different patch levels (and not just for a short testing period!)
  • Requiring nearly a fortnight of application testing in non-production before roll-out could be authorized for production

Over the course of my career I have had the pleasure (or misfortune) of working with both WSUS and SCCM. These tools, however, can be complex and temperamental and, in the case of SCCM, require deep pockets.

Generally I have found the best patching strategies to follow these ideas:

  • Automated management with little human interaction. People get busy, and patching is too often the task that gets kicked down the road to next month.
  • Regular patching to ensure the baseline is not far behind the latest release. In my experience, too much drift can cause support problems with vendors, for whom the default response is “ensure you are running the latest patches”.
  • Deploying to non-production environments before production. I like to think this goes without saying, but in the (hopefully unlikely) event a patch causes issues, it should be caught in non-prod first.

Anyway, back to my client’s request. In April Google launched their OS patch management service to the masses. I was curious about this tool but had no requirement to implement it at the time. This, though, felt like the right opportunity to test it.

Prerequisites and Setup

resource "google_compute_project_metadata_item" "guest_attributes" {
  key   = "enable-guest-attributes"
  value = "TRUE"
}

resource "google_compute_project_metadata_item" "osconfig" {
  key   = "enable-osconfig"
  value = "TRUE"
}

resource "google_project_service" "osconfig" {
  service            = "osconfig.googleapis.com"
  disable_on_destroy = false
}

The above comes straight from the setup requirements in the Google documentation. In summary, it sets two metadata values at the project level and enables the OS Config API.

The next step is to ensure that the OS you want to monitor has the required agent. Fortunately, as the instances used a recent Google-provided Windows Server 2016 image, I could skip this step. If you aren't so fortunate, installation instructions are available in the Google documentation.

At this point I also want to state that automatic Windows updates had previously been disabled by setting a registry value in the startup script, using the PowerShell below.

Set-ItemProperty -Path HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU -Name AUOptions -Value 1

Compliance Graphs

For the eagle-eyed among you, the three VMs reporting back no data are actually GKE nodes running Google’s Container-Optimized OS, which Google is responsible for maintaining.

By selecting View details and then a specific VM, I can get a breakdown of available patches, including their categories, KB numbers and when they were published.

Turning Insight into Action

resource "google_os_config_patch_deployment" "win-patch" {
  patch_deployment_id = "win-patch"

  # Which VMs to patch: labelled instances in the three London zones.
  instance_filter {
    group_labels {
      labels = {
        win-patch = "true"
      }
    }
    zones = ["europe-west2-a", "europe-west2-b", "europe-west2-c"]
  }

  # What to install and how to handle reboots.
  patch_config {
    reboot_config = "DEFAULT"
    windows_update {
      classifications = ["CRITICAL", "SECURITY", "DEFINITION"]
    }
  }

  duration = "3600s"

  # When to run: weekly at 02:00 Europe/London on a configurable day.
  recurring_schedule {
    time_zone {
      id = "Europe/London"
    }
    time_of_day {
      hours = 2
    }
    weekly {
      day_of_week = var.win_patch_day
    }
  }

  # How many VMs may be disrupted at once during the rollout.
  rollout {
    mode = "ZONE_BY_ZONE"
    disruption_budget {
      fixed = 5
    }
  }
}

The Terraform documentation for this resource is quite an interesting read. There are, for example, options for pre- and post-patching scripts, which may be very useful in environments where automated testing exists.
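
I didn't use the pre and post steps myself, so treat the following as a rough sketch rather than something I have run in anger. It assumes the pre_step and post_step blocks described in the provider documentation, and the resource name, deployment ID, schedule day and both script paths are hypothetical placeholders.

```hcl
# Sketch only: names, schedule day and script paths are hypothetical.
resource "google_os_config_patch_deployment" "win_patch_with_steps" {
  patch_deployment_id = "win-patch-with-steps"

  instance_filter {
    group_labels {
      labels = {
        win-patch = "true"
      }
    }
  }

  patch_config {
    reboot_config = "DEFAULT"

    windows_update {
      classifications = ["CRITICAL", "SECURITY"]
    }

    # Script executed on each VM before the patch run starts.
    pre_step {
      windows_exec_step_config {
        interpreter           = "POWERSHELL"
        local_path            = "C:\\scripts\\pre-patch.ps1" # hypothetical path
        allowed_success_codes = [0]
      }
    }

    # Script executed on each VM after the patch run, e.g. a smoke test.
    post_step {
      windows_exec_step_config {
        interpreter           = "POWERSHELL"
        local_path            = "C:\\scripts\\post-patch-smoke-test.ps1" # hypothetical path
        allowed_success_codes = [0]
      }
    }
  }

  recurring_schedule {
    time_zone {
      id = "Europe/London"
    }
    time_of_day {
      hours = 2
    }
    weekly {
      day_of_week = "THURSDAY"
    }
  }
}
```

The scripts here are expected to already exist on the VM. The provider documentation also describes a gcs_object option for pulling a script from a bucket, which may suit environments where scripts are versioned centrally.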

To break down the code above:

  • It targets VMs with the label win-patch = “true” in any of the three europe-west2 zones.
  • The reboot config is set to DEFAULT, which in Windows terms means rebooting only if required.
  • The patch run is weekly at 2 AM (Europe/London) on a day of the week I have variablised (to allow me to set different days for non-prod and prod environments).
  • The rollout works zone by zone with a fixed disruption budget of 5, which allows up to 5 VMs to be disrupted at once and so covers all 5 VMs here. (In theory you can put a percentage in this field, but I had difficulty doing so.)
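
For completeness, the variable behind var.win_patch_day might be declared along these lines. This is a minimal sketch rather than my exact code: the validation block assumes Terraform 0.13 or later, and the tfvars file names in the comments are hypothetical.

```hcl
# Sketch only: the description, validation and file names are illustrative.
variable "win_patch_day" {
  description = "Day of the week on which the weekly patch deployment runs."
  type        = string

  validation {
    condition = contains(
      ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"],
      var.win_patch_day
    )
    error_message = "The win_patch_day value must be an upper-case day name such as THURSDAY."
  }
}

# nonprod.tfvars (hypothetical file name)
#   win_patch_day = "THURSDAY"
#
# prod.tfvars (hypothetical file name)
#   win_patch_day = "MONDAY"
```

Keeping the day in a variable means the same patch deployment code can be applied to every environment, with only the tfvars file differing.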

As another quick note, this resource is new and was only added in provider version 3.30.0. I had to update to a newer provider as we were a couple of versions behind.
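
If you want to make that minimum version explicit, a constraint along the following lines should do it. This is a sketch that assumes Terraform 0.13 or later for the source syntax; only the 3.30.0 floor comes from my own experience above.

```hcl
# Sketch only: pins the Google provider to a release that includes
# google_os_config_patch_deployment (3.30.0 or newer).
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 3.30.0"
    }
  }
}
```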

Reviewing Patch Jobs

It is also possible to drill even further into the logs on specific machines to see how the job progressed. As you can see, this job also required a reboot, which the machine executed automatically.

Concluding Thoughts

  • Microsoft typically releases patches on the second Tuesday of the month, affectionately known as “Patch Tuesday”. I have therefore configured my schedule so that non-production environments get patched on Thursday, with production the following Monday. In theory, this means non-production will be patched first, with production following, unless Microsoft releases patches outside its usual routine.
  • I am also trusting the Windows update source configured on my machines to be ‘safe’. As it is left as the default, this should be Microsoft (or potentially Google). Google themselves state that for absolute control they recommend deploying a WSUS server as part of the OS patch management solution. I felt that in this environment deploying one would be overkill and would require additional, unwelcome management.

Finally, I hope this post has given some food for thought and at least presented another method to help maintain long-lived GCP VMs.

I am a GCP Platform Engineer based in the UK. Thoughts here are my own and don’t necessarily represent my employer.
