Google Cloud's VM Manager is a suite of tools that can be used to manage virtual machines running on Google Cloud Platform at scale.
One of its services is OS patch management, which applies patches to virtual machines on demand or on a schedule. Both Linux and Windows operating systems are supported, and the service uses the respective update infrastructure of the operating system (e.g. apt, ZYpp, yum and Windows Update Agent) to both identify and apply missing patches.
A request that comes up often when talking to customers who plan to use this service, or already use it, is how to back up the state of a virtual machine before patches are applied, so that they can roll back if something goes wrong with the patching process or with the patches themselves. Unfortunately, VM Manager does not support this out of the box. What the service does support is running pre-patch and post-patch scripts on each VM that is targeted for patching.
Pre-patch and post-patch scripts run on the instance itself, in the context of the service account that is associated with it (either the Compute Engine default service account or the one specified when the instance was created).
In the remainder of this article I will explain how pre-patch scripts can be leveraged to create a crash-consistent disk clone of the persistent disks attached to a VM before patches are applied.
This article describes a solution to a common customer problem. The ideal solution would be a direct integration in the service that does not rely on executing the snapshot creation on the VM and in the context of the associated service account. Assigning the required permissions to the service account ultimately gives these permissions to any user who can log in to the VMs.
By making the patching of a VM dependent on taking a disk clone (this is how the sample script in this article is put together), a failure to create the clone ultimately results in not patching the VM.
Setting up VM Manager and OS patch management is out of the scope of this article. Follow the instructions on Setting up VM Manager to enable VM Manager for your project.
Creating disk clones requires at least the following permissions to be assigned to the service account associated with the VM:
The script that creates the clone ultimately runs on the VM that is being patched. This means that it is not enough to grant the correct permissions to the service account associated with the VM; the API access scope needs to be set as well.
Set the scope to either Allow full access to all Cloud APIs
Note: Irrespective of what you set, scopes only determine which Cloud APIs can be called from the VM instance. IAM permissions act independently of this, so even with the scope set to all Cloud APIs, access is ultimately granted through IAM permissions.
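As a rough sketch, the access scope of an existing VM could be changed with gcloud along the following lines. The instance name, zone and service account e-mail are placeholders, and the dry-run helper is my own addition so that the commands can be previewed before anything is changed. Keep in mind that an instance has to be stopped before its service account or scopes can be modified.

```shell
#!/bin/sh
# Sketch: switch a VM to the cloud-platform ("all Cloud APIs") access scope.
# The instance, zone and service account values are placeholders.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

set_full_api_scope() {
  instance=$1
  zone=$2
  sa_email=$3
  # Scopes can only be changed while the instance is stopped.
  run gcloud compute instances stop "$instance" --zone="$zone" &&
  run gcloud compute instances set-service-account "$instance" \
    --zone="$zone" \
    --service-account="$sa_email" \
    --scopes=cloud-platform &&
  run gcloud compute instances start "$instance" --zone="$zone"
}

# Example (preview only, nothing is executed):
#   DRY_RUN=1; set_full_api_scope my-vm europe-west1-b my-sa@my-project.iam.gserviceaccount.com
```

Previewing with DRY_RUN=1 first is a cheap way to double-check the target instance, since stopping a VM is disruptive.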
I've included sample scripts for both Linux and Windows based operating systems at the end of this section. I have tested these scripts on Debian 10, Ubuntu 20.04, the latest Container-Optimized OS and Windows Server 2019. If you use different versions, I strongly recommend testing the scripts first.
Both versions of the sample script follow the same logic:
- Retrieve the ID of the patch job (used to tag the snapshot for better discoverability)
- Retrieve disks associated with the VM
- Create disk clones
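To make the three steps above more concrete, here is a minimal sketch of how they could look on Linux. It is not the sample script itself: it assumes the patch job ID is passed in by the caller (how the real script obtains it is not shown here), it assumes zonal disks, and the function names are my own. The metadata server endpoint and the gcloud commands are standard, but please treat the whole block as an illustration.

```shell
#!/bin/sh
# Sketch of a Linux pre-patch script: clone every disk attached to this VM.
# Assumption: the patch job ID is supplied as the first argument.

METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance"

# Read one value from the Compute Engine metadata server.
metadata() {
  curl -sf -H "Metadata-Flavor: Google" "$METADATA_URL/$1"
}

# Return the part after the last "/" (e.g. the zone or disk name of a URL).
last_segment() {
  printf '%s\n' "${1##*/}"
}

# Clone name: original disk name with the patch job ID appended.
clone_name() {
  printf '%s-%s\n' "$1" "$2"
}

clone_disks() {
  patch_job_id=$1
  instance=$(metadata name) || return 1
  # The zone is returned as projects/<number>/zones/<zone>.
  zone=$(last_segment "$(metadata zone)") || return 1

  # Source URLs of all attached disks; gcloud joins list values with ";".
  sources=$(gcloud compute instances describe "$instance" --zone="$zone" \
    --format="value(disks[].source)") || return 1

  for source in $(printf '%s' "$sources" | tr ';' '\n'); do
    disk=$(last_segment "$source")
    # A failed clone returns non-zero, so the script fails before patching.
    gcloud compute disks create "$(clone_name "$disk" "$patch_job_id")" \
      --source-disk="$source" --zone="$zone" || return 1
  done
}

# Example: clone_disks "$PATCH_JOB_ID"
```

Returning a non-zero status as soon as one clone fails is what ties patching to a successful clone, as described earlier.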
You need to download the appropriate version of the script and then upload it to a storage bucket (this guide explains how to do just that):
Now we need to get the version of the file we just uploaded. We need to pass along the version so the patch service can pick up the right version for execution:
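One possible way to do the upload and the version lookup from the command line is sketched below. The version of an object in Cloud Storage is its generation number, which gsutil stat prints. The bucket name is a placeholder and the function names are my own.

```shell
#!/bin/sh
# Sketch: upload a pre-patch script and look up its object generation.
# The bucket name passed in is a placeholder for your own bucket.

# Read "gsutil stat" output on stdin and print the generation number.
parse_generation() {
  awk '$1 == "Generation:" { print $2 }'
}

upload_and_get_version() {
  file=$1
  bucket=$2
  object=$(basename "$file")
  gsutil cp "$file" "gs://$bucket/$object" || return 1
  # stat describes the live version of the object that was just uploaded.
  gsutil stat "gs://$bucket/$object" | parse_generation
}

# Example: VERSION=$(upload_and_get_version snapshot-linux.sh my-bucket)
```

The printed number is the value to use in place of <VERSION> when referencing the script in a patch job.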
Create patch job with pre-patch script execution
Now that the scripts have been uploaded, we can create patch jobs. These can be either on-demand or scheduled, and they can be configured to target different subsets of VM instances. More information about instance filters can be found in the documentation.
The following samples create on-demand patch jobs targeting all instances. Make sure to supply the correct values for the GCS bucket and the file version for the script.
gcloud compute os-config patch-jobs execute \
  --display-name=clone \
  --instance-filter-all \
  --reboot-config=default \
  --pre-patch-linux-executable=gs://<BUCKET>/snapshot-linux.sh#<VERSION> \
  --async

gcloud compute os-config patch-jobs execute \
  --display-name=clone \
  --instance-filter-all \
  --reboot-config=default \
  --windows-classifications=critical,security \
  --pre-patch-windows-executable=gs://<BUCKET>/snapshot-windows.ps1#<VERSION> \
  --async
Validate snapshot creation
Patch results / Cloud Logging
Navigate to Compute Engine then OS patch management.
Select Patch Jobs.
Select the job and review its status.
For more details, scroll down in the patch job execution details overlay and select View for a VM that was targeted by this job.
This opens Cloud Logging and contains a detailed log of the script execution.
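If you prefer the command line over the console, the same information can be approximated with gcloud. The following is a small sketch with helper names of my own; the patch job ID is a placeholder for the ID of the job you started.

```shell
#!/bin/sh
# Sketch: inspect a patch job from the command line.
# The patch job ID passed in is a placeholder for your own job.

# Print the current state of a patch job, e.g. PATCHING or SUCCEEDED.
patch_job_state() {
  gcloud compute os-config patch-jobs describe "$1" --format="value(state)"
}

# Succeed if the given state is a final one that the job will not leave again.
is_finished() {
  case $1 in
    SUCCEEDED|COMPLETED_WITH_ERRORS|CANCELED|TIMED_OUT) return 0 ;;
    *) return 1 ;;
  esac
}

# Show the per-VM results of a patch job.
patch_job_instances() {
  gcloud compute os-config patch-jobs list-instance-details "$1"
}

# Example:
#   is_finished "$(patch_job_state "$PATCH_JOB_ID")" && patch_job_instances "$PATCH_JOB_ID"
```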
Navigate to Compute Engine then Disks.
Review the available disks.
The name of the disk clone is the original disk name with the ID of the patch job appended. Additionally, a few labels have been set to make discovery easier.
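Because the clone name ends with the patch job ID, the clones of a given job can also be found from the command line with a name filter. This is only a sketch: the helper names are mine and the patch job ID is a placeholder.

```shell
#!/bin/sh
# Sketch: find the disk clones created for one patch job by their name suffix.

# Build a gcloud filter matching disk names that end with "-<patch job ID>".
clone_filter() {
  printf 'name~"-%s$"\n' "$1"
}

# List matching disks with a few columns that help when reviewing them.
list_clones() {
  gcloud compute disks list --filter="$(clone_filter "$1")" \
    --format="table(name,zone.basename(),sizeGb,creationTimestamp)"
}

# Example: list_clones "$PATCH_JOB_ID"
```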
This article illustrates how pre-patch and post-patch scripts can be used to automate common enterprise requirements. While there are limitations and considerations to be made, this process can be used to secure workloads before patching at scale.