Create disk clone before patching VMs with VM Manager

Google Cloud's VM Manager is a suite of tools that can be used to manage virtual machines running on Google Cloud Platform at scale.

One of its services is OS patch management, which applies patches to virtual machines on demand or on a schedule. Both Linux and Windows operating systems are supported, and the service uses the operating system's native update infrastructure (e.g. apt, ZYpp, yum, and the Windows Update Agent) to both identify and apply missing patches.

A request that comes up often when talking to customers that plan on using this service, or are already using it, is how to create a backup of the state of a virtual machine before patches are applied, so that they can roll back if something goes wrong with the patching process or with the patches themselves. Unfortunately, VM Manager does not support this feature out of the box. It does, however, support running pre-patch and post-patch scripts on each VM that is targeted for patching.

Pre-patch and post-patch scripts run on the instance itself, in the context of the service account associated with it (either the Compute Engine default service account or the one specified at instance creation).

In the remainder of this article I will explain how pre-patch scripts can be leveraged to create a crash-consistent clone of a VM's attached persistent disks before patches are applied.

Considerations

This article describes a solution to a common customer problem. The ideal solution would be a direct integration in the service that does not rely on executing the clone creation on the VM in the context of the associated service account. Assigning the required permissions to the service account ultimately grants them to any user who can log in to the VMs.

By making the patching of a VM dependent on taking a disk clone (this is how the sample script in this article is put together), a failure to create the clone results in the VM not being patched.

Prerequisites

Setting up VM Manager and OS patch management is out of the scope of this article. Follow the instructions on Setting up VM Manager to enable VM Manager for your project.

Permissions

Creating disk clones requires at least the following permissions to be assigned to the service account associated with the VM:

compute.disks.create # on the project
compute.disks.createSnapshot # on the source disk
Required API permissions 
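
One way to grant only these permissions is via a custom role. The following is a sketch; the role ID `diskCloneCreator`, `<PROJECT_ID>`, and `<SERVICE_ACCOUNT_EMAIL>` are placeholders you need to substitute, and binding at the project level is broader than the per-disk minimum but the simplest to set up:

```shell
# Create a custom role containing only the required permissions
gcloud iam roles create diskCloneCreator \
    --project=<PROJECT_ID> \
    --title="Disk Clone Creator" \
    --permissions=compute.disks.create,compute.disks.createSnapshot

# Bind the custom role to the service account associated with the VMs
gcloud projects add-iam-policy-binding <PROJECT_ID> \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="projects/<PROJECT_ID>/roles/diskCloneCreator"
```
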

Scopes

The script that creates the clone ultimately runs on the VM that is being patched. This means that it is not only required to grant the correct permissions to the service account associated with the VM, but the API access scope of the instance needs to be set as well.

When creating or editing the VM, set the access scope to Allow full access to all Cloud APIs:

Allow full access to all Cloud APIs
Note: Irrespective of what you set, scopes only determine which Cloud APIs can be called from the VM instance. IAM permissions act independently of this, so even with the scope set to all Cloud APIs, access is ultimately granted through IAM permissions.
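
If the VM was created with narrower scopes, they can be changed afterwards; note that changing scopes requires the instance to be stopped. A sketch, with `<VM_NAME>` and `<ZONE>` as placeholders (`--scopes=cloud-platform` corresponds to full access to all Cloud APIs):

```shell
# Stop the VM, widen its API access scope, then start it again
gcloud compute instances stop <VM_NAME> --zone=<ZONE>
gcloud compute instances set-service-account <VM_NAME> \
    --zone=<ZONE> \
    --scopes=cloud-platform
gcloud compute instances start <VM_NAME> --zone=<ZONE>
```
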

Upload scripts

I've included sample scripts for both Linux and Windows based operating systems at the end of this section. I have tested these scripts on Debian 10, Ubuntu 20.04, the latest Container-Optimized OS, and Windows Server 2019. If you use different versions, I strongly recommend testing the scripts first.

Both versions of the sample script follow the same logic:

  1. Retrieve the ID of the patch job (used to tag the snapshot for better discoverability)
  2. Retrieve disks associated with the VM
  3. Create disk clones
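
The steps above can be sketched roughly as follows. This is not the actual sample script (see the GitHub links below for those); it assumes `gcloud` is available on the VM, uses the metadata server to identify the instance, and omits the patch job ID handling:

```shell
#!/bin/bash
set -euo pipefail

MD="http://metadata.google.internal/computeMetadata/v1"
H="Metadata-Flavor: Google"

# Identify this VM via the metadata server
NAME=$(curl -s -H "$H" "$MD/instance/name")
ZONE=$(curl -s -H "$H" "$MD/instance/zone" | awk -F/ '{print $NF}')

# Retrieve the disks attached to this VM
# (gcloud joins repeated fields with ';' in value() format)
DISKS=$(gcloud compute instances describe "$NAME" --zone "$ZONE" \
    --format="value(disks[].source)" | tr ';' '\n')

# Create a clone of each attached disk
for DISK in $DISKS; do
  gcloud compute disks create "$(basename "$DISK")-clone" \
      --source-disk="$DISK" \
      --zone="$ZONE"
done
```

Because the script exits non-zero on any failure (`set -e`), a failed clone aborts the pre-patch step and, as discussed under Considerations, the VM is not patched.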

You need to download the appropriate version of the script and then upload it to a storage bucket (this guide explains how to do just that):

# Copy script to GCS bucket
gsutil cp clone-linux.sh gs://<BUCKET>/clone-linux.sh
Upload script(s) to GCS bucket

Now we need to get the version of the file we just uploaded. This version is passed along later so the patch service picks up the right version of the script for execution:

# Retrieve file version
gsutil ls -a gs://<BUCKET>/clone-linux.sh | cut -d'#' -f 2
Retrieve file version

Linux

Find the latest version on GitHub.

Windows

Find the latest version on GitHub.

Create patch job with pre-patch script execution

Now that the scripts have been uploaded, we can create patch jobs. These can be either on-demand or scheduled, and they can be configured to target different subsets of VM instances. More information about instance filters can be found in the documentation.

The following samples create on-demand patch jobs targeting all instances. Make sure to supply the correct values for the GCS bucket and the file version for the script.

Linux

gcloud compute os-config patch-jobs execute \
    --display-name=clone \
    --instance-filter-all \
    --reboot-config=default \
    --pre-patch-linux-executable=gs://<BUCKET>/clone-linux.sh#<VERSION> \
    --async

Windows

gcloud compute os-config patch-jobs execute \
    --display-name=clone \
    --instance-filter-all \
    --reboot-config=default \
    --windows-classifications=critical,security \
    --pre-patch-windows-executable=gs://<BUCKET>/clone-windows.ps1#<VERSION> \
    --async

Validate snapshot creation

Patch results / Cloud Logging

Navigate to Compute Engine then OS patch management.

Navigate to OS patch management

Select Patch Jobs.

Select Patch Jobs in the OS patch management overview

Select the job and review its status.

For more details, scroll down in the patch job execution details overlay and select View for a VM that was targeted by this job.

Access Cloud Logging to see logs for a VM instance targeted by a patch job

This opens Cloud Logging, which contains a detailed log of the script execution.

Detailed log of the patch process (including pre-patch script) in Cloud Logging

Clones

Navigate to Compute Engine then Disks.

Navigate to Disks

Review the available disks.

Filtered disks within the selected project

The name of the disk clone is the original disk name with the ID of the patch job appended. Additionally, a few labels have been set to make discovery easier:

Disk details including labels indicating the patch job
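
With labels in place, clones can also be listed from the command line with a filter. The label key `patch-job` below is illustrative; check the sample scripts for the actual keys they set:

```shell
# List all disk clones created by patch jobs (label key is an assumption)
gcloud compute disks list \
    --filter="labels.patch-job:*" \
    --format="table(name,zone,sizeGb,labels.list())"
```
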

Conclusion

This article illustrates how pre-patch and post-patch scripts can be used to automate common enterprise requirements. While there are limitations and considerations to be made, this process can be used to secure workloads before patching at scale.
