Hello, this is @ishii1648 from the All About SRE team.
This article is the day 25 entry of the All About Group (All About Co., Ltd.) Advent Calendar 2020.
One common problem with cloud verification environments is that you forget to stop the resources you built for testing and the costs keep climbing. In the past it was probably enough to write a script that automatically stopped GCE instances (EC2 on AWS) at around 9 p.m., but these days there are many resources besides GCE that can be left running by mistake.
Since All About uses GKE, what I most often forget to stop are the GKE instance groups built in the verification environment.
Given this situation, I felt that effectively preventing these oversights required an integrated mechanism that can automatically stop resources other than GCE as well (GKE, Cloud SQL, Memorystore), and that is what the script introduced here implements.
When implementing it, I set the following design policy:

- Reduce the amount of code that has to be implemented per resource type
- Notify Slack of the instances that were actually stopped
- Hardcode the environment name so that it only runs in the verification environment (a simple guard is sketched just below)
- Adopt structured logging
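Of these, restricting execution to the verification environment can be as simple as a check against a hardcoded project ID. The actual check is not shown in the excerpts below, so the following is only a minimal sketch of the idea; the constant value and function name are hypothetical.

// allowedProjectID is the verification project this tool may operate on (hypothetical value).
const allowedProjectID = "example-verification-project"

// isVerificationProject reports whether the tool is targeting the verification environment.
func isVerificationProject(projectID string) bool {
    return projectID == allowedProjectID
}

The client code could, for example, check this at startup and refuse to do anything when it returns false.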
As for the last point: logs written by Cloud Functions can be viewed in Cloud Logging, and if each entry is structured according to the Cloud Logging format, its severity is color-coded and you can filter on specific field values. That is why I adopted structured logging.
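As a rough illustration, Cloud Logging treats JSON lines written to stdout as structured entries and reads the log level from a severity field. The log package used in the code below is not published in this article, so the following is only a sketch of the general idea; the field names other than severity and message, and the helper signatures, are assumptions.

package log

import (
    "encoding/json"
    "os"
)

// entry is one structured log line in the format Cloud Logging understands:
// a JSON object with "severity" and "message" fields.
type entry struct {
    Severity     string `json:"severity"`
    Message      string `json:"message"`
    ResourceType string `json:"resourceType,omitempty"` // assumed extra field for filtering by resource
}

func write(severity, resourceType, msg string) {
    // Cloud Logging parses each JSON line written to stdout into a structured entry.
    _ = json.NewEncoder(os.Stdout).Encode(entry{
        Severity:     severity,
        Message:      msg,
        ResourceType: resourceType,
    })
}

func Info(resourceType, msg string) { write("INFO", resourceType, msg) }

func Error(resourceType string, err error) { write("ERROR", resourceType, err.Error()) }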
From here on, I will explain using the actual code as an example. Unfortunately I cannot publish all of it, since it lives in the company's private repository, so I have picked out only the important parts. Rather than reading the code closely, I hope you will skim it to get a rough picture of the whole. Incidentally, it is written in Go.
The following code abstracts the operations on the various resources. The file that stops each resource type is named *_patroller.go; the methods each one must implement are declared in an interface, and instances of all of them are created when NewGCPResourcePatroller() is called from the client code.
// factories maps a resource type name to the constructor for its Patroller.
var factories = make(map[string]func(ctx context.Context, projectID string) (Patroller, error))

// API call interval
const CallInterval = 50 * time.Millisecond

// Patroller is the set of operations every resource type must implement.
type Patroller interface {
    Scan() error
    TerminateOrDestroy(*sync.WaitGroup, interface{})
    PrintReport() (slack.WebhookMessage, error)
    GetScanedResults() []interface{}
    GetTerminatedResults() []interface{}
    GetResourceType() string
}

type patroller struct {
    resourceType string
    results      map[stateType][]interface{}
    projectID    string
    ctx          context.Context
}

type GCPResourcePatroller struct {
    Patrollers map[string]Patroller
}

// registerPatroller is called from each *_patroller.go to register its factory.
func registerPatroller(patroller string, factory func(ctx context.Context, projectID string) (Patroller, error)) {
    factories[patroller] = factory
}

// NewGCPResourcePatroller instantiates every registered Patroller.
func NewGCPResourcePatroller(ctx context.Context, projectID string) (*GCPResourcePatroller, error) {
    patrollers := make(map[string]Patroller)
    for key, factory := range factories {
        patroller, err := factory(ctx, projectID)
        if err != nil {
            return nil, err
        }
        patrollers[key] = patroller
    }
    return &GCPResourcePatroller{Patrollers: patrollers}, nil
}
The following is the client code. When NewGCPResourcePatroller() is called, instances for all the resources come back in a map, so the client takes each instance in a loop and runs the operations on it.
Because the methods are defined by the interface, the implementation is polymorphic and never has to care which resource it is handling.
The flow of operations is: (1) identify the list of resources to stop, (2) call the stop API, (3) build the Slack message to send, and (4) post the notification.
func NewOperator(ctx context.Context, projectID string, webhook string, dryrun bool, debug bool) *Operator {
    log.SetLogger(debug)
    c, err := patroller.NewGCPResourcePatroller(ctx, projectID)
    if err != nil {
        log.Error("", err)
        os.Exit(1)
    }
    return &Operator{
        Patrollers:   c.Patrollers,
        SlackWebhook: webhook,
        DryRun:       dryrun,
    }
}

func (o *Operator) Run() {
    for _, patroller := range o.Patrollers {
        // (1) identify the resources to stop
        patroller.Scan()
        // (2) call the stop API
        o.terminateOrDestroyWithDryrun(patroller)
        if len(patroller.GetTerminatedResults()) == 0 {
            log.Info(patroller.GetResourceType(), "no terminated resources. skip to send slack message")
            continue
        }
        // (3) build the Slack message
        report, err := patroller.PrintReport()
        if err != nil {
            log.Error(patroller.GetResourceType(), err)
            continue
        }
        // (4) send the notification
        if err := slack.PostWebhook(o.SlackWebhook, &report); err != nil {
            log.Error(patroller.GetResourceType(), err)
            continue
        }
        log.Debug(patroller.GetResourceType(), "success to send slack message")
    }
}
Finally, the per-resource implementation. Here is the part that stops the GKE instance groups; the file is named gke_patroller.go.
The important part is init(). In Go, init() runs automatically when the package is loaded, so by writing each resource's registration here, an instance for it can be returned when NewGCPResourcePatroller(ctx context.Context, projectID string) described above is called.
Below that are the functions that actually fetch the GKE instance groups and stop them. There is nothing particularly special about them, but I include them as an example of working with the GKE-related APIs.
func init() {
    registerPatroller("GKE", newGKEPatroller)
}

...

func (p *GKEPatroller) Scan() error {
    // get all clusters list
    clusters, err := container.NewProjectsLocationsClustersService(p.containerService).List("projects/" + p.projectID + "/locations/-").Do()
    if err != nil {
        log.Error(p.GetResourceType(), err)
        return err
    }
    for _, cluster := range clusters.Clusters {
        var instanceGroups []*GKEInstanceGroup
        for _, nodePool := range cluster.NodePools {
            // extract the instance group name and zone from the instance group URL
            instanceGroupUrlList := strings.Split(nodePool.InstanceGroupUrls[0], "/")
            instanceGroupName := instanceGroupUrlList[len(instanceGroupUrlList)-1]
            instanceGroupZone := instanceGroupUrlList[len(instanceGroupUrlList)-3]
            resp, err := p.computeService.InstanceGroups.Get(p.projectID, instanceGroupZone, instanceGroupName).Context(p.ctx).Do()
            if err != nil {
                log.Error(p.GetResourceType(), err)
                return err
            }
            // only record instance groups that still have running nodes
            if resp.Size > 0 {
                instanceGroups = append(instanceGroups, &GKEInstanceGroup{name: instanceGroupName, zone: instanceGroupZone})
            }
        }
        if len(instanceGroups) > 0 {
            c := &GKECluster{
                name:           cluster.Name,
                labels:         cluster.ResourceLabels,
                location:       cluster.Location,
                instanceGroups: instanceGroups,
            }
            p.results[scaned] = append(p.results[scaned], c)
            log.Info(p.GetResourceType(), "found running node "+c.name)
        }
    }
    return nil
}

func (p *GKEPatroller) TerminateOrDestroy(wg *sync.WaitGroup, scanedResult interface{}) {
    defer wg.Done()
    cluster := scanedResult.(*GKECluster)
    ms := compute.NewInstanceGroupManagersService(p.computeService)
    for _, instanceGroup := range cluster.instanceGroups {
        // resize the managed instance group to 0 to stop all nodes
        if _, err := ms.Resize(p.projectID, instanceGroup.zone, instanceGroup.name, 0).Do(); err != nil {
            log.Error(p.GetResourceType(), err)
            return
        }
        log.Info(p.GetResourceType(), instanceGroup.name+" called stop command")
    }
    p.results[terminated] = append(p.results[terminated], cluster)
}
The above is a very rough walkthrough of the code. Since I have shown only fragments, there were probably many points that were hard to follow, but I hope it gives you at least a sense of the implementation.
Thanks to this mechanism, the unfortunate pattern of forgetting to stop GKE and only noticing it the next day has not happened since.
Also, this was my first time implementing something with Cloud Functions, and it was easier than I expected, including deployment. To do the same thing on AWS you would probably combine CloudWatch Events with Lambda, and the approach is much the same, so if you have AWS experience there should be almost nothing to trip over.
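For reference, the Cloud Functions side can be a small Pub/Sub-triggered entry point that Cloud Scheduler fires at the scheduled time and that simply hands off to the Operator shown earlier. The following is only a sketch of that idea; the package layout, module path, and environment variable names are assumptions, not the actual code.

package patrol

import (
    "context"
    "os"

    // hypothetical module path for the client code shown above
    "github.com/example/gcp-resource-patroller/operator"
)

// PubSubMessage is the payload delivered by the Pub/Sub trigger.
type PubSubMessage struct {
    Data []byte `json:"data"`
}

// StopResources is the Cloud Functions entry point, fired by Cloud Scheduler via Pub/Sub.
func StopResources(ctx context.Context, _ PubSubMessage) error {
    op := operator.NewOperator(
        ctx,
        os.Getenv("GCP_PROJECT_ID"),    // hypothetical env var for the verification project
        os.Getenv("SLACK_WEBHOOK_URL"), // hypothetical env var for the Slack webhook
        false, // dryrun
        false, // debug
    )
    op.Run()
    return nil
}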
On the other hand, one regret is that I chose Cloud Functions rather casually. All we really need is something that can be started at a scheduled time, so both Cloud Functions and Cloud Run were candidates, and I went with Cloud Functions simply because it resembles Lambda. However, Cloud Run is easier to test locally and has no particular disadvantages otherwise, so in hindsight I think Cloud Run would have been the better choice.
If you are interested in how to choose between Cloud Functions and Cloud Run when either would work, the article below covers it in detail. https://medium.com/google-cloud/cloud-run-and-cloud-function-what-i-use-and-why-12bb5d3798e1
That wraps up "automatically stopping unnecessary resources in the GCP verification environment".