[Go] [Continued] Due to the TCP specification, you cannot know whether the peer has closed the connection until you actually send a packet

Introduction

Previously, I wrote an article, [Due to the TCP specification, you cannot know whether the peer has closed the connection until you actually send a packet](https://qiita.com/behiron/items/3719430e12cb770980f3). The background behind it was the following:

When the app sent SQL to the DB, I was getting errors saying the connection was invalid. The cause itself was very simple: the server-side (DB-side) timeout for holding connections was shorter than the client's. The point of the article was, "this surfaces as an error when the client library writes to the socket, so handle it properly and fall back to another connection held in the connection pool!"

This happened while I was using Go's MySQL driver, and it turns out that engineers at GitHub ran into the same problem last year, fixed it, and wrote a blog post on that very theme.

It was a great learning experience, so I would like to introduce it here while filling in the parts that are hard to follow.

Read the blog

Background

The blog post in question is Three bugs in the Go MySQL Driver.

The background and other topics were also very interesting, so I will introduce some parts that deviate slightly from the main point.

GitHub's service was apparently a Rails monolith, but over the last few years it has gradually been rewritten in Go, starting with the parts that require speed and reliability.

One of them is a service called `authzd`, which went into operation in 2019 and appears to have been the first Go web application at GitHub to connect to MySQL.

The blog walks through the bugs they ran into at the time, organized around the three PRs GitHub contributed as fixes. In this article I will cover the first one, The crash.

By the way, the post says resulting in our first "9" of availability for the service, so fixing The crash apparently pushed the service's availability past 90%.

There are probably cases in your own work where you want to raise a service's availability target; here the bottleneck was OSS, and going in and fixing the OSS itself is great!

By the way, the screenshot attached to the blog looks like a Datadog dashboard, so it seems GitHub also uses Datadog (not that it matters).

The crash

Roughly speaking, the story is this: if the MySQL server's `idle timeout` is shorter than the client's, the server may have already closed the connection by the time the client tries to send a query. When that happens, the client simply gets an error.

The simple solution to this problem is to set `(*DB).SetConnMaxLifetime` to something smaller than the server's `idle timeout`. However, since the knob is `SetConnMaxLifetime` and not something like `SetIdleConnMaxLifetime`, even active (not idle) connections get closed unnecessarily, which is not great. The background seems to be that not every database server has a concept of `idle`, so the database/sql side does not provide such a setting.
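As a minimal sketch of that workaround (the DSN and the 25s/30s figures are placeholders for illustration, not GitHub's actual configuration):

package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN for illustration only.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/app")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// If the server closes idle connections after, say, 30s, keep every pooled
	// connection's lifetime below that so the client retires a connection
	// before the server can close it out from under us. Note this also retires
	// connections that are in active use, which is the "not great" part above.
	db.SetConnMaxLifetime(25 * time.Second)
}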

I did exactly that myself (for reference, the DB `idle timeout` seems to default to 8h in the case of AWS Aurora; GitHub apparently sets it to 30s. Short!). My earlier article was about what I had investigated at the time, wondering what could be done on the mysql driver side, and it turns out they have since fixed it.

Now, let's get into the details.

The beginning of the blog post says almost the same thing as my "due to the TCP specification, you cannot know whether the peer has closed the connection until you actually send a packet", complete with a TCP state transition diagram.

Under the TCP specification, even if the server sends a FIN packet, that only means the server side will not write any more; the client can still write to the server, and the server can still read and process what it receives. And the TCP protocol has no safe way to tell the client that the server will neither write nor read anything further (e.g. that it has closed the socket).
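As a minimal sketch of this (behavior can vary by OS and timing, so treat it as an illustration rather than a guarantee): a toy server closes the connection right away, yet the client's first write still appears to succeed, and only a later write observes the failure.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Toy server: accept one connection and close it immediately,
	// which makes the kernel send a FIN to the client.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, _ := ln.Accept()
		c.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // give the FIN time to arrive

	// A FIN only says "the peer will not write any more", so the kernel
	// accepts this write; the already-closed peer then answers with an RST.
	_, err = conn.Write([]byte("first write"))
	fmt.Println("first write :", err) // typically <nil>

	time.Sleep(100 * time.Millisecond) // give the RST time to arrive

	// Only now does a write observe the dead connection.
	_, err = conn.Write([]byte("second write"))
	fmt.Println("second write:", err) // typically "broken pipe" or "connection reset by peer"
}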

I will quote the blog below because it explains this well: the TCP characteristic above is not a problem for most protocols, but the MySQL protocol follows a "the client sends a request and the server responds to it" flow, so the client does not read until after it has written.

In most network protocols on top of TCP, this isn’t an issue. The client is performing reads from the server, and as soon as it receives a [SYN, ACK], the next read returns an EOF error, because the Kernel knows that the server won’t write more data to this connection. However, as discussed earlier, once a MySQL connection is in its Command Phase, the MySQL protocol is client-managed. The client only reads from the server after it sends a request, because the server only sends data in response to requests from the client.

By the way, I think this characteristic also applies to HTTP/1.x (excluding pipelining), but as I wrote before in [Understanding the mechanism of canceling http.Request in Go](https://qiita.com/behiron/items/9b6975de6ff470c71e06), Go's http server implementation [spawns a goroutine that keeps reading the socket once the request body has been read to the end](https://github.com/golang/go/blob/f92337422ef2ca27464c198bb3426d2dc4661653/src/net/http/server.go#L675-L727), so it can notice a client-side close while the server is still processing; that, however, is a story about the server side.
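The user-visible face of that server-side mechanism is the request context; a minimal sketch (the handler and the durations are just illustrative), assuming the client disconnects while the handler is still working:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func slowHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(10 * time.Second): // pretend to do slow work
		fmt.Fprintln(w, "done")
	case <-r.Context().Done():
		// Fires when the client goes away mid-request; the background read
		// goroutine mentioned above is what makes this detection possible.
		log.Println("client closed the connection:", r.Context().Err())
	}
}

func main() {
	http.HandleFunc("/slow", slowHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}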

Having heard the story so far, some of you may think: just retry when the error happens. In fact database/sql does have a retry mechanism: if the driver returns `ErrBadConn`, the operation is retried up to maxBadConnRetries (twice), **and if it still fails, a new connection is created without going through the connection pool**.

The following is QueryContext as an example, but every operation in database/sql has a similar retry loop, and on the driver side (go-sql-driver/mysql) there are places where it imports database/sql/driver and returns `driver.ErrBadConn`.

database/sql/driver/driver.go


// ErrBadConn should be returned by a driver to signal to the sql
// package that a driver.Conn is in a bad state (such as the server
// having earlier closed the connection) and the sql package should
// retry on a new connection.
//
// To prevent duplicate operations, ErrBadConn should NOT be returned
// if there's a possibility that the database server might have
// performed the operation. Even if the server sends back an error,
// you shouldn't return ErrBadConn.
var ErrBadConn = errors.New("driver: bad connection")

database/sql/sql.go


// QueryContext executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

So in this case too, if the driver could simply return `ErrBadConn`, there would be no problem in the first place (even if the retries fail, a fresh connection is used instead of one from the pool). But because the error is only discovered at `write` time (unless you build some mechanism like Go's http server implementation, the write is the first point where you notice the server has closed), there are situations where it is not safe to retry.

The case featured in the blog below is exactly the situation described in the `ErrBadConn` comment, To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation, so the driver must not return it.

What would happen if we performed an UPDATE in a perfectly healthy connection, MySQL executed it, and then our network went down before it could reply to us? The Go MySQL driver would also receive an EOF after a valid write. But if it were to return driver.ErrBadConn, database/sql would

So then: before writing, why not do a non-blocking read and return `ErrBadConn` on EOF?

You might think that, and that is exactly what the PR does!

Quite an involved situation, isn't it...

Read the PR

Now let's actually read the PR: packets: Check connection liveness before writing query. Just grasping the direction of the fix in the previous section already feels like plenty, but even though it is a small PR of about 100 lines, I learned a lot from it.

I would like to introduce three points that I learned.

Refer to the raw file descriptor when checking

As organized in the previous section, all the fix has to do is perform a non-blocking read on the socket just before the write and return `ErrBadConn` if the server has already closed the connection.

However, [Go's networking exposes a synchronous API, but internally it is handled as non-blocking I/O](https://qiita.com/takc923/items/de68671ea889d8df6904#%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E5%87%A6%E7%90%86%E3%81%97%E3%81%9F%E6%99%82).

Simply put, there is a mechanism called the netpoller: when a goroutine would block waiting on the network, it is detached from the thread it was running on, socket events are watched asynchronously via system calls such as epoll, and the Go runtime reschedules the goroutine once the socket becomes ready for processing (though I have never read that part of the source).

I think this is a really nice mechanism, but when you are sure the call will not block, as in this case, it is better to issue the system call directly against the raw file descriptor, which is why it is implemented as follows.

I think the reason the check does not explicitly set the socket to non-blocking is that the Go runtime has already put the raw file descriptor into `O_NONBLOCK` mode.

conncheck.go


func connCheck(c net.Conn) error {
	var (
		n    int
		err  error
		buff [1]byte
	)

	sconn, ok := c.(syscall.Conn)
	if !ok {
		return nil
	}
	rc, err := sconn.SyscallConn()
	if err != nil {
		return err
	}
	rerr := rc.Read(func(fd uintptr) bool {
		// Non-blocking read directly on the raw fd; the Go runtime has already
		// put the descriptor into O_NONBLOCK mode.
		n, err = syscall.Read(int(fd), buff[:])
		return true
	})
	switch {
	case rerr != nil:
		return rerr
	case n == 0 && err == nil:
		// A successful zero-byte read means the peer has closed the connection.
		return io.EOF
	case n > 0:
		// The server should not be sending anything before we write a request.
		return errUnexpectedRead
	case err == syscall.EAGAIN || err == syscall.EWOULDBLOCK:
		// Nothing to read: the connection is still alive.
		return nil
	default:
		return err
	}
}

Check as few times as possible

ResetSession is defined as an interface on the database/sql/driver side, and it is called by database/sql when a used connection is returned to the connection pool, which gives the driver implementation a chance to do some work at that point.

In this PR, the driver's implementation of this interface turns on a flag held by the connection; at write time the liveness check runs only when that flag is set, and the flag is turned off after the check.

As a result, the check runs only on the first communication after a connection is taken out of the connection pool. Wow!!

database/sql/driver/driver.go


// SessionResetter may be implemented by Conn to allow drivers to reset the
// session state associated with the connection and to signal a bad connection.
type SessionResetter interface {
	// ResetSession is called while a connection is in the connection
	// pool. No queries will run on this connection until this method returns.
	//
	// If the connection is bad this should return driver.ErrBadConn to prevent
	// the connection from being returned to the connection pool. Any other
	// error will be discarded.
	ResetSession(ctx context.Context) error
}
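As a rough sketch of that flag mechanism (the type and field names here are simplified placeholders, not the driver's actual code; it assumes the connCheck function quoted above):

package mysql

import (
	"context"
	"database/sql/driver"
	"net"
)

// mysqlConn is a simplified placeholder for the driver's connection type.
type mysqlConn struct {
	netConn net.Conn
	reset   bool // set while the connection sits in the pool
}

// ResetSession marks the connection so that the next write performs a
// liveness check.
func (mc *mysqlConn) ResetSession(ctx context.Context) error {
	mc.reset = true
	return nil
}

// writePacket is where a query would be sent. Only the first write after the
// connection comes back from the pool pays the cost of the check.
func (mc *mysqlConn) writePacket(data []byte) error {
	if mc.reset {
		mc.reset = false
		if err := connCheck(mc.netConn); err != nil {
			return driver.ErrBadConn
		}
	}
	_, err := mc.netConn.Write(data)
	return err
}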

Do nothing on Windows

In the PR, the behavior had not been verified on Windows and there was no CI for it, so there was a discussion about how to verify it. The approach taken was to implement the connCheck function in two files, conncheck.go marked with `// +build !windows` and conncheck_windows.go, with the Windows version simply returning nil; the discussion proceeded on that basis. In other words, on Windows the behavior is effectively unchanged.
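Roughly, the Windows-side file looks like this (trimmed to the build tag and the stub; a sketch of the technique rather than the exact PR contents):

// conncheck_windows.go — built only on Windows; the non-Windows
// conncheck.go carries the opposite constraint, `// +build !windows`.

// +build windows

package mysql

import "net"

// connCheck is a no-op on Windows, so behavior there stays as before.
func connCheck(c net.Conn) error {
	return nil
}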

Wow!!

In conclusion

Looking at the PR, the author explained the problem in great detail when first raising it, and even verified the impact on performance and latency, which I found impressive. When I contribute to OSS I tend to keep a low profile, but the author was confident in the fix, applied pressure when the discussion moved slowly, pointed out that such a serious issue was being left open, said they would put up a PR the following week, and it actually got merged with an OK.

The content was wonderful, and it made me feel I should work harder myself, so I wanted to introduce it here.
