Two major bugs in a Go rate-limiting library (and nobody reported them for half a year)

uber-go/ratelimit library

I used to use the juju/ratelimit[2] rate-limiting library, but I did not like its complicated "constructors". Later I tried the uber-go/ratelimit[3] library, found its API fairly simple and pleasant to use, and have been using it ever since. The version at that time was v0.2.0, and I never set the slack parameter, so everything was fine.

Recently, a colleague upgraded this library to the latest version, v0.3.0, while working on a project. We found that after sending packets for a while, the rate limiter suddenly stopped working and the packet rate surged, causing the program to panic.

Reproducing it with a unit test

The problem is easy to reproduce with the following unit test:

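package ratelimit_test

import (
	"fmt"
	"go.uber.org/ratelimit"
	"testing"
	"time"
)

func TestLimiter(t *testing.T) {
	limiter := ratelimit.New(1, ratelimit.Per(time.Second), ratelimit.WithSlack(1))
	for i := 0; i < 25; i++ {
		if i == 1 {
			time.Sleep(2 * time.Second) // idle long enough to reach the slack branch
		}
		limiter.Take()
		fmt.Println(time.Now().Unix(), i) // burst: timestamps stop advancing by 1s
	}
}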

The slack check logic is broken

This unit test deliberately pauses before calling the limiter in the second iteration, giving it a chance to enter the slack branch. The slack design in this library is meant to leave some leeway on top of the rate instead of enforcing the rate strictly, but a bug in the v0.3.0 code breaks the slack check:

func (t *atomicInt64Limiter) Take() time.Time {
	var (
		newTimeOfNextPermissionIssue int64
		now                          int64
	)
	for {
		now = t.clock.Now().UnixNano()
		timeOfNextPermissionIssue := atomic.LoadInt64(&t.state)
		switch {
		case timeOfNextPermissionIssue == 0 || (t.maxSlack == 0 && now-timeOfNextPermissionIssue > int64(t.perRequest)):
			// if this is our first call or t.maxSlack == 0 we need to shrink issue time to now
			newTimeOfNextPermissionIssue = now
		case t.maxSlack > 0 && now-timeOfNextPermissionIssue > int64(t.maxSlack):
			// a lot of nanoseconds passed since the last Take call
			// we will limit max accumulated time to maxSlack
			newTimeOfNextPermissionIssue = now - int64(t.maxSlack)
		default:
			// calculate the time at which our permission was issued
			newTimeOfNextPermissionIssue = timeOfNextPermissionIssue + int64(t.perRequest)
		}
		if atomic.CompareAndSwapInt64(&t.state, timeOfNextPermissionIssue, newTimeOfNextPermissionIssue) {
			break
		}
	}
	sleepDuration := time.Duration(newTimeOfNextPermissionIssue - now)
	if sleepDuration > 0 {
		t.clock.Sleep(sleepDuration)
		return time.Unix(0, newTimeOfNextPermissionIssue)
	}
	// return now if we don't sleep as atomicLimiter does
	return time.Unix(0, now)
}

Principle analysis

Once execution enters the branch case t.maxSlack > 0 && now-timeOfNextPermissionIssue > int64(t.maxSlack):, practically every subsequent call to Take enters the same branch, so Take returns immediately without blocking. This branch is only reachable when slack > 0, which is exactly the default configuration (slack=10). The bug can also be shown by calculation. Suppose the current time when the branch is entered is now1; this call to Take then sets newTimeOfNextPermissionIssue to now1 - int64(t.maxSlack).

On the next call to Take the current time is now2, which is always a little larger than now1, by at least a few nanoseconds. Now evaluate the branch condition now-timeOfNextPermissionIssue > int64(t.maxSlack). It always holds, because now2 - (now1 - int64(t.maxSlack)) = (now2 - now1) + int64(t.maxSlack) > int64(t.maxSlack). As a result, every subsequent Take enters this branch without blocking, the program sends packets at full speed, and it eventually panics.
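The calculation above can be replayed on a fake clock. The following is only a minimal sketch, not the library's code: step, run and workNanos are hypothetical names, atomics and the real clock are left out, and maxSlack is assumed to equal one perRequest interval (1 second).

```go
package main

import "fmt"

// step models the three switch branches of Take quoted above.
// It is a simplified single-goroutine model; all times are in nanoseconds.
func step(state, now, maxSlack, perRequest int64) (next int64, branch string) {
	switch {
	case state == 0 || (maxSlack == 0 && now-state > perRequest):
		return now, "reset"
	case maxSlack > 0 && now-state > maxSlack:
		return now - maxSlack, "slack"
	default:
		return state + perRequest, "default"
	}
}

// run simulates n Take calls on a fake clock. The call with index idleBefore
// is preceded by an idle pause of idleNanos (use -1 for no pause). It returns
// the branch taken by each call and how many calls would have slept.
func run(n, idleBefore int, idleNanos, maxSlack, perRequest int64) (branches []string, blocked int) {
	const workNanos = 1000 // assumed time spent between two calls
	now := perRequest      // arbitrary non-zero start time
	var state int64
	for i := 0; i < n; i++ {
		if i == idleBefore {
			now += idleNanos
		}
		next, branch := step(state, now, maxSlack, perRequest)
		state = next
		if next > now { // Take would sleep until next
			now = next
			blocked++
		}
		branches = append(branches, branch)
		now += workNanos
	}
	return branches, blocked
}

func main() {
	const second = int64(1_000_000_000)
	// One idle pause of 2s before the second call: the limiter never blocks again.
	b, blocked := run(6, 1, 2*second, second, second)
	fmt.Println(b, "blocked:", blocked) // [reset slack slack slack slack slack] blocked: 0
	// No idle pause: every call after the first one blocks as intended.
	b, blocked = run(6, -1, 0, second, second)
	fmt.Println(b, "blocked:", blocked) // [reset default default default default default] blocked: 5
}
```

In this model a single idle period longer than maxSlack is enough to keep every later call in the slack branch, which matches the behavior seen in the unit test.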

Over the weekend I filed an issue for this project, and one of its maintainers fixed it. However, the project's main developer has lost confidence in the v0.3.0 implementation, because it had already had a similar bug: he rolled the implementation back, it was later fixed and merged again, and now another bug has appeared.

Whether or not the author fixes it, be careful when using v0.3.0 of this library, or you may step on this mine.

Another big bug

In fact, we do not care that much about slack, so isn't it enough to use the option that sets slack to 0?

Well, yes. The bug above no longer occurs, and the unit tests run without any problem on my Mac laptop. But! But! But! Another bug appears.

We changed the rate limit to 5000, and on our Linux test machines the program could only reach about 2000, far below expectations. How is that rate limiting? The limit can never even be reached.

My colleague said that downgrading ratelimit to v0.2.0 and not setting slack=0 solves this problem.

This was very strange. After some investigation, we found that the problem might lie in time.Sleep in the Go standard library.

Suppose we use time.Sleep to sleep for 50 microseconds. Before Go 1.16, a Linux machine would actually sleep for about 80 or 90 microseconds; from Go 1.16 on, the actual sleep time on Linux is far longer than requested. On Windows machines there is also a huge gap between requested and actual sleep time, both before and after Go 1.16. I tested on an Apple MacBook Pro M1 and did not see this problem.

This bug is tracked in issue #44343[4]. It was filed in February 2021, almost three years ago, and it is still open; the problem persists. It seems the root cause is not easy to find and fix completely.

So if you use time.Sleep, remember that on Linux its precision is only about 1ms. A rate of 5000 per second means an interval of 200 microseconds between requests, well below that precision, so if the ratelimit library relies on time.Sleep to enforce a limit of 5000 and is not designed carefully, it will not achieve the intended rate.
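To check how your own machine behaves, a small measurement like the one below can help. It is only a sketch: intervalFor and measureSleep are hypothetical helper names, and the result depends on the operating system, the Go version and the machine load, so no expected output is shown for the measured value.

```go
package main

import (
	"fmt"
	"time"
)

// intervalFor returns the spacing between requests needed for the given
// rate per second, for example 200µs for 5000 requests per second.
func intervalFor(rate int) time.Duration {
	return time.Second / time.Duration(rate)
}

// measureSleep calls time.Sleep(d) n times and returns the average time
// each call actually took.
func measureSleep(d time.Duration, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		time.Sleep(d)
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	fmt.Println("interval needed for 5000 per second:", intervalFor(5000))
	avg := measureSleep(50*time.Microsecond, 200)
	fmt.Println("requested 50µs, average actual sleep:", avg)
}
```

If the average actual sleep is larger than the interval printed on the first line, a limiter that sleeps once per request cannot reach the configured rate on that machine.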

Let's summarize

If you use uber-go/ratelimit[5], remember:

Use the older version, v0.2.0
Don't set slack=0; keep the default or set a non-zero value

Actually, there is another, more fundamental reason why I switched from juju/ratelimit to uber-go/ratelimit. juju/ratelimit is a rate limiter based on the token bucket, while uber-go/ratelimit is based on the leaky bucket; put differently, uber-go/ratelimit behaves more like traffic shaping, which fits our scenario better. We want to send data packets at a constant rate, without bursts or sudden rate changes. Our scenario is all about uniform pacing.
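The shaping idea can be sketched in a few lines. This is an illustration of leaky-bucket style pacing without slack, not the code of either library; pace and example are hypothetical names, and times are offsets from an arbitrary start.

```go
package main

import (
	"fmt"
	"time"
)

// pace returns the time at which each packet may leave, given the times at
// which the packets arrive (in non-decreasing order). Packets are spaced at
// least interval apart, and idle time is never saved up for a later burst.
func pace(arrivals []time.Duration, interval time.Duration) []time.Duration {
	sends := make([]time.Duration, len(arrivals))
	var next time.Duration // earliest time the next packet may leave
	for i, a := range arrivals {
		t := a
		if t < next {
			t = next // too early: wait until the previous interval has passed
		}
		sends[i] = t
		next = t + interval
	}
	return sends
}

// example paces three packets that arrive together at time 0 and two more
// that arrive together at 500ms, with a 100ms interval.
func example() []time.Duration {
	ms := time.Millisecond
	return pace([]time.Duration{0, 0, 0, 500 * ms, 500 * ms}, 100*ms)
}

func main() {
	fmt.Println(example()) // [0s 100ms 200ms 500ms 600ms]
}
```

Even though the sender was idle between 200ms and 500ms, the two packets arriving at 500ms are still spaced 100ms apart, which is the uniform behavior we want.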

Of course, you can also use juju/ratelimit[6], a rate-limiting library contributed by Canonical. It is licensed under LGPL 3.0 with additional terms better suited to Go, which is the license Canonical uses for all of its Go projects. It is a token-bucket rate limiter and works fine in practice, but it has not been updated in 4 years. One thing I do not like is that it fills the bucket when it is initialized. As a result, tokens can be taken from the bucket faster than you expect at the beginning, so packets go out too fast at first and only later settle down to a constant rate. That is not the effect I want, and I did not want to patch the library every time, so I forked the project as smallnest/ratelimit[7]. When creating the limiter, you can set the initial number of tokens, for example to zero.
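The effect of a bucket that starts full can be seen with a toy token bucket on a fake clock. This is a minimal sketch under stated assumptions: bucket, newBucket, allow and countAllowed are hypothetical names and are not the API of juju/ratelimit or smallnest/ratelimit.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a toy token bucket driven by a fake clock (offsets from start).
type bucket struct {
	capacity int64
	tokens   int64
	interval time.Duration // time needed to add one token
	last     time.Duration // time of the last refill
}

// newBucket creates a bucket holding initial tokens (clamped to capacity).
func newBucket(capacity, initial int64, interval time.Duration) *bucket {
	if initial > capacity {
		initial = capacity
	}
	return &bucket{capacity: capacity, tokens: initial, interval: interval}
}

// allow refills the bucket up to time now and takes one token if available.
func (b *bucket) allow(now time.Duration) bool {
	if add := int64((now - b.last) / b.interval); add > 0 {
		b.tokens += add
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last += time.Duration(add) * b.interval
	}
	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false
}

// countAllowed tries to take a token every step during [0, window) and
// returns how many attempts succeeded.
func countAllowed(b *bucket, window, step time.Duration) int {
	n := 0
	for now := time.Duration(0); now < window; now += step {
		if b.allow(now) {
			n++
		}
	}
	return n
}

func main() {
	interval := 100 * time.Millisecond // 10 tokens per second
	full := newBucket(10, 10, interval)
	empty := newBucket(10, 0, interval)
	fmt.Println("starts full: ", countAllowed(full, time.Second, time.Millisecond))  // 19
	fmt.Println("starts empty:", countAllowed(empty, time.Second, time.Millisecond)) // 9
}
```

With a nominal rate of 10 per second, the bucket that starts full lets 19 requests through in the first second, roughly twice the nominal rate, while the bucket that starts empty stays at the refill rate from the beginning.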

The Go team also provides an extension library, x/time/rate[8], which is more powerful. The downside of that power is that it is more complicated to use, and the complexity can lead to subtle mistakes, but it is also a reasonable choice after careful evaluation and testing.

References

[1] uber-go/ratelimit: /uber-go/ratelimit
[2] juju/ratelimit: /juju/ratelimit
[3] uber-go/ratelimit: /uber-go/ratelimit/uber-go/ratelimit
[4] issues#44343: /golang/go/issues/44343
[5] uber-go/ratelimit: /uber-go/ratelimit
[6] juju/ratelimit: /juju/ratelimit
[7] smallnest/ratelimit: /smallnest/ratelimit
[8] /x/time/rate: //x/time/rate

There are also some less popular third-party libraries, including rate limiters implemented with sliding windows as well as distributed rate limiters. If you want to know more, please follow my other related articles!