In a personal Go project, I built a system that obtains information on financial assets from Bovespa (the Brazilian stock exchange).
The system makes intensive use of concurrency and parallelism with goroutines, updating asset information (along with business calculations) every 8 seconds.
Initially, no errors or warnings appeared, but I noticed that some goroutines were taking longer than others to execute.
To be specific, while the p99 latency sat at 0.03 ms, at some points it spiked to 0.9 ms. This led me to investigate the problem further.
I discovered that I was using a semaphore-based goroutine pool sized from the GOMAXPROCS variable.
However, I realized there was a problem with this approach.
The problem is that GOMAXPROCS does not reflect the number of cores actually available to the container: the Go runtime sees the host's logical CPUs, so if the container is allowed fewer cores than the VM's total, GOMAXPROCS still defaults to the VM's total. In my case, the VM had 8 cores but the container was limited to 4. As a result, 8 goroutines were scheduled to run at the same time on 4 cores' worth of quota, causing CPU throttling.
After a night of research, I found a library developed by Uber, automaxprocs, that adjusts the GOMAXPROCS variable automatically and correctly, whether or not the process runs in a container. This solution proved to be extremely stable and efficient.
Automatically set GOMAXPROCS to match Linux container CPU quota.
```shell
go get -u go.uber.org/automaxprocs
```

```go
package main

import _ "go.uber.org/automaxprocs"

func main() {
	// Your application logic here.
}
```
Data measured from Uber's internal load balancer. We ran the load balancer with 200% CPU quota (i.e., 2 cores):
| GOMAXPROCS | RPS | P50 (ms) | P99.9 (ms) |
|---|---|---|---|
| 1 | 28,893.18 | 1.46 | 19.70 |
| 2 (equal to quota) | 44,715.07 | 0.84 | 26.38 |
| 3 | 44,212.93 | 0.66 | 30.07 |
| 4 | 41,071.15 | 0.57 | 42.94 |
| 8 | 33,111.69 | 0.43 | 64.32 |
| Default (24) | 22,191.40 | 0.45 | 76.19 |
When GOMAXPROCS is increased above the CPU quota, P50 latency decreases slightly, but P99.9 latency increases significantly. The total RPS handled also decreases.
When GOMAXPROCS is higher than the CPU quota allocated, we also saw significant throttling:
```
$ cat /sys/fs/cgroup/cpu,cpuacct/system.slice/[...]/cpu.stat
nr_periods 42227334
nr_throttled 131923
throttled_time 88613212216618
```
Once GOMAXPROCS was reduced to match the CPU quota, we saw no CPU throttling.
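The throttling counters above come from the cgroup's cpu.stat file. A small helper like the following (hypothetical, not from the original post; it only parses the cgroup v1-style key/value format shown above) can extract a counter such as nr_throttled from the file's contents:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat extracts a named counter (e.g. "nr_throttled")
// from the contents of a cgroup cpu.stat file. It returns the
// value and whether the key was found.
func parseCPUStat(contents, key string) (int64, bool) {
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == key {
			v, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, false
			}
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Sample taken from the cpu.stat output shown above.
	sample := "nr_periods 42227334\nnr_throttled 131923\nthrottled_time 88613212216618\n"
	if v, ok := parseCPUStat(sample, "nr_throttled"); ok {
		fmt.Println("nr_throttled:", v)
	}
}
```

A monitoring job could read the real file periodically and alert when nr_throttled keeps growing, which is a cheap way to catch a misconfigured GOMAXPROCS in production.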
After adopting this library, the problem was resolved, and the p99 latency now holds steady at 0.02 ms. This experience highlighted the importance of observability and profiling in concurrent systems.
The following is a very simple example, but one that demonstrates the difference in performance.
Using Go's native testing and benchmark tooling, I created two files:
benchmarking_with_enhancement_test.go:

```go
package main

import (
	"runtime"
	"sync"
	"testing"

	_ "go.uber.org/automaxprocs"
)

// BenchmarkWithEnhancement appends each loop index to an integer slice,
// limiting concurrency with a semaphore sized from GOMAXPROCS.
func BenchmarkWithEnhancement(b *testing.B) {
	// Number of CPUs available
	numCPUs := runtime.NumCPU()
	// Maximum number of CPUs the program may use
	maxGoroutines := runtime.GOMAXPROCS(numCPUs)
	// Create the semaphore
	semaphore := make(chan struct{}, maxGoroutines)
	var (
		// Waits for the group of goroutines to finish
		wg sync.WaitGroup
		// Protects the shared slice
		mu sync.Mutex
		// Slice that stores the integers
		list []int
	)
	// Loop over one million indices
	for i := 0; i < 1_000_000; i++ {
		wg.Add(1)
		semaphore <- struct{}{} // acquire a slot
		go func(i int) {
			defer wg.Done()
			defer func() { <-semaphore }() // release the slot
			mu.Lock()
			list = append(list, i)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
}
```

benchmarking_without_enhancement_test.go:

```go
package main

import (
	"runtime"
	"sync"
	"testing"
)

// BenchmarkWithoutEnhancement is the same benchmark without automaxprocs:
// it appends each loop index to an integer slice, limiting concurrency
// with a semaphore sized from GOMAXPROCS.
func BenchmarkWithoutEnhancement(b *testing.B) {
	// Number of CPUs available
	numCPUs := runtime.NumCPU()
	// Maximum number of CPUs the program may use
	maxGoroutines := runtime.GOMAXPROCS(numCPUs)
	// Create the semaphore
	semaphore := make(chan struct{}, maxGoroutines)
	var (
		// Waits for the group of goroutines to finish
		wg sync.WaitGroup
		// Protects the shared slice
		mu sync.Mutex
		// Slice that stores the integers
		list []int
	)
	// Loop over one million indices
	for i := 0; i < 1_000_000; i++ {
		wg.Add(1)
		semaphore <- struct{}{} // acquire a slot
		go func(i int) {
			defer wg.Done()
			defer func() { <-semaphore }() // release the slot
			mu.Lock()
			list = append(list, i)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
}
```

The only difference between them is the import of the Uber library.
Running the benchmarks with only 2 CPUs available showed a clear difference in the results.
(ns/op is the average time, in nanoseconds, that a single operation takes.)
Note that my machine has 8 cores in total, which is what runtime.NumCPU() returned. However, I ran the benchmarks restricted to 2 CPUs: the file without automaxprocs set the concurrency limit to 8 goroutines, while the enhanced version limited it to 2. With fewer goroutines contending for the available cores, there is less allocation and scheduling overhead, so execution is more efficient.
So, the importance of observability and profiling of our applications is clear.