Go’s goroutines feel like magic. You launch thousands of them, they work, they scale. Until they do not. Understanding the runtime scheduler explains why goroutines behave the way they do, helps you tune GOMAXPROCS correctly, and gives you the mental model to debug performance problems that otherwise look inexplicable.
The GMP Model
The Go scheduler uses three components: G, M, and P.
G (Goroutine): The unit of work. Contains the function, its stack (starting at 2KB, growing as needed), and scheduling information. You create Gs with go func().
M (Machine/OS Thread): An actual OS thread. The runtime creates Ms on demand - for example, when an existing M blocks in a syscall and its P needs a new thread to keep running goroutines. The runtime caps M at 10,000 by default (adjustable with runtime/debug.SetMaxThreads).
P (Processor): The logical CPU, a scheduling context. Holds a local run queue of goroutines ready to run. The number of Ps equals GOMAXPROCS (default: number of CPU cores). Each P’s local queue holds up to 256 Gs; overflow spills to a shared global queue.
The relationship: each P is attached to at most one M at a time, and an M needs a P to run Gs. When a P’s local queue is empty, it steals from other Ps’ queues (or pulls from the global queue).
G - G - G        G - G - G      (local run queues)
    |                |
P <--> M         P <--> M
    |                |
    Go scheduler (global run queue)
Why This Matters: Context Switching Is Cheap
OS thread context switching costs 1-10 microseconds (saving registers, flushing TLB entries). Goroutine context switching costs around 0.1-0.3 microseconds - it is software-level, managed by the Go runtime.
This is why you can run 100,000 goroutines without the overhead that 100,000 OS threads would create. The OS sees GOMAXPROCS threads. The Go runtime multiplexes goroutines onto those threads.
Preemption: Cooperative vs Cooperative-ish
Early Go versions used purely cooperative scheduling. A goroutine yielded only at function calls. A tight loop with no function calls could block all other goroutines on that P.
// This was a problem in early Go (pre-1.14)
go func() {
    for {
        // Tight loop - never yields
    }
}()
Go 1.14 added asynchronous preemption using signals. The runtime sends SIGURG to the thread running a long-running goroutine, forcing a yield at the next safe point. Tight loops are no longer a problem in modern Go. The scheduler is “cooperative-ish” - goroutines primarily yield at blocking operations, but the runtime can preempt them when needed.
What Causes Goroutine Blocking
When a goroutine blocks, its behavior depends on why:
Network I/O: Go uses the network poller (epoll on Linux, kqueue on macOS). A goroutine waiting for network data is parked without blocking the M. The M continues running other goroutines. When the data arrives, the goroutine is moved back to a run queue.
Syscalls: Some syscalls cannot be made non-blocking. When a goroutine makes a blocking syscall, its M is detached from its P. The P finds (or creates) another M to keep running. When the syscall completes, the M tries to reacquire a P; if none is free, the goroutine goes to the global run queue and the M is parked for reuse.
Channel operations / mutex lock: The goroutine is moved to a wait queue. Its M immediately picks up the next goroutine from the P’s run queue.
This is why Go’s goroutines scale: blocking in most cases does not block the OS thread.
GOMAXPROCS: Tuning Considerations
GOMAXPROCS defaults to the number of CPU cores. This is right for CPU-bound work. For I/O-heavy workloads, the optimal value depends on your blocking pattern.
import "runtime"
runtime.GOMAXPROCS(8) // Set programmatically
Or via environment:
GOMAXPROCS=4 ./myserver
In containerized environments, the runtime historically set GOMAXPROCS from the host’s CPU count, not the container’s CPU limit (Go 1.25 made the default cgroup-aware). A Go process in a container could therefore run more threads of CPU-bound work than its quota allows, triggering CFS throttling. On older Go versions, use the automaxprocs library to fix this:
import _ "go.uber.org/automaxprocs"
// Automatically sets GOMAXPROCS to match container CPU quota
This is one of the most common production performance issues in containerized Go applications.
Goroutine Leaks: The Common Anti-Pattern
Goroutines that block forever leak. They consume memory (a stack starts at 2KB and grows with usage), and the garbage collector never reclaims a goroutine that is still blocked - it has no way to prove the goroutine will not resume.
// LEAK: channel read blocks forever if no sender exists
go func() {
    val := <-ch // If nothing ever sends, this goroutine leaks
    process(val)
}()
// FIX: use select with a done channel or context
go func() {
    select {
    case val := <-ch:
        process(val)
    case <-ctx.Done():
        return
    }
}()
Detecting goroutine leaks in production: expose runtime.NumGoroutine() as a metric. A number that grows monotonically over time indicates leaks. The goleak library is useful in tests.
Work Stealing: Why Load Balances
When a P’s local run queue is empty, it does not wait. It steals half the goroutines from another P’s run queue. This is the mechanism that distributes work automatically across processors.
Implication: you do not need to manually distribute goroutines across workers in most cases. Launching goroutines freely and letting the scheduler distribute them is idiomatic Go.
The Garbage Collector’s Impact on Scheduling
The GC uses goroutines and has a stop-the-world (STW) phase, though it is extremely short in modern Go (sub-millisecond for most workloads). During GC, all goroutines are paused briefly while GC roots are scanned.
High allocation rates increase GC pressure. For latency-sensitive services, use sync.Pool to reuse allocations and profile allocation sites with go tool pprof.
Bottom Line
The GMP model explains why Go can run millions of goroutines efficiently, why container CPU limits need the automaxprocs fix, and why goroutine leaks are a memory issue not a thread issue. You do not need to know this to write Go, but when something goes wrong at scale - goroutine count growing, unexplained latency spikes, GC pressure - the scheduler model is where the answers live. Spend an hour reading the scheduler source or watching Jaana Dogan’s scheduler talks. It is time that compounds.