Context expiration #3
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Seeing requests never produce responses in staging. Basically it means that requests from browsers time out. In the logs I'm seeing:
The general problem is that the context is canceled, which should come about because the browser disconnects.
Now I have a thorough picture of the codebase. Here's my analysis:
────────────────────────────────────────────────────────────────────────────────
Analysis: "Context Canceled" Cascade Failure
Root Cause Chain (most likely)
The problem is a cascading resource exhaustion, triggered by the combination of several design issues. The chain goes:
Critical Issues Found
🔴 1. No HTTP server timeouts — main.go:156-159
No ReadTimeout, WriteTimeout, IdleTimeout, or ReadHeaderTimeout are set. This is the single most impactful finding. Without these:
Go's default http.Server timeouts are all zero (disabled), so unless you set them explicitly, you're vulnerable.
🔴 2. No pgxpool connection limits — db/connection.go:209-220
No MaxConns, MinConns, MaxConnLifetime, or MaxConnIdleTime are configured. Depending on the pgx v5.7 default, this is either:
Either case is dangerous without timeouts. Stale connections also never get recycled (MaxConnLifetime = 0 = forever).
🟡 3. Pool-on-pool with database/sql — db/connection.go:224
stdlib.OpenDBFromPool wraps the pgxpool inside a *sql.DB. The sql.DB has its own defaults (MaxOpenConns=0, MaxIdleConns=2) that can interfere with pgxpool's connection lifecycle. This is a known sharp edge in pgx — the sql.DB wrapper
may hold connections in its idle pool that pgxpool thinks are returned, creating a disconnect.
🟡 4. Permanent connection held for LISTEN/NOTIFY — platform/start.go:158-163
One connection from the pool is permanently acquired for the entire process lifetime (waiting for PostgreSQL NOTIFY events). If the pool has only 4 connections (the default on a small VM), that's 25% of capacity gone.
🟡 5. SSE connections accumulate goroutines — api/event.go
Every browser tab that loads the dashboard opens an SSE connection (/api/events). Each holds a goroutine with a 5s ticker, plus a channel. The connectionsSSE map has no size limit and only cleans up on client disconnect. Multiple tabs
from the same user each get their own goroutine + channel.
🟡 6. Multiple sequential DB queries per "get report" request
platform/publicreport/report.go:byID() executes 4+ sequential DB queries in a single request:
Each query acquires and holds a pool connection. If any query is slow, the request holds a connection longer.
After getting locked, the new pprof debug endpoint shows:
Seems to be many, many instances of this. Notice the entrypoint is "getOAutheForOrg".
Root Cause: Database Connection Pool Exhaustion
The Mechanism (cascade of failures)
In /home/eliribble/src/nidus-sync/platform/start.go:173, listenAndDoOneJob calls db.PGInstance.PGXPool.Acquire(ctx) and then blocks indefinitely on conn.Conn().WaitForNotification(ctx). This connection is never released back to the pool
during normal operation. Goroutine 30 shows this — it's been in IO wait for 162 minutes holding a PostgreSQL connection for LISTEN new_job.
In /home/eliribble/src/nidus-sync/platform/arcgis.go:77-128, refreshFieldseekerData iterates over all orgs and all OAuth tokens and launches a goroutine for each — simultaneously and with no concurrency limiting:
The goroutine dump shows 89 periodicallyExportFieldseeker goroutines (IDs 55–176) and 1 maintainOAuth goroutine (ID 11) — all stuck trying to acquire a DB connection for 133 minutes.
periodicallyExportFieldseeker calls getOAuthForOrg → UsersFromOrganizationID → ExecuteManyTx → PGInstance.PGXPool.Query → pool.Acquire(ctx), which blocks on the puddle semaphore.
maintainOAuth calls OAuthTokenFromID → ExecuteOne → PGInstance.PGXPool.Query → same.
pgxpool defaults to max(4, runtime.NumCPU()). With listenAndDoOneJob permanently holding 1, only 3 connections remain for all other work.
The session store (alexedwards/scs/pgxstore) stores sessions in PostgreSQL. 9 goroutines (e.g., 1307, 1258) are HTTP request handlers stuck at pgxstore.FindCtx → pool.QueryRow → pool.Acquire. So all incoming requests hang.
In refreshFieldseekerData, a workerCtx with a cancel function is created for each iteration of the loop. But cancel() is only called from select blocks at the bottom:
Since neither channel fires during normal operation, cancel() is never called. Even though all the goroutines pass a context to semaphore.Acquire, the context never cancels, so they'll wait forever.
────────────────────────────────────────────────────────────────────────────────
Possible Causes (ordered by likelihood)
1. Unbounded goroutine spawning exhausting a tiny pool ← most likely
The combination of:
...means the 3 remaining connections are immediately exhausted. Once all connections are held, nobody can complete and release, creating a permanent deadlock.
2. Connections leaked by the BobDB / database/sql layer
The code creates a database/sql.DB via stdlib.OpenDBFromPool(db) with MaxOpenConns(0) (unlimited) and MaxIdleConns(0). The database/sql layer calls into pgxpool for each query. If database/sql holds connections open (e.g., due to a
driver bug, or a query that didn't close rows), those connections are permanently removed from the pgxpool's available slots.
3. pgxpool health check / connection lifecycle bug
pgxpool runs a background health check every minute (goroutine 38). If a connection is detected as bad by the health check and destroyed, there might be a race where the semaphore slot isn't properly released, causing a permanent drain
of available slots.
4. pond.NewResultPool(20) adding contention
exportFieldseekerData creates a pond.NewResultPoolSyncStats worker pool. Even though exportFieldseekerLayer currently returns immediately (empty body), if any goroutine path triggers DB operations through BobDB (e.g.,
InsertFieldseekerSyncs, updateSummaryTables), the 20 pond workers would also compete for DB connections.
────────────────────────────────────────────────────────────────────────────────
The Deadlock Timeline
┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time │ Event │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 min │ Server starts. listenAndDoOneJob acquires 1 DB connection (holds permanently). │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ~29 min │ refreshFieldseekerData launches ~90 periodicallyExportFieldseeker goroutines + ~1 maintainOAuth. All try to acquire DB connections at once. 3 get through immediately. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ~29 min │ Those 3 complete their queries and release. 3 more acquire. This cycling continues briefly. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 133 min │ At some point, the 3 remaining slots get permanently occupied (likely by BobDB/database/sql operations or goroutines that don't release). All ~94 goroutines are now permanently stuck. HTTP requests start hanging. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 162 min │ You dump goroutines. The deadlock is fully visible. │
└─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
────────────────────────────────────────────────────────────────────────────────
Recommended Fixes
directly.
I think this is fixed in
44c0356bccStarted hitting this again on staging just now. Grabbed a pprof-3.log. Need to analyze it.
I hit it again, almost immediately after restarting the server on staging:
I took pprof-4.log, need to analyze that one too.
Tzipporah wedged the server again, she says by making a nuisance report, typing in an address "pointe noire department" which is a POI and selecting it, not waiting for it to populate, and then submitting the report.
The staging log since bootup says:
I also created pprof-5.log for this one.
Attached video of Tzipporah reproducing the issue after a reload in staging. The log at the time is:
and I captured pprof-7.log at the time