Context expiration #3

Open
opened 2026-05-18 23:32:39 +00:00 by eliribble · 9 comments
Owner

Seeing requests never produce responses in staging. Basically it means that requests from browsers time out. In the logs I'm seeing:

May 18 23:29:36 nocix-amd-legacy-quadcore nidus-sync[797940]: 11:29PM WRN Responding with an error from api error="get report 'CV4DBHUEUW9P': query report 'CV4DBHUEUW9P': query: txn query: context canceled" status=500 user message="get report 'CV4DBHUEUW9P': query report 'CV4DBHUEUW9P': query: txn query: context canceled"

The general problem is that the context is canceled, which should come about because the browser disconnects.

Seeing requests never produce responses in staging. Basically it means that requests from browsers time out. In the logs I'm seeing: ``` May 18 23:29:36 nocix-amd-legacy-quadcore nidus-sync[797940]: 11:29PM WRN Responding with an error from api error="get report 'CV4DBHUEUW9P': query report 'CV4DBHUEUW9P': query: txn query: context canceled" status=500 user message="get report 'CV4DBHUEUW9P': query report 'CV4DBHUEUW9P': query: txn query: context canceled" ``` The general problem is that the context is canceled, which should come about because the browser disconnects.
Author
Owner

Now I have a thorough picture of the codebase. Here's my analysis:

────────────────────────────────────────────────────────────────────────────────

Analysis: "Context Canceled" Cascade Failure

Root Cause Chain (most likely)

The problem is a cascading resource exhaustion, triggered by the combination of several design issues. The chain goes:

  1. A request handler executes a slow DB query (or an external API call)
  2. The handler blocks, holding a connection from the pgxpool
  3. More requests arrive, each needing a DB connection
  4. The pgxpool is exhausted → new requests block on pool.Acquire()
  5. Browsers time out waiting → request contexts get canceled
  6. All queued requests fail with "context canceled"
  7. The system appears dead until the stuck handler releases its connection

Critical Issues Found

🔴 1. No HTTP server timeouts — main.go:156-159

  server := &http.Server{
      Addr:    config.Bind,
      Handler: r,
  }

No ReadTimeout, WriteTimeout, IdleTimeout, or ReadHeaderTimeout are set. This is the single most impactful finding. Without these:

  • A handler that blocks forever will hold a goroutine + TCP connection forever
  • Keep-alive connections from browsers never time out, accumulating idle goroutines
  • There's no backpressure — slow handlers are never interrupted by the server
  • After N page loads, all available goroutines/connections can be tied up

Go's default http.Server timeouts are all zero (disabled), so unless you set them explicitly, you're vulnerable.

🔴 2. No pgxpool connection limits — db/connection.go:209-220

  config, err := pgxpool.ParseConfig(uri)
  // ... config.AfterConnect set, but nothing else ...
  db, err := pgxpool.NewWithConfig(ctx, config)

No MaxConns, MinConns, MaxConnLifetime, or MaxConnIdleTime are configured. Depending on the pgx v5.7 default, this is either:

  • max(4, runtime.NumCPU()) — small and easily exhausted
  • 0 (unlimited) — grows unbounded, exhausting PostgreSQL server connections

Either case is dangerous without timeouts. Stale connections also never get recycled (MaxConnLifetime = 0 = forever).

🟡 3. Pool-on-pool with database/sql — db/connection.go:224

  bobDB := bob.NewDB(stdlib.OpenDBFromPool(db))

stdlib.OpenDBFromPool wraps the pgxpool inside a *sql.DB. The sql.DB has its own defaults (MaxOpenConns=0, MaxIdleConns=2) that can interfere with pgxpool's connection lifecycle. This is a known sharp edge in pgx — the sql.DB wrapper
may hold connections in its idle pool that pgxpool thinks are returned, creating a disconnect.

🟡 4. Permanent connection held for LISTEN/NOTIFY — platform/start.go:158-163

  conn, err := db.PGInstance.PGXPool.Acquire(ctx)
  // ...
  defer conn.Release()
  // ...
  notification, err := conn.Conn().WaitForNotification(ctx)

One connection from the pool is permanently acquired for the entire process lifetime (waiting for PostgreSQL NOTIFY events). If the pool has only 4 connections (the default on a small VM), that's 25% of capacity gone.

🟡 5. SSE connections accumulate goroutines — api/event.go

Every browser tab that loads the dashboard opens an SSE connection (/api/events). Each holds a goroutine with a 5s ticker, plus a channel. The connectionsSSE map has no size limit and only cleans up on client disconnect. Multiple tabs
from the same user each get their own goroutine + channel.

🟡 6. Multiple sequential DB queries per "get report" request

platform/publicreport/report.go:byID() executes 4+ sequential DB queries in a single request:

  1. ReportFromPublicID (Jet via pgxpool)
  2. loadImagesForReport (Bob via sql.DB)
  3. logEntriesByReportID (Bob — 2 more queries for non-public)
  4. AddressesFromIDs (Jet via pgxpool)

Each query acquires and holds a pool connection. If any query is slow, the request holds a connection longer.

Now I have a thorough picture of the codebase. Here's my analysis: ──────────────────────────────────────────────────────────────────────────────── Analysis: "Context Canceled" Cascade Failure ### Root Cause Chain (most likely) The problem is a cascading resource exhaustion, triggered by the combination of several design issues. The chain goes: 1. A request handler executes a slow DB query (or an external API call) 2. The handler blocks, holding a connection from the pgxpool 3. More requests arrive, each needing a DB connection 4. The pgxpool is exhausted → new requests block on pool.Acquire() 5. Browsers time out waiting → request contexts get canceled 6. All queued requests fail with "context canceled" 7. The system appears dead until the stuck handler releases its connection ### Critical Issues Found #### 🔴 1. No HTTP server timeouts — main.go:156-159 ```go server := &http.Server{ Addr: config.Bind, Handler: r, } ``` No ReadTimeout, WriteTimeout, IdleTimeout, or ReadHeaderTimeout are set. This is the single most impactful finding. Without these: - A handler that blocks forever will hold a goroutine + TCP connection forever - Keep-alive connections from browsers never time out, accumulating idle goroutines - There's no backpressure — slow handlers are never interrupted by the server - After N page loads, all available goroutines/connections can be tied up Go's default http.Server timeouts are all zero (disabled), so unless you set them explicitly, you're vulnerable. #### 🔴 2. No pgxpool connection limits — db/connection.go:209-220 ```go config, err := pgxpool.ParseConfig(uri) // ... config.AfterConnect set, but nothing else ... db, err := pgxpool.NewWithConfig(ctx, config) ``` No MaxConns, MinConns, MaxConnLifetime, or MaxConnIdleTime are configured. Depending on the pgx v5.7 default, this is either: - max(4, runtime.NumCPU()) — small and easily exhausted - 0 (unlimited) — grows unbounded, exhausting PostgreSQL server connections Either case is dangerous without timeouts. Stale connections also never get recycled (MaxConnLifetime = 0 = forever). #### 🟡 3. Pool-on-pool with database/sql — db/connection.go:224 ```go bobDB := bob.NewDB(stdlib.OpenDBFromPool(db)) ``` stdlib.OpenDBFromPool wraps the pgxpool inside a *sql.DB. The sql.DB has its own defaults (MaxOpenConns=0, MaxIdleConns=2) that can interfere with pgxpool's connection lifecycle. This is a known sharp edge in pgx — the sql.DB wrapper may hold connections in its idle pool that pgxpool thinks are returned, creating a disconnect. #### 🟡 4. Permanent connection held for LISTEN/NOTIFY — platform/start.go:158-163 ```go conn, err := db.PGInstance.PGXPool.Acquire(ctx) // ... defer conn.Release() // ... notification, err := conn.Conn().WaitForNotification(ctx) ``` One connection from the pool is permanently acquired for the entire process lifetime (waiting for PostgreSQL NOTIFY events). If the pool has only 4 connections (the default on a small VM), that's 25% of capacity gone. #### 🟡 5. SSE connections accumulate goroutines — api/event.go Every browser tab that loads the dashboard opens an SSE connection (/api/events). Each holds a goroutine with a 5s ticker, plus a channel. The connectionsSSE map has no size limit and only cleans up on client disconnect. Multiple tabs from the same user each get their own goroutine + channel. #### 🟡 6. Multiple sequential DB queries per "get report" request platform/publicreport/report.go:byID() executes 4+ sequential DB queries in a single request: 1. ReportFromPublicID (Jet via pgxpool) 2. loadImagesForReport (Bob via sql.DB) 3. logEntriesByReportID (Bob — 2 more queries for non-public) 4. AddressesFromIDs (Jet via pgxpool) Each query acquires and holds a pool connection. If any query is slow, the request holds a connection longer.
Author
Owner

After getting locked, the new pprof debug endpoint shows:

golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0005b4460, {0x404d5b8, 0xc002a24640}, 0x1)
        golang.org/x/sync@v0.19.0/semaphore/semaphore.go:74 +0x333
github.com/jackc/puddle/v2.(*Pool[...]).acquire(0x40b3700, {0x404d5b8, 0xc002a24640})
        github.com/jackc/puddle/v2@v2.2.2/pool.go:360 +0xb3
github.com/jackc/puddle/v2.(*Pool[...]).Acquire(0x40b3700, {0x404d5b8, 0xc002a24640})
        github.com/jackc/puddle/v2@v2.2.2/pool.go:347 +0x89
github.com/jackc/pgx/v5/pgxpool.(*Pool).Acquire(0xc00031c270, {0x404d5b8?, 0xc002a24640?})
        github.com/jackc/pgx/v5@v5.7.6/pgxpool/pool.go:595 +0x148
github.com/jackc/pgx/v5/stdlib.connector.Connect({{{{0x0, 0x0}, 0x0, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, 0x0, ...}, ...}, ...}, ...)
        github.com/jackc/pgx/v5@v5.7.6/stdlib/sql.go:266 +0x18d
database/sql.(*DB).conn(0xc00031c340, {0x404d5b8, 0xc002a24640}, 0x1)
        database/sql/sql.go:1431 +0x87d
database/sql.(*DB).query(0xc00031c340, {0x404d5b8, 0xc002a24640}, {0xc00644d500, 0x318}, {0xc006b16790, 0x1, 0x1}, 0x0?)
        database/sql/sql.go:1759 +0x57
database/sql.(*DB).QueryContext.func1(0x0?)
        database/sql/sql.go:1742 +0x4f
database/sql.(*DB).retry(0x0?, 0xc005d53bb0)
        database/sql/sql.go:1576 +0x42
database/sql.(*DB).QueryContext(0x48145a?, {0x404d5b8?, 0xc002a24640?}, {0xc00644d500?, 0x1?}, {0xc006b16790?, 0x4029e70?, 0x5d583e0?})
        database/sql/sql.go:1741 +0xc5
github.com/Gleipnir-Technology/bob.DB.QueryContext({0x24d8dc0?}, {0x404d5b8?, 0xc002a24640?}, {0xc00644d500?, 0xc005d53cf8?}, {0xc006b16790?, 0xc005d53cc8?, 0x
d07b1a?})
        github.com/Gleipnir-Technology/bob@v0.42.5/stdlib.go:44 +0x27
github.com/stephenafamo/scan.All[...]({0x404d5b8, 0xc002a24640}, {0x77a2182a0060, 0xc00031c340?}, 0xc000278a50?, {0xc00644d500, 0xc00030edb8?}, {0xc006b16790,
0x1, 0x1})
        github.com/stephenafamo/scan@v0.7.0/exec.go:50 +0x9d
github.com/Gleipnir-Technology/bob.Allx[...]({0x404d5b8, 0xc002a24640}, {0x4026928, 0xc00031c340}, {0x40430d8, 0xc00797de90}, 0xc000278a50)
        github.com/Gleipnir-Technology/bob@v0.42.5/exec.go:167 +0x2f5
github.com/Gleipnir-Technology/bob/orm.Query[...].All(...)
        github.com/Gleipnir-Technology/bob@v0.42.5/orm/query.go:68
github.com/Gleipnir-Technology/nidus-sync/platform.getOAuthForOrg({0x404d5b8, 0xc002a24640}, 0x0?)
        github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:58 +0xbb
github.com/Gleipnir-Technology/nidus-sync/platform.periodicallyExportFieldseeker({0x404d5b8, 0xc002a24640}, 0xc0071ace08)
        github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:486 +0xda
github.com/Gleipnir-Technology/nidus-sync/platform.refreshFieldseekerData.func2()
        github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:119 +0x53
created by github.com/Gleipnir-Technology/nidus-sync/platform.refreshFieldseekerData in goroutine 16
        github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:117 +0x7c5

Seems to be many, many instances of this. Notice the entrypoint is "getOAutheForOrg".

After getting locked, the new pprof debug endpoint shows: ``` golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc0005b4460, {0x404d5b8, 0xc002a24640}, 0x1) golang.org/x/sync@v0.19.0/semaphore/semaphore.go:74 +0x333 github.com/jackc/puddle/v2.(*Pool[...]).acquire(0x40b3700, {0x404d5b8, 0xc002a24640}) github.com/jackc/puddle/v2@v2.2.2/pool.go:360 +0xb3 github.com/jackc/puddle/v2.(*Pool[...]).Acquire(0x40b3700, {0x404d5b8, 0xc002a24640}) github.com/jackc/puddle/v2@v2.2.2/pool.go:347 +0x89 github.com/jackc/pgx/v5/pgxpool.(*Pool).Acquire(0xc00031c270, {0x404d5b8?, 0xc002a24640?}) github.com/jackc/pgx/v5@v5.7.6/pgxpool/pool.go:595 +0x148 github.com/jackc/pgx/v5/stdlib.connector.Connect({{{{0x0, 0x0}, 0x0, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, 0x0, ...}, ...}, ...}, ...) github.com/jackc/pgx/v5@v5.7.6/stdlib/sql.go:266 +0x18d database/sql.(*DB).conn(0xc00031c340, {0x404d5b8, 0xc002a24640}, 0x1) database/sql/sql.go:1431 +0x87d database/sql.(*DB).query(0xc00031c340, {0x404d5b8, 0xc002a24640}, {0xc00644d500, 0x318}, {0xc006b16790, 0x1, 0x1}, 0x0?) database/sql/sql.go:1759 +0x57 database/sql.(*DB).QueryContext.func1(0x0?) database/sql/sql.go:1742 +0x4f database/sql.(*DB).retry(0x0?, 0xc005d53bb0) database/sql/sql.go:1576 +0x42 database/sql.(*DB).QueryContext(0x48145a?, {0x404d5b8?, 0xc002a24640?}, {0xc00644d500?, 0x1?}, {0xc006b16790?, 0x4029e70?, 0x5d583e0?}) database/sql/sql.go:1741 +0xc5 github.com/Gleipnir-Technology/bob.DB.QueryContext({0x24d8dc0?}, {0x404d5b8?, 0xc002a24640?}, {0xc00644d500?, 0xc005d53cf8?}, {0xc006b16790?, 0xc005d53cc8?, 0x d07b1a?}) github.com/Gleipnir-Technology/bob@v0.42.5/stdlib.go:44 +0x27 github.com/stephenafamo/scan.All[...]({0x404d5b8, 0xc002a24640}, {0x77a2182a0060, 0xc00031c340?}, 0xc000278a50?, {0xc00644d500, 0xc00030edb8?}, {0xc006b16790, 0x1, 0x1}) github.com/stephenafamo/scan@v0.7.0/exec.go:50 +0x9d github.com/Gleipnir-Technology/bob.Allx[...]({0x404d5b8, 0xc002a24640}, {0x4026928, 0xc00031c340}, {0x40430d8, 0xc00797de90}, 0xc000278a50) github.com/Gleipnir-Technology/bob@v0.42.5/exec.go:167 +0x2f5 github.com/Gleipnir-Technology/bob/orm.Query[...].All(...) github.com/Gleipnir-Technology/bob@v0.42.5/orm/query.go:68 github.com/Gleipnir-Technology/nidus-sync/platform.getOAuthForOrg({0x404d5b8, 0xc002a24640}, 0x0?) github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:58 +0xbb github.com/Gleipnir-Technology/nidus-sync/platform.periodicallyExportFieldseeker({0x404d5b8, 0xc002a24640}, 0xc0071ace08) github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:486 +0xda github.com/Gleipnir-Technology/nidus-sync/platform.refreshFieldseekerData.func2() github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:119 +0x53 created by github.com/Gleipnir-Technology/nidus-sync/platform.refreshFieldseekerData in goroutine 16 github.com/Gleipnir-Technology/nidus-sync/platform/arcgis.go:117 +0x7c5 ``` Seems to be many, many instances of this. Notice the entrypoint is "getOAutheForOrg".
Author
Owner

Root Cause: Database Connection Pool Exhaustion

The Mechanism (cascade of failures)

  1. listenAndDoOneJob permanently holds one pool connection

In /home/eliribble/src/nidus-sync/platform/start.go:173, listenAndDoOneJob calls db.PGInstance.PGXPool.Acquire(ctx) and then blocks indefinitely on conn.Conn().WaitForNotification(ctx). This connection is never released back to the pool
during normal operation. Goroutine 30 shows this — it's been in IO wait for 162 minutes holding a PostgreSQL connection for LISTEN new_job.

  1. refreshFieldseekerData launches unbounded goroutines

In /home/eliribble/src/nidus-sync/platform/arcgis.go:77-128, refreshFieldseekerData iterates over all orgs and all OAuth tokens and launches a goroutine for each — simultaneously and with no concurrency limiting:

  for _, oauth := range oauths {
      go func() { maintainOAuth(workerCtx, &oauth) }()
  }
  for _, org := range orgs {
      go func() { periodicallyExportFieldseeker(workerCtx, org) }()
  }

The goroutine dump shows 89 periodicallyExportFieldseeker goroutines (IDs 55–176) and 1 maintainOAuth goroutine (ID 11) — all stuck trying to acquire a DB connection for 133 minutes.

  1. Every goroutine immediately hits the DB on startup

periodicallyExportFieldseeker calls getOAuthForOrg → UsersFromOrganizationID → ExecuteManyTx → PGInstance.PGXPool.Query → pool.Acquire(ctx), which blocks on the puddle semaphore.

maintainOAuth calls OAuthTokenFromID → ExecuteOne → PGInstance.PGXPool.Query → same.

  1. The pgxpool defaults to max 4 connections

pgxpool defaults to max(4, runtime.NumCPU()). With listenAndDoOneJob permanently holding 1, only 3 connections remain for all other work.

  1. HTTP request session lookups also need DB connections → blocked

The session store (alexedwards/scs/pgxstore) stores sessions in PostgreSQL. 9 goroutines (e.g., 1307, 1258) are HTTP request handlers stuck at pgxstore.FindCtx → pool.QueryRow → pool.Acquire. So all incoming requests hang.

  1. No context cancellation to break the deadlock

In refreshFieldseekerData, a workerCtx with a cancel function is created for each iteration of the loop. But cancel() is only called from select blocks at the bottom:

  select {
  case <-ctx.Done():       // only fires on server shutdown
      cancel()
      wg.Wait()
      return
  case <-newOauthCh:       // only fires when new OAuth tokens arrive
      cancel()
      wg.Wait()
  }

Since neither channel fires during normal operation, cancel() is never called. Even though all the goroutines pass a context to semaphore.Acquire, the context never cancels, so they'll wait forever.

────────────────────────────────────────────────────────────────────────────────

Possible Causes (ordered by likelihood)

1. Unbounded goroutine spawning exhausting a tiny pool ← most likely

The combination of:

  • pgxpool defaulting to only 4 connections
  • 1 held permanently by LISTEN/NOTIFY
  • 100+ goroutines all trying to DB-query simultaneously
  • No rate limiting or bounded concurrency
  • HTTP request sessions also needing DB connections

...means the 3 remaining connections are immediately exhausted. Once all connections are held, nobody can complete and release, creating a permanent deadlock.

2. Connections leaked by the BobDB / database/sql layer

The code creates a database/sql.DB via stdlib.OpenDBFromPool(db) with MaxOpenConns(0) (unlimited) and MaxIdleConns(0). The database/sql layer calls into pgxpool for each query. If database/sql holds connections open (e.g., due to a
driver bug, or a query that didn't close rows), those connections are permanently removed from the pgxpool's available slots.

3. pgxpool health check / connection lifecycle bug

pgxpool runs a background health check every minute (goroutine 38). If a connection is detected as bad by the health check and destroyed, there might be a race where the semaphore slot isn't properly released, causing a permanent drain
of available slots.

4. pond.NewResultPool(20) adding contention

exportFieldseekerData creates a pond.NewResultPoolSyncStats worker pool. Even though exportFieldseekerLayer currently returns immediately (empty body), if any goroutine path triggers DB operations through BobDB (e.g.,
InsertFieldseekerSyncs, updateSummaryTables), the 20 pond workers would also compete for DB connections.

────────────────────────────────────────────────────────────────────────────────

The Deadlock Timeline

┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time │ Event │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 min │ Server starts. listenAndDoOneJob acquires 1 DB connection (holds permanently). │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ~29 min │ refreshFieldseekerData launches ~90 periodicallyExportFieldseeker goroutines + ~1 maintainOAuth. All try to acquire DB connections at once. 3 get through immediately. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ~29 min │ Those 3 complete their queries and release. 3 more acquire. This cycling continues briefly. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 133 min │ At some point, the 3 remaining slots get permanently occupied (likely by BobDB/database/sql operations or goroutines that don't release). All ~94 goroutines are now permanently stuck. HTTP requests start hanging. │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 162 min │ You dump goroutines. The deadlock is fully visible. │
└─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

────────────────────────────────────────────────────────────────────────────────

Recommended Fixes

  1. Use a dedicated connection for LISTEN/NOTIFY — don't take a connection from the shared pool. Either use a separate single-connection pool for the listener, or increase MaxConns to account for the permanent listener.
  2. Bound the concurrency of background goroutines — instead of launching a goroutine per org, use a worker pool (e.g., pond with a max of 2-3 workers for DB operations) or a semaphore to limit concurrent DB access.
  3. Set MaxConns explicitly — pgxpool.ParseConfig(uri) returns a config; set config.MaxConns to a value that accounts for listener + session lookups + background workers. Something like config.MaxConns = 20 would be far safer.
  4. Add context timeouts — each periodicallyExportFieldseeker call should use a context with a deadline (e.g., context.WithTimeout(workerCtx, 30*time.Second)) so that DB operations can fail and retry rather than hang forever.
  5. Check BobDB/database/sql connection lifecycle — verify that connections acquired through the database/sql wrapper are released properly after each operation. Consider removing the database/sql layer entirely and using pgxpool
    directly.
Root Cause: Database Connection Pool Exhaustion ### The Mechanism (cascade of failures) 1. listenAndDoOneJob permanently holds one pool connection In /home/eliribble/src/nidus-sync/platform/start.go:173, listenAndDoOneJob calls db.PGInstance.PGXPool.Acquire(ctx) and then blocks indefinitely on conn.Conn().WaitForNotification(ctx). This connection is never released back to the pool during normal operation. Goroutine 30 shows this — it's been in IO wait for 162 minutes holding a PostgreSQL connection for LISTEN new_job. 2. refreshFieldseekerData launches unbounded goroutines In /home/eliribble/src/nidus-sync/platform/arcgis.go:77-128, refreshFieldseekerData iterates over all orgs and all OAuth tokens and launches a goroutine for each — simultaneously and with no concurrency limiting: ```go for _, oauth := range oauths { go func() { maintainOAuth(workerCtx, &oauth) }() } for _, org := range orgs { go func() { periodicallyExportFieldseeker(workerCtx, org) }() } ``` The goroutine dump shows 89 periodicallyExportFieldseeker goroutines (IDs 55–176) and 1 maintainOAuth goroutine (ID 11) — all stuck trying to acquire a DB connection for 133 minutes. 3. Every goroutine immediately hits the DB on startup periodicallyExportFieldseeker calls getOAuthForOrg → UsersFromOrganizationID → ExecuteManyTx → PGInstance.PGXPool.Query → pool.Acquire(ctx), which blocks on the puddle semaphore. maintainOAuth calls OAuthTokenFromID → ExecuteOne → PGInstance.PGXPool.Query → same. 4. The pgxpool defaults to max 4 connections pgxpool defaults to max(4, runtime.NumCPU()). With listenAndDoOneJob permanently holding 1, only 3 connections remain for all other work. 5. HTTP request session lookups also need DB connections → blocked The session store (alexedwards/scs/pgxstore) stores sessions in PostgreSQL. 9 goroutines (e.g., 1307, 1258) are HTTP request handlers stuck at pgxstore.FindCtx → pool.QueryRow → pool.Acquire. So all incoming requests hang. 6. No context cancellation to break the deadlock In refreshFieldseekerData, a workerCtx with a cancel function is created for each iteration of the loop. But cancel() is only called from select blocks at the bottom: ```go select { case <-ctx.Done(): // only fires on server shutdown cancel() wg.Wait() return case <-newOauthCh: // only fires when new OAuth tokens arrive cancel() wg.Wait() } ``` Since neither channel fires during normal operation, cancel() is never called. Even though all the goroutines pass a context to semaphore.Acquire, the context never cancels, so they'll wait forever. ──────────────────────────────────────────────────────────────────────────────── Possible Causes (ordered by likelihood) ### 1. Unbounded goroutine spawning exhausting a tiny pool ← most likely The combination of: - pgxpool defaulting to only 4 connections - 1 held permanently by LISTEN/NOTIFY - 100+ goroutines all trying to DB-query simultaneously - No rate limiting or bounded concurrency - HTTP request sessions also needing DB connections ...means the 3 remaining connections are immediately exhausted. Once all connections are held, nobody can complete and release, creating a permanent deadlock. ### 2. Connections leaked by the BobDB / database/sql layer The code creates a database/sql.DB via stdlib.OpenDBFromPool(db) with MaxOpenConns(0) (unlimited) and MaxIdleConns(0). The database/sql layer calls into pgxpool for each query. If database/sql holds connections open (e.g., due to a driver bug, or a query that didn't close rows), those connections are permanently removed from the pgxpool's available slots. ### 3. pgxpool health check / connection lifecycle bug pgxpool runs a background health check every minute (goroutine 38). If a connection is detected as bad by the health check and destroyed, there might be a race where the semaphore slot isn't properly released, causing a permanent drain of available slots. ### 4. pond.NewResultPool(20) adding contention exportFieldseekerData creates a pond.NewResultPool[SyncStats](20) worker pool. Even though exportFieldseekerLayer currently returns immediately (empty body), if any goroutine path triggers DB operations through BobDB (e.g., InsertFieldseekerSyncs, updateSummaryTables), the 20 pond workers would also compete for DB connections. ──────────────────────────────────────────────────────────────────────────────── The Deadlock Timeline ┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Time │ Event │ ├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 0 min │ Server starts. listenAndDoOneJob acquires 1 DB connection (holds permanently). │ ├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ ~29 min │ refreshFieldseekerData launches ~90 periodicallyExportFieldseeker goroutines + ~1 maintainOAuth. All try to acquire DB connections at once. 3 get through immediately. │ ├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ ~29 min │ Those 3 complete their queries and release. 3 more acquire. This cycling continues briefly. │ ├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 133 min │ At some point, the 3 remaining slots get permanently occupied (likely by BobDB/database/sql operations or goroutines that don't release). All ~94 goroutines are now permanently stuck. HTTP requests start hanging. │ ├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ 162 min │ You dump goroutines. The deadlock is fully visible. │ └─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ──────────────────────────────────────────────────────────────────────────────── Recommended Fixes 1. Use a dedicated connection for LISTEN/NOTIFY — don't take a connection from the shared pool. Either use a separate single-connection pool for the listener, or increase MaxConns to account for the permanent listener. 2. Bound the concurrency of background goroutines — instead of launching a goroutine per org, use a worker pool (e.g., pond with a max of 2-3 workers for DB operations) or a semaphore to limit concurrent DB access. 3. Set MaxConns explicitly — pgxpool.ParseConfig(uri) returns a config; set config.MaxConns to a value that accounts for listener + session lookups + background workers. Something like config.MaxConns = 20 would be far safer. 4. Add context timeouts — each periodicallyExportFieldseeker call should use a context with a deadline (e.g., context.WithTimeout(workerCtx, 30*time.Second)) so that DB operations can fail and retry rather than hang forever. 5. Check BobDB/database/sql connection lifecycle — verify that connections acquired through the database/sql wrapper are released properly after each operation. Consider removing the database/sql layer entirely and using pgxpool directly.
Author
Owner

I think this is fixed in 44c0356bcc

I think this is fixed in 44c0356bccf372a6a8ce0e3f2f8a72bd8f33d5eb
Author
Owner

Started hitting this again on staging just now. Grabbed a pprof-3.log. Need to analyze it.

Started hitting this again on staging just now. Grabbed a pprof-3.log. Need to analyze it.
Author
Owner

I hit it again, almost immediately after restarting the server on staging:

May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: Stopped Nidus Sync Webserver.
May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: nidus-sync-webserver.service: Consumed 10.247s CPU time, 161.1M memory peak, 1.6M written to disk, 304.9K incoming IP traffic, 1.3M outgoing IP traffic.
May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver.
May 25 01:31:48 nocix-amd-legacy-quadcore nidus-sync[965359]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-25T01:31:48Z","message":"Starting"}
May 25 01:31:48 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit"
May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Starting pprof debug server address=localhost:6060
May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connection to email websocket begun
May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connection to email websocket established alias=6995512187571f372f032ddc
May 25 01:31:50 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Serving HTTP requests address=127.0.0.1:10000
May 25 01:32:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:32AM INF extracted nuisance uploads len=0
May 25 01:32:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:32AM INF Created base report id=164 public_id=NUCUY6BXBRST
May 25 01:33:17 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM INF extracted nuisance uploads len=0
May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM WRN Failed to match district error="location for address: query address: execute query: context canceled"
May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM ERR rollback error="timeout: context already done: context canceled"
May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM WRN Responding with an error from api error="create nuisance report: contact empty: querycomms: query contact: execute query: context canceled" status=500 user message="create nuisance report: contact empty: querycomms: query contact: execute query: context canceled"
May 25 01:34:50 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:34:50 context canceled
May 25 01:34:57 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:34:57 context canceled
May 25 01:35:36 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:35:36 context canceled
May 25 01:35:38 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:35:38 context canceled

I took pprof-4.log, need to analyze that one too.

I hit it again, almost immediately after restarting the server on staging: ``` May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: Stopped Nidus Sync Webserver. May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: nidus-sync-webserver.service: Consumed 10.247s CPU time, 161.1M memory peak, 1.6M written to disk, 304.9K incoming IP traffic, 1.3M outgoing IP traffic. May 25 01:31:48 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver. May 25 01:31:48 nocix-amd-legacy-quadcore nidus-sync[965359]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-25T01:31:48Z","message":"Starting"} May 25 01:31:48 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit" May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Starting pprof debug server address=localhost:6060 May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connection to email websocket begun May 25 01:31:49 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Connection to email websocket established alias=6995512187571f372f032ddc May 25 01:31:50 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:31AM INF Serving HTTP requests address=127.0.0.1:10000 May 25 01:32:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:32AM INF extracted nuisance uploads len=0 May 25 01:32:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:32AM INF Created base report id=164 public_id=NUCUY6BXBRST May 25 01:33:17 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM INF extracted nuisance uploads len=0 May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM WRN Failed to match district error="location for address: query address: execute query: context canceled" May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM ERR rollback error="timeout: context already done: context canceled" May 25 01:33:52 nocix-amd-legacy-quadcore nidus-sync[965359]: 1:33AM WRN Responding with an error from api error="create nuisance report: contact empty: querycomms: query contact: execute query: context canceled" status=500 user message="create nuisance report: contact empty: querycomms: query contact: execute query: context canceled" May 25 01:34:50 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:34:50 context canceled May 25 01:34:57 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:34:57 context canceled May 25 01:35:36 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:35:36 context canceled May 25 01:35:38 nocix-amd-legacy-quadcore nidus-sync[965359]: 2026/05/25 01:35:38 context canceled ``` I took pprof-4.log, need to analyze that one too.
Author
Owner

Tzipporah wedged the server again, she says by making a nuisance report, typing in an address "pointe noire department" which is a POI and selecting it, not waiting for it to populate, and then submitting the report.

The staging log since bootup says:

May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: Stopped Nidus Sync Webserver.
May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: nidus-sync-webserver.service: Consumed 1.986s CPU time, 144.8M memory peak, 72.5K incoming IP traffic, 311.6K outgoing IP traffic.
May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver.
May 25 01:37:43 nocix-amd-legacy-quadcore nidus-sync[965490]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-25T01:37:43Z","message":"Starting"}
May 25 01:37:43 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit"
May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Starting pprof debug server address=localhost:6060
May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connection to email websocket begun
May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connection to email websocket established alias=6995512187571f372f032ddc
May 25 01:37:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Serving HTTP requests address=127.0.0.1:10000
May 25 01:38:29 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0
May 25 01:38:29 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=165 public_id=6QATPX49T3K7
May 25 01:38:37 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0
May 25 01:38:37 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=166 public_id=BG5A6UX35WS7
May 25 01:38:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0
May 25 01:38:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=167 public_id=7NKX3FNNRG5V
May 25 01:39:58 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:39AM WRN Responding with an error from api error="bygid: insert address: query address from gid: execute one: txn query: context deadline exceeded" status=500 user message="bygid: insert address: query address from gid: execute one: txn query: context deadline exceeded"
Tzipporah wedged the server again, she says by making a nuisance report, typing in an address "pointe noire department" which is a POI and selecting it, not waiting for it to populate, and then submitting the report. The staging log since bootup says: ``` May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: Stopped Nidus Sync Webserver. May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: nidus-sync-webserver.service: Consumed 1.986s CPU time, 144.8M memory peak, 72.5K incoming IP traffic, 311.6K outgoing IP traffic. May 25 01:37:43 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver. May 25 01:37:43 nocix-amd-legacy-quadcore nidus-sync[965490]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-25T01:37:43Z","message":"Starting"} May 25 01:37:43 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit" May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Starting pprof debug server address=localhost:6060 May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connection to email websocket begun May 25 01:37:44 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Connection to email websocket established alias=6995512187571f372f032ddc May 25 01:37:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:37AM INF Serving HTTP requests address=127.0.0.1:10000 May 25 01:38:29 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0 May 25 01:38:29 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=165 public_id=6QATPX49T3K7 May 25 01:38:37 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0 May 25 01:38:37 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=166 public_id=BG5A6UX35WS7 May 25 01:38:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF extracted nuisance uploads len=0 May 25 01:38:45 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:38AM INF Created base report id=167 public_id=7NKX3FNNRG5V May 25 01:39:58 nocix-amd-legacy-quadcore nidus-sync[965490]: 1:39AM WRN Responding with an error from api error="bygid: insert address: query address from gid: execute one: txn query: context deadline exceeded" status=500 user message="bygid: insert address: query address from gid: execute one: txn query: context deadline exceeded" ```
eliribble self-assigned this 2026-05-25 01:41:54 +00:00
Author
Owner

I also created pprof-5.log for this one.

I also created pprof-5.log for this one.
Author
Owner

Attached video of Tzipporah reproducing the issue after a reload in staging. The log at the time is:

May 27 22:51:50 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver.
May 27 22:51:50 nocix-amd-legacy-quadcore nidus-sync[1029099]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-27T22:51:50Z","message":"Starting"}
May 27 22:51:50 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit"
May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Starting pprof debug server address=localhost:6060
May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connection to email websocket begun
May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connection to email websocket established alias=6995512187571f372f032ddc
May 27 22:51:52 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Serving HTTP requests address=127.0.0.1:10000

and I captured pprof-7.log at the time

Attached video of Tzipporah reproducing the issue after a reload in staging. The log at the time is: ``` May 27 22:51:50 nocix-amd-legacy-quadcore systemd[1]: Started Nidus Sync Webserver. May 27 22:51:50 nocix-amd-legacy-quadcore nidus-sync[1029099]: {"level":"info","environment":"PRODUCTION","is-prod":true,"revision":"1f32ba6a5c138a31b77e955be913985101b36a97","build_time":"2026-05-24 23:09:45 +0000 UTC","is modified":false,"Version":"2026-05-24.1f32ba6","Revision":"1f32ba6a5c138a31b77e955be913985101b36a97","BuildDate":"20260524230945","time":"2026-05-27T22:51:50Z","message":"Starting"} May 27 22:51:50 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connected to database dsn="PostgreSQL 17.9 on x86_64-pc-linux-gnu, compiled by clang version 21.1.7, 64-bit" May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Starting pprof debug server address=localhost:6060 May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connection to email websocket begun May 27 22:51:51 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Connection to email websocket established alias=6995512187571f372f032ddc May 27 22:51:52 nocix-amd-legacy-quadcore nidus-sync[1029099]: 10:51PM INF Serving HTTP requests address=127.0.0.1:10000 ``` and I captured pprof-7.log at the time
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Gleipnir/nidus-sync#3
No description provided.