Stop release jobs hanging 6h on failure (gate tmate, add timeouts) #21

Merged: jcschaff merged 1 commit into main from fix-release-tmate-hang on Jun 24, 2026

@jcschaff

Problem

The 0.0.16 release run ended as cancelled after 6h17m. Root cause: the `Setup tmate` step ran with `if: failure()`, which opens an interactive debug session that blocks the job until the 6h runner limit. When the PyPI publish step failed (the project exceeded PyPI's 10GB project-size limit), tmate held 7 jobs open for 6 hours each until the whole run hit the cap and was cancelled.

Fix

  • Gate the tmate step behind a `workflow_dispatch` boolean input `debug_tmate` (default false). On a release event the input is false, so a failed job fails fast instead of hanging. The session is still available when you run the workflow manually with the input enabled.
  • Bound the tmate step with `timeout-minutes: 30`.
  • Add a job-level `timeout-minutes: 180` as defense-in-depth (successful builds take ~35–60 min).
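
The three changes above can be sketched as a workflow fragment. This is a minimal illustration, not the repository's actual file: the job name `build`, the `release` trigger types, the placeholder build script, and the use of `mxschmitt/action-tmate@v3` for the `Setup tmate` step are all assumptions. Only the `debug_tmate` input and the two `timeout-minutes` values come from the description.

```yaml
# Hypothetical sketch of release-main; names other than debug_tmate and the
# timeout values are assumptions made for illustration.
name: release-main

on:
  release:
    types: [published]          # assumed trigger type
  workflow_dispatch:
    inputs:
      debug_tmate:
        description: "Open a tmate debug session if a job fails"
        type: boolean
        default: false

jobs:
  build:                        # assumed job name
    runs-on: ubuntu-latest
    timeout-minutes: 180        # job-level cap; successful builds take ~35-60 min
    steps:
      - uses: actions/checkout@v4

      - name: Build and publish
        run: ./build_and_publish.sh   # placeholder for the real build/publish steps

      - name: Setup tmate
        # Runs only after a failure AND on a manual dispatch with debug_tmate enabled.
        # On release events the inputs context has no debug_tmate, so this is skipped.
        if: ${{ failure() && github.event_name == 'workflow_dispatch' && inputs.debug_tmate }}
        uses: mxschmitt/action-tmate@v3   # assumed tmate action
        timeout-minutes: 30               # bound the debug session itself
```

The explicit `github.event_name` check is redundant with the input defaulting to false, but it makes the intent obvious to a reader: a release-triggered run can never open a session.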

Not addressed here

The underlying publish failure (PyPI 10GB storage limit) is a separate issue and is being handled by pruning old releases. This PR only ensures that a publish failure surfaces immediately instead of burning 6h × N jobs.

🤖 Generated with Claude Code

The release-main workflow ran a `Setup tmate` step on any failure
(`if: failure()`), which opens an interactive debug session that blocks the job
until the 6h runner limit. During the 0.0.16 release the PyPI publish step
failed (project size > 10GB), and tmate then held 7 jobs open for 6h each until
the whole run was cancelled.

- Gate the tmate step behind a `workflow_dispatch` boolean input `debug_tmate`
  (default false), so release events never open a session and failures fail
  fast; the session is still available for manual debug runs.
- Bound the tmate step with `timeout-minutes: 30`.
- Add a job-level `timeout-minutes: 180` as defense-in-depth (successful builds
  take ~35-60 min).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jcschaff merged commit 87347be into main on Jun 24, 2026
6 checks passed
jcschaff deleted the fix-release-tmate-hang branch on June 24, 2026 12:26