Fix/18 auto continue on usage reset #32

Open
irmorteza wants to merge 7 commits into Neokil:main from irmorteza:fix/18-auto-continue-on-usage-reset

@irmorteza

Contributor

Summary

  • add quota_monitor.go: background daemon tracks quotaReached flag and probes the provider every 15 minutes to detect when quota has reset
  • block workers via waitIfQuotaReached() when quota is hit and resume automatically
  • re-queue jobs on ErrTokensExhausted instead of marking them failed
  • add ProbeProvider() to Orchestrator for lightweight LLM ping
  • add FlowStatusRescheduled status for tickets waiting on quota reset
  • add rescheduled to OpenAPI schema for future WebUI support

Testing

  • go test ./...
  • set provider to bash -c 'echo "usage limit reached" >&2; exit 1' and run a ticket
  • observe ticket status changes to rescheduled and job re-queues automatically
  • after 20 minutes observe quota monitor probes and resumes workers

~/.auto-pr/config.yaml

provider: fake-exhausted

providers:
  fake-exhausted:
    command: bash
    args:
      - -c
      - 'echo "usage limit reached" >&2; exit 1'

Notes

  • system jobs (cleanup) blocked during quota pause will be addressed in a follow-up issue

Known Limitations

When a ticket is waiting for a quota reset, all workers are paused, including those running system jobs (cleanup, cleanup-all). This will be fixed in a follow-up issue by splitting the job queue into llmJobs and systemJobs channels with dedicated worker pools.
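A rough sketch of the planned split, assuming two channels named after the llmJobs and systemJobs mentioned above. The `Job` type, the `runPools` helper, and the `gate` callback (standing in for waitIfQuotaReached) are hypothetical and only illustrate why system jobs would no longer be blocked.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical job descriptor for this sketch.
type Job struct{ Name string }

// runPools starts one worker pool per channel. Only LLM workers call
// gate, so a quota pause cannot hold back system jobs.
func runPools(llmJobs, systemJobs <-chan Job, llmWorkers, sysWorkers int,
	gate func(), handle func(kind string, j Job)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < llmWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range llmJobs {
				gate() // may block while the quota is exhausted
				handle("llm", j)
			}
		}()
	}
	for i := 0; i < sysWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range systemJobs {
				handle("system", j)
			}
		}()
	}
	return &wg
}

func main() {
	llmJobs := make(chan Job, 1)
	systemJobs := make(chan Job, 1)
	quotaReset := make(chan struct{}) // closed when the simulated quota resets
	systemDone := make(chan struct{})

	wg := runPools(llmJobs, systemJobs, 1, 1,
		func() { <-quotaReset },
		func(kind string, j Job) {
			fmt.Println(kind, "job ran:", j.Name)
			if kind == "system" {
				close(systemDone)
			}
		})

	llmJobs <- Job{Name: "ticket-1"}
	systemJobs <- Job{Name: "cleanup"}
	<-systemDone      // cleanup finishes while the LLM worker is still paused
	close(quotaReset) // quota resets, the LLM job proceeds
	close(llmJobs)
	close(systemJobs)
	wg.Wait()
}
```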

Comment thread internal/server/jobs.go Outdated
persistErr := s.persistTicketFailure(repoID, repoRoot, ticket, repoRt, job, err)
if persistErr != nil {
return fmt.Errorf("%w (also failed to persist ticket failure: %w)", err, persistErr)
if errors.Is(err, providers.ErrTokensExhausted) {

Owner

This is already handled inside StartFlow -> runState or MoveToState -> transitionTo -> runState, so the ticket is updated twice.

Contributor Author

Sorry for the late reply — I was busy last week.
If you mean the persistTicketScheduled that I used a couple of lines above this block — you're right, I found it wasn't correct and I removed it. But if you're pointing to persistTicketFailure, that one already existed. By the way, I double-checked it and I don't think they're redundant — they operate at two different layers with different responsibilities.

One more thing: I think this block should check if !errors.Is(err, providers.ErrTokensExhausted), because a quota error isn't exactly an error state. In that case it should return to workerLoop so the requeueing is handled by the top-level caller.
It might even be a good idea to handle quota as a non-error across the whole process.

Owner

It might even be a good idea to handle quota as a non-error across the whole process.

So you mean it is more of a state for the provider interface that needs to be checked in the workerLoop and can be handled there instead of bubbling up as an error?

Contributor Author

Exactly, your point about the provider is spot on. Reaching the quota just means the window has closed, and jobs should wait until it opens again.
Of course, you know how to handle it, but let me work on this idea a little more tomorrow. I'll be back soon.

Contributor Author

1. As mentioned, quota status is a property of the provider, so it makes sense for the provider to handle it too.
2. The current approach still works when only a single provider is running.
3. The current quotaReached flag is shared among all providers. If the infrastructure ever allows running multiple providers, there is no way around handling it at the provider level.
4. One more thing worth mentioning: the flow goes through workerLoop -> executeJob -> StartFlow -> runState -> o.Provider.Execute. The status of the job, ticket, or state can change at any of these steps, but the only value returned is an error, and I don't think a single error can carry all of that status (compare an HTTP request, which can have a wide range of responses). So it would not scale to future needs either.
So I think it would be nice to return a ProviderResult struct to allow flexible handling at all levels:

type ProviderResult struct {
    Stderr       error
    QuotaReached bool
    RawOutput    string
    // ...
}

I saw that there is already an ExecuteResult; we could use it, or introduce something new like:

type SampleResult struct {
    ExecuteResult // embedded; QuotaReached would be defined in ExecuteResult
    // ...
}

Owner

Sounds like it could make it all much more flexible, let's do it.

@Neokil Neokil left a comment

Owner

One comment on duplicate state saves. Also, please run golangci-lint run and make sure the issues it reports (like empty lines) are resolved.

@irmorteza irmorteza marked this pull request as draft June 18, 2026 16:21
@irmorteza irmorteza marked this pull request as ready for review June 18, 2026 22:57
@irmorteza irmorteza requested a review from Neokil June 18, 2026 22:57