Troubleshooting Workers

(This page needs fleshing out. Ideally we have a runbook where we can go from symptom to solution.)

Active workers

We can see recent worker activity in the provisioners UI. For instance, if we’re looking at gecko-3/b-linux, we could go here. Click a column header to change the sort order. Sorting by the task started column shows which spot instance workers are still around and running tasks.

Drilling down into an individual worker, e.g. mac-v3-signing2, shows the status of recent tasks that ran on that worker. This helps you tell whether a single worker is broken, or whether the problem spans the whole pool or multiple pools.
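The single-worker versus pool-wide judgment can be sketched in code. This is a minimal illustration, not a tool we ship: the record shape (`worker_id`, `state`), the function names, and the 50% threshold are all assumptions made for the example, and the task data would have to be copied from the UI or fetched separately.

```python
from collections import defaultdict


def failure_rate_by_worker(task_runs):
    """Map worker_id -> fraction of its recent runs that did not complete."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in task_runs:
        totals[run["worker_id"]] += 1
        if run["state"] != "completed":
            failures[run["worker_id"]] += 1
    return {worker: failures[worker] / totals[worker] for worker in totals}


def classify(task_runs, bad_threshold=0.5):
    """Return (verdict, bad_workers) for a list of recent task runs.

    bad_threshold is a hypothetical cutoff: a worker is "bad" when at
    least that fraction of its recent runs failed or hit an exception.
    """
    rates = failure_rate_by_worker(task_runs)
    bad = sorted(w for w, rate in rates.items() if rate >= bad_threshold)
    if not bad:
        return "healthy", bad
    if len(bad) * 2 > len(rates):
        # More than half of the workers look bad: suspect the pool.
        return "pool-wide", bad
    return "isolated", bad


# Hypothetical data: only worker "b" is failing.
recent_runs = [
    {"worker_id": "a", "state": "completed"},
    {"worker_id": "a", "state": "completed"},
    {"worker_id": "b", "state": "failed"},
    {"worker_id": "b", "state": "exception"},
    {"worker_id": "c", "state": "completed"},
]
verdict, bad_workers = classify(recent_runs)
```

An "isolated" verdict points towards quarantining the listed workers; a "pool-wide" verdict points towards the pool config or the image.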

Worker Manager errors

The worker manager UI lists the various pools. It shows current capacity and pending tasks, as well as links to view the workers and recent worker manager error messages, for each pool.

Drilling down into each workerType should show you the config. Generally this will be an expanded version of the ci-configuration worker-pools config.

Common types of error messages

Instance Creation Error: Error calling AWS API: There is no Spot capacity available that matches your request.

We get emails about this. This is also visible in the Worker Manager UI.

This means we’ve run out of spot instance capacity in a given region. A single occurrence of this error is generally informational, not actionable. Many occurrences, combined with too many pending tasks and not enough capacity, may mean we need to adjust the pool’s instance types or regions.
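As a rough sketch of that rule of thumb, the check below only flags the error as actionable when it is both frequent and accompanied by a backlog. The function name and both thresholds (`min_errors`, `backlog_ratio`) are hypothetical; the inputs would come from the error count, pending tasks, and current capacity shown in the Worker Manager UI.

```python
def spot_capacity_needs_action(error_count, pending_tasks, current_capacity,
                               min_errors=10, backlog_ratio=2.0):
    """Return True when spot-capacity errors look actionable.

    Both thresholds are illustrative assumptions:
    - min_errors: how many recent errors count as "many"
    - backlog_ratio: pending tasks per unit of capacity that counts as a backlog
    """
    repeated = error_count >= min_errors
    # max(..., 1) avoids treating a pool with zero capacity as having no backlog.
    backlog = pending_tasks > backlog_ratio * max(current_capacity, 1)
    return repeated and backlog
```

For example, one error with a large backlog, or many errors with no backlog, both return False; only the combination returns True.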

Instance Creation Error: Error calling AWS API: We currently do not have sufficient $INSTANCE_TYPE capacity in the Availability Zone you requested ($AVAILABILITY_ZONE). Our system will be working on provisioning additional capacity. You can currently get $INSTANCE_TYPE capacity by not specifying an Availability Zone in your request or choosing $ALTERNATE_AVAILABILITY_ZONES.

We get emails about this. This is also visible in the Worker Manager UI.

This means this instanceType is not supported in this availability zone at all. We choose which availability zone to use randomly and retry on failure, so this is probably not fatal, but it can be noisy.

To stop getting this error message, we can mark a given instanceType family as invalid in a given availability zone here.
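The random-zone-with-retry behaviour, and the effect of marking a family as invalid in a zone, can be illustrated as follows. This is a sketch under stated assumptions, not the actual Worker Manager or ci-configuration logic: the `INVALID` set, its example entry, and all function names are hypothetical, and `try_launch` stands in for the real AWS call.

```python
import random

# Hypothetical denylist of (instance family, availability zone) pairs.
INVALID = {("c5d", "us-east-1e")}


def instance_family(instance_type):
    """Return the family part of an instance type, e.g. 'c5d' for 'c5d.xlarge'."""
    return instance_type.split(".", 1)[0]


def pick_zone(instance_type, zones, invalid=INVALID, rng=random):
    """Pick a random zone, skipping zones marked invalid for this family."""
    family = instance_family(instance_type)
    candidates = [z for z in zones if (family, z) not in invalid]
    if not candidates:
        raise ValueError(f"no valid availability zone for {instance_type}")
    return rng.choice(candidates)


def launch_with_retry(instance_type, zones, try_launch, max_attempts=5, rng=random):
    """Try random zones until try_launch succeeds; return the zone or None."""
    for _ in range(max_attempts):
        zone = pick_zone(instance_type, zones, rng=rng)
        if try_launch(instance_type, zone):
            return zone
    return None


zones = ["us-east-1a", "us-east-1e"]
# Stand-in for the real API call: pretend only us-east-1a has capacity.
chosen = launch_with_retry("c5d.xlarge", zones, lambda itype, zone: zone == "us-east-1a")
```

Without the denylist entry, roughly half of the attempts in this example would hit the failing zone and produce the noisy error before a retry succeeded.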

Quarantining workers

Go to the worker view, e.g. mac-v3-signing2. Make sure you’re logged in (top right). Click on the three dots in the lower right-hand corner. Choose a date to quarantine until and quarantine the worker. A date 1000 years in the future is generally ok, as long as we unquarantine hardware workers once they’re ready to be put back into the pool. The machine should stop taking tasks after the current task resolves.
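If you need to produce a far-future date by hand, for instance when scripting this instead of using the UI, a minimal sketch looks like the following. The helper name is hypothetical, and the `quarantineUntil` field name is an assumption about the request body the quarantine call expects; verify it against the Queue API reference before relying on it.

```python
from datetime import datetime, timedelta, timezone


def quarantine_until(now, years=1000):
    """Return an ISO 8601 UTC timestamp roughly `years` years after `now`.

    Uses 365-day years, so the result lands slightly short of the exact
    calendar date; that does not matter for an effectively permanent quarantine.
    """
    return (now + timedelta(days=365 * years)).isoformat()


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Assumed payload shape; the field name is not confirmed by this page.
payload = {"quarantineUntil": quarantine_until(now)}
```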

Unquarantining workers

Go to the worker view, e.g. mac-v3-signing2. Make sure you’re logged in (top right). Click on the three dots in the lower right-hand corner and choose to update the quarantine. Pick a date in the past and save it. The worker should start claiming tasks within a minute or so if it’s running and any tasks are pending.
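Setting a past date works because, as far as we can tell, a quarantine is just a timestamp that is compared against the current time. The sketch below illustrates that assumed comparison; `is_quarantined` is a hypothetical helper, not part of any Taskcluster client.

```python
from datetime import datetime, timedelta, timezone


def is_quarantined(quarantine_until, now):
    """A worker counts as quarantined only while the date is still in the future."""
    return quarantine_until is not None and quarantine_until > now


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
still_quarantined = is_quarantined(now + timedelta(days=365), now)  # future date
released = not is_quarantined(now - timedelta(days=1), now)         # past date
```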