This page lists some known issues with the FirefoxCI cluster, what their symptoms are, and how to resolve them.
Actions are broken after modifying .taskcluster.yml
Anytime someone modifies
.taskcluster.yml, the hooks need to be
re-generated since they depend on the hash of this file.
Action tasks will fail to run. Sheriffs will likely be the first to notice this and will close the trees due to retriggers and backfills not working.
Re-generate the hooks with ci-admin. The easiest way is to land a change to the ci-configuration
repo, though it can also be done manually (see CI-Admin). Once the
deployment completes in Jenkins (see #releng-notifications), actions will work again.
We wanted to avoid action hooks having an intermediary task, a la:
hook -> intermediary task clones repo, triggers hook task through .taskcluster.yml -> hook task runs the desired action
Instead, we went with:
hash-named hook with .taskcluster.yml baked in -> hook task runs the desired action
This means that whenever .taskcluster.yml changes, we need to rebuild the hooks.
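As an illustration of why a change to .taskcluster.yml invalidates the deployed hooks, here is a minimal sketch of hash-based hook naming. The function names, the `action/hash` layout, the choice of SHA-256 and the truncation length are all assumptions made for this example; the naming scheme ci-admin actually uses may differ.

```python
import hashlib


def hook_id_for(action_name: str, tc_yml_bytes: bytes, hash_length: int = 10) -> str:
    """Build a hypothetical hook id from an action name and the file contents.

    Any byte-level change to .taskcluster.yml yields a different digest, and
    therefore a different hook id.
    """
    digest = hashlib.sha256(tc_yml_bytes).hexdigest()
    return f"{action_name}/{digest[:hash_length]}"


def hooks_are_stale(action_name: str, tc_yml_bytes: bytes, deployed_hook_ids: set) -> bool:
    """Return True when no deployed hook matches the current file contents."""
    return hook_id_for(action_name, tc_yml_bytes) not in deployed_hook_ids
```

Under this scheme, a lookup for the hook matching the new file contents finds nothing until the hooks are rebuilt, which matches the symptom of action tasks failing to run.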
Missing Scopes after Branch Rename
Some repos have specific scopes associated with a named branch, so, e.g., if we’re renaming
master -> main, there may be failures.
Tasks start failing due to scopes errors.
Scan the ci-configuration repo for your project and see if we were granting any special scopes to the old branch. If so, update the name to the new branch and land.
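The scan can be done with any text search tool. As a rough sketch, the helper below walks a local checkout and reports every YAML line that mentions the old branch name. The function name and the assumption that the relevant configuration lives in `*.yml` / `*.yaml` files are illustrative, and a plain substring match will also report unrelated uses of the word, so the results still need a manual review.

```python
from pathlib import Path


def find_branch_references(config_root, branch, patterns=("*.yml", "*.yaml")):
    """Return (relative_path, line_number, stripped_line) for each mention of `branch`."""
    root = Path(config_root)
    hits = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            text = path.read_text(encoding="utf-8")
            for lineno, line in enumerate(text.splitlines(), start=1):
                # Plain substring match: simple, but may over-report.
                if branch in line:
                    hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

For example, `find_branch_references("path/to/ci-configuration", "master")` would list candidate lines to review before updating them to the new branch name.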
Missing Mercurial features when Cloning the Repo
After updating the version of Mercurial used on newer branches, tasks on older branches might start failing with errors like:
abort: repository requires features unknown to this Mercurial: revlog-compression-zstd!
This happens because repos cloned via newer versions of Mercurial are often incompatible with older versions of Mercurial. Since workers share a checkout cache between tasks, if a worker first claims a task on, e.g., mozilla-central, clones the repo, and then claims a task on mozilla-release, the latter task might fail if it is using an older Mercurial.
The solution is to increment the cache identifier associated with the checkout cache. For example, the Gecko decision tasks use this cache name. Changing the name (e.g. incrementing v2 -> v3) will ensure workers don’t re-use the same cache across disparate branches.
The downside to doing this is that tasks on older branches with less traffic will have more cache misses, resulting in longer runtimes which could impact our ability to ship expediently. To mitigate this, consider backporting the image bump that caused the Mercurial upgrade to beta, release and esr branches.
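To make the version bump concrete, here is a small sketch that increments the last `v<N>` component of a cache name. The helper and the example cache names are hypothetical; the real cache name is defined in the task configuration and is normally edited by hand.

```python
import re

# Matches a standalone "v<digits>" token, e.g. the "v2" in "example-checkouts-v2".
_VERSION = re.compile(r"(?<![A-Za-z0-9])v(\d+)(?![A-Za-z0-9])")


def bump_cache_version(cache_name: str) -> str:
    """Return `cache_name` with its last v<N> token replaced by v<N+1>."""
    matches = list(_VERSION.finditer(cache_name))
    if not matches:
        raise ValueError(f"no version token found in {cache_name!r}")
    last = matches[-1]
    bumped = f"v{int(last.group(1)) + 1}"
    return cache_name[: last.start()] + bumped + cache_name[last.end():]
```

Because the new name does not match any cache already on disk, workers start from a fresh checkout instead of re-using one created by a different Mercurial version.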
Workers are spinning up slowly
First, see Troubleshooting Workers.
Second, as of 2022.01.26, we have had a number of issues with worker-manager. A single process iterates over the worker pools one by one to spin workers up and down on demand. The Azure workers, in particular, take a long time to spin up and down, and these operations block other pools from being spun up or down. There’s not much we can do here except adjust our idle times, which isn’t an ideal solution. Otherwise we wait for the Taskcluster team to fix the issues:
RFC issue Revisit worker idle time shutdown
(marked as dup) Investigate running multiple worker-managers
Workers Not Spawning After Image Bustage
If there’s a problem in a worker image, worker-manager may not spawn any new workers even after the issue is fixed. This happens because the workers with the problematic image are still running, even though they are unable to claim tasks. However, worker-manager doesn’t know this, so it won’t spawn any new workers until the broken ones expire or are terminated.
These problematic workers won’t show up in the Taskcluster Web UI, as the queue service is unaware of workers until they claim a task.
Backlogs will persist even after fixing a worker image. This will be most noticeable on pools with a low max capacity (like Decision pools), as they are more likely to get entirely filled with broken workers (in which case no further tasks would run).
Run this script in braindump to automatically scan for and terminate these broken workers: https://hg.mozilla.org/build/braindump/file/tip/taskcluster/terminate_broken_workers.py
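The idea behind detecting such workers can be sketched as follows: a worker that worker-manager reports as running, that has existed for longer than some grace period, and that the queue has never seen is a likely candidate for termination. This is not the logic of the braindump script; the record fields (`state`, `workerGroup`, `workerId`, `created`), the function name and the 30 minute grace period are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def find_broken_workers(provisioned, claimed_ids, now, grace=timedelta(minutes=30)):
    """Return provisioned workers that look stuck.

    provisioned: worker records as reported by worker-manager (assumed shape).
    claimed_ids: set of (workerGroup, workerId) pairs known to the queue.
    now:         timezone-aware datetime used as the reference time.
    grace:       how long a fresh worker gets before it is considered stuck.
    """
    broken = []
    for worker in provisioned:
        if worker["state"] != "running":
            continue  # stopped or still-requested workers are not our concern
        if (worker["workerGroup"], worker["workerId"]) in claimed_ids:
            continue  # the queue has seen this worker, so it can claim tasks
        if now - worker["created"] >= grace:
            broken.append(worker)
    return broken
```

The grace period matters because a healthy worker also goes unseen by the queue until it claims its first task, so terminating too eagerly would kill workers that are merely still booting.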