Known Problems
This page contains some known issues with the FirefoxCI cluster, their symptoms, and how to resolve them.
Actions are broken after modifying .taskcluster.yml
Anytime someone modifies .taskcluster.yml, the hooks need to be re-generated since they depend on the hash of this file.
Symptoms
Action tasks will fail to run. Sheriffs will likely be the first to notice this and will close the trees due to retriggers and backfills not working.
Solution
Re-run ci-admin. The easiest way is to land a change to the ci-configuration repo, though it can also be done manually (see CI-Admin). Once the deployment completes in Jenkins (see #releng-notifications), actions will be working again.
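As a quick sanity check while waiting for the deployment, you can compare the hash of the current .taskcluster.yml against the hash embedded in the in-tree action hook IDs. The sketch below assumes the hooks embed the first ten hex digits of a sha256 of the file's contents, which matches how Gecko's taskgraph derives the hash at the time of writing; treat it as illustrative rather than authoritative:

```python
# Illustrative only: compute the hash that (as an assumption) appears in the
# in-tree action hook IDs, so it can be compared against the hooks currently
# deployed in the firefox-ci hooks service.
import hashlib


def hash_taskcluster_yml(path=".taskcluster.yml"):
    # Assumption: hook IDs embed the first 10 hex digits of a sha256 of the file.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:10]


if __name__ == "__main__":
    print(hash_taskcluster_yml())
```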
History
We wanted to avoid action hooks having an intermediary task, a la:
hook ->
intermediary task clones repo, triggers hook task through .taskcluster.yml ->
hook task runs the desired action
Instead, we went with:
hash-named hook with .taskcluster.yml baked in ->
hook task runs the desired action
This means that whenever .taskcluster.yml changes, we need to rebuild the hooks.
Missing Scopes after Branch Rename
Some repos have specific scopes associated with a named branch, so e.g. if we're changing master -> main, there may be failures.
Symptoms
Tasks start failing due to scope errors.
Solution
Scan the ci-configuration repo for your project and see if we were granting any special scopes to the old branch. If so, update the name to the new branch and land.
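If you want to confirm what scopes a branch role currently expands to, the Taskcluster auth service can expand an assume: scope for you. The role IDs below are hypothetical placeholders, so substitute your project's repository and branch names; this is a diagnostic sketch rather than part of the official workflow:

```python
# Diagnostic sketch: expand the scopes granted to the old and new branch roles
# to see whether grants in ci-configuration still target the old branch name.
# Role IDs are hypothetical; adjust them to your repository.
import taskcluster

auth = taskcluster.Auth({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})

roles = [
    "repo:github.com/mozilla-mobile/example:branch:master",  # old branch (hypothetical)
    "repo:github.com/mozilla-mobile/example:branch:main",    # new branch (hypothetical)
]

for role_id in roles:
    expanded = auth.expandScopes({"scopes": [f"assume:{role_id}"]})
    print(f"{role_id}: {len(expanded['scopes'])} scopes")
```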
Missing Mercurial features when Cloning the Repo
Symptoms
After updating the version of Mercurial used on newer branches, tasks on older branches might start failing with errors like:
abort: repository requires features unknown to this Mercurial: revlog-compression-zstd!
This happens because repos cloned via newer versions of Mercurial are often incompatible with older versions of Mercurial. Since workers share a checkout cache between tasks, if a worker first claims a task on, e.g., mozilla-central, clones the repo, and then claims a task on mozilla-release, the latter task might fail if it is using an older Mercurial.
Solution
The solution is to increment the cache identifier associated with the checkout cache. For example, the Gecko decision tasks use this cache name. Changing the name (e.g. incrementing v2 -> v3) will ensure workers don't re-use the same cache across disparate branches.
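For illustration only (the real cache names are constructed in the Gecko taskgraph transforms and may be formatted differently), a versioned cache name might look like the sketch below, where bumping the version constant is all it takes to force a fresh cache:

```python
# Illustrative sketch, not the actual Gecko transform: a versioned checkout
# cache name. Bumping CACHE_VERSION makes workers create a fresh cache rather
# than reusing one produced by a newer Mercurial on another branch.
CACHE_VERSION = "v3"  # bumped from "v2" after the Mercurial upgrade


def checkout_cache_name(trust_domain: str, level: int) -> str:
    # Hypothetical format; the real in-tree name may differ.
    return f"{trust_domain}-level-{level}-checkouts-{CACHE_VERSION}"


print(checkout_cache_name("gecko", 3))  # -> "gecko-level-3-checkouts-v3"
```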
The downside to doing this is that tasks on older branches with less traffic will have more cache misses, resulting in longer runtimes which could impact our ability to ship expediently. To mitigate this, consider backporting the image bump that caused the Mercurial upgrade to beta, release and esr branches.
Workers are spinning up slowly
First, see Troubleshooting Workers.
Second, as of 2022.01.26, we have had a number of issues with worker-manager. There is a single process that works through the worker pools one by one to spin workers up and down on demand. The Azure workers, in particular, take a long time to spin up and down, and these slow operations block spinning other pools up or down. There's not much we can do here except adjust our idle times, which isn't an ideal solution. Otherwise, we wait for the Taskcluster team to fix these issues:
Bug 1735411 - windows 2004 test worker backlog (gecko-t/win10-64-2004)
Bug 1741946 - Investigate the best way to go about the Windows 10 Azure worker delays
RFC issue UI to visualize and reprioritize scheduled+pending tasks
RFC issue Revisit worker idle time shutdown
(marked as dup) Investigate running multiple worker-managers
worker-manager: Azure workers register when state != REQUESTED
Combine worker-manager provisioner and worker scanner in to a single processes
Provide worker counts and capacity by state for worker pools
Measure and improve performance of the worker query in provisioning loop
Workers Not Spawning After Image Bustage
If there’s a problem in a worker image, worker-manager may not spawn any new workers even after the issue is fixed. This happens because the workers with the problematic image are still running, even though they are unable to claim tasks. However, worker-manager doesn’t know this, so it won’t spawn any new workers until the broken ones expire or are terminated.
These problematic workers won’t show up in the Taskcluster Web UI, as the queue service is unaware of workers until they claim a task.
Symptoms
Backlogs will persist even after fixing a worker image. This will be most noticeable on pools with a low max capacity (like Decision pools), as they are more likely to get entirely filled with broken workers (in which case no further tasks would run).
Solution
Run this script in braindump to automatically scan for and terminate these broken workers: https://hg.mozilla.org/build/braindump/file/tip/taskcluster/terminate_broken_workers.py
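If running the braindump script isn’t an option, something along these lines (a simplified sketch against the worker-manager API, not a copy of that script) can terminate every worker in an affected pool so that replacements spawn with the fixed image. The pool ID is a placeholder, and your credentials need the relevant worker-manager scopes:

```python
# Simplified sketch (not the braindump script): terminate all workers in a
# pool via the worker-manager API so replacements spawn with the fixed image.
# Assumes TASKCLUSTER_ROOT_URL, TASKCLUSTER_CLIENT_ID and TASKCLUSTER_ACCESS_TOKEN
# are set in the environment and grant the needed worker-manager scopes.
import taskcluster

POOL_ID = "gecko-1/decision"  # hypothetical worker pool; adjust as needed

wm = taskcluster.WorkerManager(taskcluster.optionsFromEnvironment())

token = None
while True:
    query = {"continuationToken": token} if token else {}
    resp = wm.listWorkersForWorkerPool(POOL_ID, query=query)
    for worker in resp["workers"]:
        print(f"terminating {worker['workerGroup']}/{worker['workerId']}")
        wm.removeWorker(POOL_ID, worker["workerGroup"], worker["workerId"])
    token = resp.get("continuationToken")
    if not token:
        break
```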
push(MSIX) fails: “push to Store aborted: pending submission found”
pushmsixscript pushes Firefox to the Microsoft Store. The Store rejects any new submission if there is a pending submission (one which has been uploaded but not yet released). Release Management has asked that pushmsixscript not delete pending submissions, in case that pending submission was created manually.
Symptoms
The push(MSIX) task fails with Exception status. The task log shows “push to Store aborted: pending submission found” and “ERROR - There is a pending submission for this application on the Microsoft Store. Wait for the pending submission to complete, or delete the pending submission. Then retry this task.”
Solution
Delete the pending submission from the Store manually; Release Management has access. Once the pending submission has been deleted, re-run the failed push(MSIX) task.